Given a homoglyph, I want a Rust function to convert it to the nearest ASCII character.
All of these Unicode "a"s
A Α А Ꭺ ᗅ ᴀ ꓮ A
should be converted to:
a a a a a a a a a a a a a a a a a a a a a a a a a a a a a
I tried this but it didn't work:
let input = "A Α А Ꭺ ᗅ ᴀ ꓮ A ";
let normalized = input.nfc().collect::<String>(); // normalize using NFC
let result = normalized.to_lowercase(); // convert to lower case
println!("{}", result);
It output:
a α а ꭺ ᗅ ᴀ ꓮ a
I assume you `use unicode_normalization::UnicodeNormalization;` for `.nfc()`? (Always nice to mention these things.)
According to the relevant standard annex (UAX #15), that will only do "Canonical Decomposition, followed by Canonical Composition". From what I understand of the jargon, that means it only changes how grapheme clusters are represented by characters, not how they're supposed to be rendered.
What you want is probably the "Compatibility Decomposition", which, as indicated here, includes substitutions like ℌ → H. The Compatibility Decomposition is available through `.nfkc()` or `.nfkd()` in the `unicode_normalization` crate.
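One caveat: compatibility decomposition covers compatibility variants such as ℌ or fullwidth letters, but letters from other scripts (Greek Α, Cyrillic А, Cherokee Ꭺ, and so on) have no decomposition to Latin A, so `.nfkc()` alone will likely leave most of the characters in the question unchanged. For those, some kind of look-alike table is needed; the Unicode confusables data (UTS #39) is the usual source. Below is a minimal std-only sketch of the table approach. The helper names `to_ascii_lookalike` and `normalize` are hypothetical, and the table is hand-written for the question's sample characters only, not a complete mapping.

```rust
/// Hypothetical helper: map a handful of "A" look-alikes to ASCII 'a'.
/// The table is hand-written for illustration and is far from complete;
/// a real solution would be generated from the Unicode confusables data.
fn to_ascii_lookalike(c: char) -> char {
    match c {
        'Α'   // Greek capital alpha
        | 'А' // Cyrillic capital a
        | 'Ꭺ' // Cherokee letter that resembles A
        | 'ᗅ' // Canadian Aboriginal syllabic that resembles A
        | 'ᴀ' // Latin letter small capital A
        | 'ꓮ' // Lisu letter A
        => 'a',
        // Everything else: lowercase ASCII letters, leave the rest untouched.
        _ => c.to_ascii_lowercase(),
    }
}

/// Apply the look-alike mapping to every character of the input.
fn normalize(input: &str) -> String {
    input.chars().map(to_ascii_lookalike).collect()
}

fn main() {
    let input = "A Α А Ꭺ ᗅ ᴀ ꓮ A";
    println!("{}", normalize(input)); // prints "a a a a a a a a"
}
```

If you also want compatibility variants folded, one option is to run `.nfkc()` first and apply the table afterwards.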