Transliterate ambiguous unicode characters visually instead of phonetically


How do I replace Unicode characters that look like Latin characters with their ASCII equivalents?

I am trying to build a chat filter, but people can circumvent it by using unusual characters that are still human-readable.

I have tried libraries like unidecode and anyascii, but they seem to prioritize phonetically similar characters over visually similar ones.

Example

Ideally it should handle all kinds of zalgo or "font" characters, but here is an example with Cyrillic letters:

from unidecode import unidecode

s = "FRЕЕ DISСОRD NIТRO" # contains some Cyrillic letters (Е,С,О,Т)
print(unidecode(s))      # FREE DISSORD NITRO (phonetically transliterated)

This would still defeat a chat filter, since the filter looks for "DISCORD" but the transliterated text contains "DISSORD".

Potential solution?

VS Code seems to use a very effective library to detect these characters, but I was unable to find out exactly how it does so. I only found that VS Code uses TextMate for syntax highlighting. As you can see, it can highlight these "ambiguous Unicode characters" and also suggest a replacement character:

[Screenshot: VS Code ambiguous character detection feature]

Can I use this VS Code library to filter my own strings?


There are 2 answers

Rob Napier On

The technical term you're looking for is a confusable. Unicode provides a nice sample app (along with a Java implementation) in their utilities, and a data file describing the official list of confusables.
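To make that concrete, here is a minimal sketch of turning the confusables data into a lookup table. It assumes the `source ; target ; type # comment` line layout used by the data file, with code points written in hex. The names `parse_confusables`, `replace_confusables`, and `SAMPLE` are made up for illustration, and the sample lines are hand-written in that layout rather than copied from the file, so check them against the real data before relying on them.

```python
def parse_confusables(lines):
    """Build {source_char: target_string} from confusables-style lines."""
    table = {}
    for line in lines:
        # Drop the trailing comment, surrounding whitespace, and a leading BOM.
        line = line.split("#", 1)[0].strip().lstrip("\ufeff")
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < 2:
            continue
        # Each field is one or more hex code points separated by spaces.
        source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
        target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
        table[source] = target
    return table


def replace_confusables(text, table):
    """Replace every character that has an entry in the table."""
    return "".join(table.get(ch, ch) for ch in text)


# Hand-written sample lines (assumption: same layout as the real data file).
SAMPLE = [
    "# illustrative entries only",
    "0415 ;\t0045 ;\tMA\t# CYRILLIC CAPITAL LETTER IE -> LATIN CAPITAL LETTER E",
    "0421 ;\t0043 ;\tMA\t# CYRILLIC CAPITAL LETTER ES -> LATIN CAPITAL LETTER C",
    "041E ;\t004F ;\tMA\t# CYRILLIC CAPITAL LETTER O -> LATIN CAPITAL LETTER O",
    "0422 ;\t0054 ;\tMA\t# CYRILLIC CAPITAL LETTER TE -> LATIN CAPITAL LETTER T",
]

table = parse_confusables(SAMPLE)
print(replace_confusables("FR\u0415\u0415 DIS\u0421\u041eRD NI\u0422RO", table))
```

With the real file you would pass the opened file object (UTF-8) instead of `SAMPLE`. Note that a target can be more than one character, so the output may be longer than the input.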

Looking at the VSCode source, it looks like Microsoft uses a package vscode-unicode-data to generate a dictionary of ambiguous characters from the Unicode data, along with a few hand-written additions.

gnasher729 On

For example, Latin uppercase A and Greek uppercase Alpha look identical. Setting aside how the replacement itself is implemented, one safe strategy is to convert all input into an acceptable form as early as possible and throw the original away.

I would do it in this order:

1. Throw away any invalid UTF-8 sequences, so that every sequence of code points has exactly one byte representation.
2. Throw away any invalid code points, including a byte order mark at the start of the text. Now you have a sequence of valid code points.
3. Put all characters that can be written as more than one code point sequence into a canonical form, preferably the fully decomposed form (NFD). Now every sequence of characters has a single representation.
4. Replace confusable characters with the most common character. Now no two different character sequences look the same.
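The four steps could be sketched roughly like this in Python. This is only an illustration under a few assumptions: `sanitize` and `CONFUSABLES` are names I made up, the table is a tiny hand-written stand-in for one generated from the Unicode confusables data, and "invalid code points" is interpreted here as unassigned code points and surrogates (which depends on the Unicode version shipped with your Python).

```python
import unicodedata

# Hypothetical, deliberately tiny table; a real one would be generated
# from the Unicode confusables data.
CONFUSABLES = {
    "\u0415": "E",  # CYRILLIC CAPITAL LETTER IE
    "\u0421": "C",  # CYRILLIC CAPITAL LETTER ES
    "\u041e": "O",  # CYRILLIC CAPITAL LETTER O
    "\u0422": "T",  # CYRILLIC CAPITAL LETTER TE
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA
}


def sanitize(raw: bytes) -> str:
    # 1. Drop invalid UTF-8 sequences while decoding.
    text = raw.decode("utf-8", errors="ignore")
    # 2. Drop a leading byte order mark, then unassigned (Cn) and
    #    surrogate (Cs) code points. Assumption: that is what "invalid" means.
    if text.startswith("\ufeff"):
        text = text[1:]
    text = "".join(
        ch for ch in text if unicodedata.category(ch) not in ("Cn", "Cs")
    )
    # 3. Canonical form: full canonical decomposition (NFD).
    text = unicodedata.normalize("NFD", text)
    # 4. Replace confusable characters with their common look-alike.
    return "".join(CONFUSABLES.get(ch, ch) for ch in text)


raw = "\ufeffFR\u0415\u0415 DIS\u0421\u041eRD NI\u0422RO".encode("utf-8") + b"\xff"
print(sanitize(raw))
```

The order matters: normalizing before the confusable lookup means the table only has to cover decomposed characters, not every precomposed variant.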

You need to be careful with later processing steps, for example a spelling checker that might change a Greek word starting with a Latin A back to Greek. You could avoid this by making the character replacement context-dependent: for example, pick capital Alpha if the letter is surrounded by Greek characters. This is fine as long as you can never end up with two different sequences of bytes that are displayed the same.
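A rough sketch of that context-dependent idea, under my own assumptions: the script of a character is guessed from the first word of its Unicode name, the look-alike groups in `LOOKALIKES` are a small hand-written sample, and each whitespace-separated word is normalized to the script of its unambiguous letters (defaulting to Latin when there are none). All names here are hypothetical.

```python
import unicodedata
from collections import Counter

# Hypothetical look-alike groups: script name -> that script's glyph.
LOOKALIKES = [
    {"LATIN": "A", "GREEK": "\u0391", "CYRILLIC": "\u0410"},
    {"LATIN": "O", "GREEK": "\u039f", "CYRILLIC": "\u041e"},
    {"LATIN": "C", "CYRILLIC": "\u0421"},
]
# Map every ambiguous character to the group it belongs to.
VARIANTS = {ch: group for group in LOOKALIKES for ch in group.values()}


def script_of(ch):
    """Guess the script from the Unicode character name (a crude heuristic)."""
    name = unicodedata.name(ch, "")
    for script in ("LATIN", "GREEK", "CYRILLIC"):
        if name.startswith(script):
            return script
    return None


def normalize_word(word):
    # Only unambiguous letters vote for the script of the word.
    votes = Counter(
        script_of(ch) for ch in word if ch not in VARIANTS and script_of(ch)
    )
    script = votes.most_common(1)[0][0] if votes else "LATIN"
    # Swap each ambiguous letter for the chosen script's glyph, if it has one.
    return "".join(
        VARIANTS[ch].get(script, ch) if ch in VARIANTS else ch for ch in word
    )


def normalize_text(text):
    return " ".join(normalize_word(word) for word in text.split(" "))


print(normalize_text("DIS\u0421\u041eRD"))          # Cyrillic letters in a Latin word
print(normalize_text("A\u03b8\u03ae\u03bd\u03b1"))  # Latin A in a Greek word
```

Because the result depends only on the unambiguous letters in the word, two inputs that render identically should come out as the same code point sequence, which is the property the filter needs.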