I need to detect text with Unicode characters restricting it to letters only (e.g. no symbols, emojis, etc., just something that can be used in a person's name in any Unicode language). The \p{L} category seems to do the trick, but it does not recognize Thai strings. I do not speak Thai, so I got a few common Thai names from ChatGPT and they all fail in my test. Tried it at RegExr (see the Tests tab) and also wrote a simple test program:
using System.Text.RegularExpressions;
Console.OutputEncoding = System.Text.Encoding.UTF8;
string pattern = @"^[\p{L}\s]+$";
string englishText = "Mary";
Console.Write($"{englishText}: ");
Console.WriteLine(Regex.IsMatch(englishText, pattern, RegexOptions.IgnoreCase).ToString()); // true
string germanText = "RöschenÜmit";
Console.Write($"{germanText}: ");
Console.WriteLine(Regex.IsMatch(germanText, pattern, RegexOptions.IgnoreCase).ToString()); // true
string thaiText = "อรุณรัตน์";
Console.Write($"{thaiText}: ");
Console.WriteLine(Regex.IsMatch(thaiText, pattern, RegexOptions.IgnoreCase).ToString()); // false
string japaneseText = "タクミたくみく";
Console.Write($"{japaneseText}: ");
Console.WriteLine(Regex.IsMatch(japaneseText, pattern, RegexOptions.IgnoreCase).ToString()); // true
I noticed when I try testing each individual character in the Thai string, it seems to recognize them as valid Unicode letters, but as a string, it fails. Just to make sure I do not have any hidden characters, I checked the raw values and I did not see anything suspicious. Any ideas what's going on here?
P.S. I know that some of the characters in the test are from different sets and names may include spaces, dashes, etc., but this is not the point. I'm just trying to solve the Thai strings issue here.
COMMENT: The Thai string contains combining characters, which I guess is what breaks the letter detection, even though each combination looks like a single letter (e.g. {0x0e23, 0x0e38} renders as "รุ").
If we print out a dump of thaiText, we'll see the cause of the misbehaviour: characters of the NonSpacingMark category sit between the OtherLetter ones, and \p{L} does not match them.
Technically, to get rid of these marks we can use normalization, but it doesn't work on my workstation, and the reason is an issue.
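The dump described above can be sketched roughly as follows. This is a minimal illustration, not necessarily the exact code the answer originally used: the variable names `dump` and `marks` and the output format are my own assumptions. It prints each UTF-16 code unit of the string with its code point and the category reported by `char.GetUnicodeCategory`.

```csharp
using System;
using System.Globalization;
using System.Linq;
using System.Text;

Console.OutputEncoding = Encoding.UTF8;

string thaiText = "อรุณรัตน์";

// One entry per UTF-16 code unit: the character, its code point, and its Unicode category.
string[] dump = thaiText
    .Select(c => $"{c} : U+{(int)c:X4} : {char.GetUnicodeCategory(c)}")
    .ToArray();

Console.WriteLine(string.Join(Environment.NewLine, dump));

// Count the characters that \p{L} will refuse to match.
int marks = thaiText.Count(
    c => char.GetUnicodeCategory(c) == UnicodeCategory.NonSpacingMark);

Console.WriteLine($"Non-spacing marks: {marks}");
```

Any line of the dump showing NonSpacingMark is a character that `[\p{L}\s]` rejects, which is enough to make the whole `^...$` match fail.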
So if normalization doesn't work in your case either (or you want to be on the safe side), you can match Thai characters explicitly: either Thai only, or Thai mixed with all the other letters (letters in general, with Thai as a special case). Alternatively, allow both letters (\p{L}) and these non-spacing marks (\p{Mn}).
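The three options above might look like this in practice. The patterns are a sketch reconstructed from the description, not necessarily the answer's exact ones. The names `thaiOnly`, `lettersOrThai`, and `lettersAndMarks` are hypothetical, and I am assuming the Thai case is expressed with .NET's named block `\p{IsThai}` (U+0E00 to U+0E7F).

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

Console.OutputEncoding = Encoding.UTF8;

// Option 1: Thai only. Note that the Thai block also contains
// Thai digits and a few symbols, not just letters and marks.
string thaiOnly = @"^[\p{IsThai}\s]+$";

// Option 2: any letter, with the Thai block allowed as a special case.
string lettersOrThai = @"^[\p{L}\p{IsThai}\s]+$";

// Option 3: any letter plus any non-spacing mark, regardless of script.
string lettersAndMarks = @"^[\p{L}\p{Mn}\s]+$";

string[] samples = { "Mary", "อรุณรัตน์", "タクミたくみく", "Mary1" };

foreach (string sample in samples)
{
    Console.WriteLine(
        $"{sample}: thaiOnly={Regex.IsMatch(sample, thaiOnly)}, " +
        $"lettersOrThai={Regex.IsMatch(sample, lettersOrThai)}, " +
        $"lettersAndMarks={Regex.IsMatch(sample, lettersAndMarks)}");
}
```

Option 3 is the most general of the three, since it does not single out one script. If names in other scripts still fail, it may be worth checking whether they use spacing combining marks (\p{Mc}), which this pattern does not cover.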