Odd little problem here,
I have this (random) sentence in Bengali : "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।"
I tried to run a regex on it (using Python's `re` library) like this:
- সুগঠিত ("token #4"):
`re.search(r"\bসুগঠিত\b", "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।")` returns `<re.Match object; span=(19, 25), match='সুগঠিত'>`
- কবিতা ("token #2"):
`re.search(r"\bকবিতা\b", "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।")` returns `None`
Any idea why this might be happening?
UPDATE (from answer suggestions below) :
- Check out the Diacritics used in Bengali (and other Indic languages)
If you check what chars your `কবিতা` consists of (I like to use this service), you will learn that the last letter is U+09BE, that is, BENGALI VOWEL SIGN AA, which belongs to the Mc (Mark, Spacing Combining) Unicode category.
Note that chars in the Mc Unicode category do not belong to the word chars in `re` regex. Python `re`'s `\w` matches Unicode alphanumeric chars (as defined by `str.isalnum()`) and the underscore; combining marks, whether Mn (Mark, Nonspacing) or Mc, are not among them.
The last `\b` word boundary in your regex can only match if a word char comes immediately after the AA vowel, because the word boundary appears right after a non-word AA char. Here the AA vowel is followed by a space, so there is no boundary and the match fails. `সুগঠিত` works because it starts and ends with letters.
Thus, if you need all combining marks to count as word chars for the word boundary, you would need to use the PyPI regex library, where the issue is fixed.
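
If you would rather check the characters locally than through a web service, something like this should do it with the standard library only. It is a minimal sketch: `describe` is just an illustrative helper name, not part of any library.

```python
import re
import unicodedata

def describe(word):
    # One row per code point:
    # (U+XXXX, general category, Unicode name, does re's \w match it?)
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            unicodedata.name(ch),
            re.fullmatch(r"\w", ch) is not None,
        )
        for ch in word
    ]

for row in describe("কবিতা"):
    print(row)
# last row: ('U+09BE', 'Mc', 'BENGALI VOWEL SIGN AA', False)
```

The final `False` is the whole story: the word ends in a char that `re` does not treat as a word char, so `\b` has nothing to anchor on before the space.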
See the Python demo online:
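
A minimal sketch of that fix, assuming the third-party `regex` package is installed (`pip install regex`). `find_word` is an illustrative helper name, not something the library provides.

```python
import regex  # third-party package from PyPI, not the stdlib re module

text = "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।"

def find_word(word, text):
    # Same \b...\b pattern as in the question, only compiled by regex,
    # which counts combining marks as word chars.
    return regex.search(rf"\b{regex.escape(word)}\b", text)

for word in ("কবিতা", "সুগঠিত"):
    print(word, find_word(word, text))
```

Both words should now be found, since the trailing AA vowel sign counts as a word char and the space after it forms a boundary.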