Python regex matches can't match specific word in a Bengali Sentence

Question

Asked by appledora on 17 October 2022 at 20:16

Odd little problem here. I have this (random) sentence in Bengali: "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।" I ran a regex search on it (using the Python re library) for two of its words, like this:

সুগঠিত ("token #4") :

re.search(r"\bসুগঠিত\b", "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।")
# => <re.Match object; span=(19, 25), match='সুগঠিত'>

কবিতা ("token #2"):

re.search(r"\bকবিতা\b", "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।")
# => None

Any idea why this might be happening?

UPDATE (from answer suggestions below) :

The cause is the diacritics used in Bengali (and other Indic languages); see the accepted answer.

Original Q&A

There are 2 answers

Mohammad Nasiruddin On 26 April 2023 at 07:32

It works for me with the regex module in place of re:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import regex

print(regex.search(r"\bসুগঠিত\b", "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।"))
print(regex.search(r"\bকবিতা\b", "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।"))

**Wiktor Stribiżew** · Accepted Answer · 2022-10-18T10:34:18+00:00

If you check which characters your কবিতা consists of (I like to use an online Unicode inspector for this), you will see that the last one is U+09BE, BENGALI VOWEL SIGN AA, which belongs to the Mc (Mark, Spacing Combining) Unicode category.
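The same check can be done locally with the standard library. This is a minimal sketch; the word is written with escape sequences (an assumption made here so the code points are unambiguous), and the variable names are illustrative only.

```python
import unicodedata

# কবিতা spelled out code point by code point
word = "\u0995\u09AC\u09BF\u09A4\u09BE"

# Print code point, general category, and official name of each character
for ch in word:
    print(f"U+{ord(ch):04X}  {unicodedata.category(ch)}  {unicodedata.name(ch)}")

categories = [unicodedata.category(ch) for ch in word]
print(categories)  # ['Lo', 'Lo', 'Mc', 'Lo', 'Mc']
```

Both vowel signs in the word are Mc, but only the final one matters here, because it is the character that sits next to the trailing \b.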

Note that Mc category characters are not word characters in the re module. Python's re \w matches the characters for which str.isalnum() returns true, plus the underscore, and combining marks such as U+09BE are not among them.

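The trailing \b in your pattern sits right after the AA vowel sign. A word boundary requires a word character on exactly one side, and AA is a non-word character, so this \b can only match if a word character immediately follows it. In your sentence AA is followed by a space (another non-word character), so there is no boundary and the search returns None. সুগঠিত matches because it both begins and ends with a letter.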
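The behaviour can be reproduced in isolation. The following sketch uses only the standard re module; the short test string and the variable names are made up for illustration.

```python
import re

AA = "\u09BE"                             # BENGALI VOWEL SIGN AA (category Mc)
word = "\u0995\u09AC\u09BF\u09A4\u09BE"   # কবিতা
text = word + " x"                        # the word followed by a space

# re does not treat the vowel sign as a word character
aa_is_word_char = re.fullmatch(r"\w", AA) is not None
print(aa_is_word_char)        # False

# A trailing \b between AA and the space finds no boundary
with_trailing_b = re.search(word + r"\b", text)
print(with_trailing_b)        # None

# The leading \b is fine, because the word starts with a letter
without_trailing_b = re.search(r"\b" + word, text)
print(without_trailing_b.span())  # (0, 5)
```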

Thus, if you need combining marks to count as word characters when word boundaries are computed, use the PyPI regex library, where the issue is fixed:

Definition of 'word' character (issue #1693050)

The definition of a 'word' character has been expanded for Unicode. It conforms to the Unicode specification at http://www.unicode.org/reports/tr29/.

With regex, the original search succeeds:

import regex  # third-party package: pip install regex
print(regex.search(r"\bকবিতা\b", "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।"))
# => <regex.Match object; span=(5, 10), match='কবিতা'>
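If installing a third-party package is not an option, a whole-word search can be approximated with the standard library alone. The sketch below is not part of the answer above and is not how regex implements \b. It is a hypothetical helper (find_whole_word and is_word_char are names invented here) that assumes any character in a Unicode Mark category (Mn, Mc, Me) should count as part of a word.

```python
import unicodedata

def is_word_char(ch: str) -> bool:
    """Letters, digits, underscore, and combining marks count as word characters."""
    return ch.isalnum() or ch == "_" or unicodedata.category(ch).startswith("M")

def find_whole_word(word: str, text: str) -> int:
    """Return the index of the first whole-word occurrence of word in text, or -1."""
    start = text.find(word)
    while start != -1:
        end = start + len(word)
        # The neighbours on both sides must be absent or non-word characters
        before_ok = start == 0 or not is_word_char(text[start - 1])
        after_ok = end == len(text) or not is_word_char(text[end])
        if before_ok and after_ok:
            return start
        start = text.find(word, start + 1)
    return -1

sentence = "তিনি কবিতা প্রিয়, সুগঠিত স্বাস্থ্যের অধিকারী।"
print(find_whole_word("কবিতা", sentence))
print(find_whole_word("সুগঠিত", sentence))
```

This only handles literal words, not arbitrary patterns, so for anything beyond a simple lookup the regex library remains the more practical choice.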
