Create a token pattern based on the concatenation of some given words


I have a list of keywords, for example:

keywords = ['airbnb.com', 'booking', 'deliveroo.uk - UK', ...]

My goal is to define the token_pattern parameter of CountVectorizer by concatenating all the keywords.

The idea is this:

token_pattern = '|'.join([pattern_keyword_1, pattern_keyword_2, ...])
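Concretely, I mean something like the sketch below (the keyword list is just the example from above; each keyword is escaped with re.escape because entries like 'airbnb.com' contain regex metacharacters such as '.'):

import re
from sklearn.feature_extraction.text import CountVectorizer

keywords = ['airbnb.com', 'booking', 'deliveroo.uk - UK']

# Escape each keyword so '.' and other metacharacters are treated literally,
# then join the escaped keywords with '|' to form one pattern of alternatives
token_pattern = '|'.join(re.escape(k) for k in keywords)

# lowercase=False so 'deliveroo.uk - UK' stays matchable as written
vectorizer = CountVectorizer(token_pattern=token_pattern, lowercase=False)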

What I need is for the pattern to match exact occurrences in the text, not substrings.

For example, if 'def.com' is in the keywords, I DON'T want it to match inside 'abcdef.com'.
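To illustrate the unwanted behaviour with plain re (this snippet only shows the problem, not a solution):

import re

# 'def.com' as a pattern also matches inside 'abcdef.com',
# which is exactly what I want to avoid
print(re.search(re.escape('def.com'), 'abcdef.com'))  # matches 'def.com' at position 3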

Is it possible to do this?

Thanks in advance.
