PySpark Return Exact Match from list of strings

3.2k views Asked by At

I have a dataset as follows:

| id | text |
--------------
| 01 | hello world |
| 02 | this place is hell |

I also have a list of keywords I'm search for: Keywords = ['hell', 'horrible', 'sucks']

When using the following solution using .rlike() or .contains(), sentences with either partial and exact matches to the list of words are returned to be true. I would like only exact matches to be returned.

Current code:


KEYWORDS = 'hell|horrible|sucks'
df = (
            df
            .select(
                F.col('id'),
                F.col('text'),
                F.when(F.col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
            )
)

Current output:

| id | text | keyword_found |
-------------------------------
| 01 | hello world | 1 |
| 02 | this place is hell | 1 |

Expected output:

| id | text | keyword_found |
--------------------------------
| 01 | hello world | 0 |
| 02 | this place is hell | 1 |
2

There are 2 answers

4
Mahesh Gupta On

Try below code, I have just change the Keyword only :

from pyspark.sql.functions import col,when


data = [["01","hello world"],["02","this place is hell"]]
schema =["id","text"]
df2 = spark.createDataFrame(data, schema)
df2.show()
+---+------------------+
| id|              text|
+---+------------------+
| 01|       hello world|
| 02|this place is hell|
+---+------------------+

KEYWORDS = '(hell|horrible|sucks)$'

df = (
            df2
            .select(
                col('id'),
                col('text'),
                when(col('text').rlike(KEYWORDS), 1).otherwise(0).alias('keyword_found')
            )
)
df.show()

+---+------------------+-------------+
| id|              text|keyword_found|
+---+------------------+-------------+
| 01|       hello world|            0|
| 02|this place is hell|            1|
+---+------------------+-------------+

Let me know if you need more help on this.

0
Sudhin On

This should work

Keywords = 'hell|horrible|sucks'

df = (df.select(F.col('id'),F.col('text'),F.when(F.col('text').rlike('('+Keywords+')(\s|$)').otherwise(0).alias('keyword_found')))
id text keyword_found
01 hello world 0
02 this place is hell 1