I'm trying to use filter to find the 'title' values that are not in list_A.
# collect the titles that appear more than once
A = B.groupBy("title").count()
A = A.filter(A['count'] > 1)
# round-trip through Pandas to get a plain Python list
A_df = A.toPandas()
list_A = A_df['title'].values.tolist()
# count the rows whose title is NOT in that list
B.filter(~B.title.isin(list_A)).count()
However, the filtered dataframe comes back empty (the count is 0).
It works fine when I use isin without the negation (B.filter(B.title.isin(list_A)).count() returns a nonzero count).
Why does this happen, and how can I solve it?
I tried:
# drop rows with a null title, then retry the negated filter
B = B.na.drop(subset=["title"])
B.filter(~B.title.isin(list_A)).count()
# also try explicitly keeping null titles
print(B.filter(~B.title.isin(list_A) | B.title.isNull()).count())
It still returns 0.


It may be because some of the other "title" values are null: isin returns NULL rather than False for a NULL input, negating NULL still gives NULL, and filter drops rows whose condition is NULL.
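You can see this directly with a tiny made-up example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), (None,)], ["title"])
# "a" passes the test, but the NULL row produces a NULL condition,
# which filter treats as "drop the row"
print(df.filter(~df.title.isin(["b"])).count())  # prints 1, not 2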
If you really need list_A, you shouldn't go through Pandas for it: you can either use collect or collect_set (see the sketch just below). Finally, to express your current query entirely in PySpark, you should use window functions.
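For example, reusing the A from your script (an untested sketch):

from pyspark.sql import functions as F

# pull the duplicated titles straight back to the driver
list_A = [row["title"] for row in A.select("title").collect()]
# or aggregate them into a single array column first
list_A = A.agg(F.collect_set("title")).first()[0]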
Input:
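(Your actual input wasn't shown; this assumed minimal example, where every non-null title is duplicated, reproduces your count of 0.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# every non-null title appears twice, plus one NULL row
B = spark.createDataFrame([("a",), ("a",), ("b",), ("b",), (None,)], ["title"])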
Your current script:
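With that assumed input, your script does this:

A = B.groupBy("title").count()
A = A.filter(A['count'] > 1)
A_df = A.toPandas()
list_A = A_df['title'].values.tolist()  # ["a", "b"]
# "a" and "b" fail ~isin, and the NULL title makes the condition NULL,
# so no row survives
print(B.filter(~B.title.isin(list_A)).count())  # prints 0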
Suggestion:
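A sketch of the window-function approach (the cnt column name is my own choice). F.count("title") ignores NULLs, so the NULL-titled row gets cnt = 0 and is kept:

from pyspark.sql import Window
from pyspark.sql import functions as F

# count each title's occurrences with a window instead of a Pandas round trip
w = Window.partitionBy("title")
result = (
    B.withColumn("cnt", F.count("title").over(w))
     .filter(F.col("cnt") <= 1)
     .drop("cnt")
)
print(result.count())  # prints 1: only the NULL-titled row remains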