I'm trying to use filter to find the 'title' values that are not in list_A.
# collect the titles that appear more than once
A = B.groupBy("title").count()
A = A.filter(A['count'] > 1)
# round-trip through Pandas to get a plain Python list
A_df = A.toPandas()
list_A = A_df['title'].values.tolist()
# count the rows whose title is NOT in that list
B.filter(~B.title.isin(list_A)).count()
However, the filtered dataframe comes back empty (the count is 0).
It works fine when I use isin without the negation (B.filter(B.title.isin(list_A)).count() returns a nonzero count).
Why does this happen, and how can I solve it?
I tried:
# drop rows with a null title, then retry the negated filter
B = B.na.drop(subset=["title"])
B.filter(~B.title.isin(list_A)).count()
# also try explicitly keeping null titles
print(B.filter(~B.title.isin(list_A) | B.title.isNull()).count())
It still returns 0.


It may be because some of the other "title" values are null: isin returns NULL rather than False for a NULL input, negating NULL still gives NULL, and filter drops rows whose condition is NULL.
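You can see this directly with a tiny made-up example:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), (None,)], ["title"])
# "a" passes the test, but the NULL row produces a NULL condition,
# which filter treats as "drop the row"
print(df.filter(~df.title.isin(["b"])).count())  # prints 1, not 2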
If you really need list_A, you shouldn't go through Pandas for it: you can either use collect or collect_set (see the sketch just below). Finally, to express your current query entirely in PySpark, you should use window functions.
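For example, reusing the A from your script (an untested sketch):

from pyspark.sql import functions as F

# pull the duplicated titles straight back to the driver
list_A = [row["title"] for row in A.select("title").collect()]
# or aggregate them into a single array column first
list_A = A.agg(F.collect_set("title")).first()[0]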
Input:
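(Your actual input wasn't shown; this assumed minimal example, where every non-null title is duplicated, reproduces your count of 0.)

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# every non-null title appears twice, plus one NULL row
B = spark.createDataFrame([("a",), ("a",), ("b",), ("b",), (None,)], ["title"])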
Your current script:
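With that assumed input, your script does this:

A = B.groupBy("title").count()
A = A.filter(A['count'] > 1)
A_df = A.toPandas()
list_A = A_df['title'].values.tolist()  # ["a", "b"]
# "a" and "b" fail ~isin, and the NULL title makes the condition NULL,
# so no row survives
print(B.filter(~B.title.isin(list_A)).count())  # prints 0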
Suggestion:
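A sketch of the window-function approach (the cnt column name is my own choice). F.count("title") ignores NULLs, so the NULL-titled row gets cnt = 0 and is kept:

from pyspark.sql import Window
from pyspark.sql import functions as F

# count each title's occurrences with a window instead of a Pandas round trip
w = Window.partitionBy("title")
result = (
    B.withColumn("cnt", F.count("title").over(w))
     .filter(F.col("cnt") <= 1)
     .drop("cnt")
)
print(result.count())  # prints 1: only the NULL-titled row remains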