I'm using Spark SQL version 3.1.3. Here is the query:
SELECT
  count(DISTINCT test_table.name) AS cnt
FROM
  test_table
WHERE
  test_table.inv_date >= '20230101'
  AND test_table.inv_date < '20230601'
  AND (
    test_table.name NOT IN (
      SELECT
        DISTINCT test_table.name AS name
      FROM
        test_table
      WHERE
        test_table.inv_date >= '20230101'
        AND test_table.inv_date < '20230501'
    )
  )
Here's the Spark application UI.
Why does a query containing "NOT IN" still perform a broadcast join even though I have disabled it?
I disabled Broadcast Hash Join by setting spark.sql.autoBroadcastJoinThreshold to -1, but the query above still performed a broadcast.
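For anyone trying to reproduce this, here is a minimal sketch of the setup. It assumes the threshold is applied with a SQL SET statement in the same session (passing it with --conf at submit time should have the same effect), and it uses EXPLAIN FORMATTED only as one way to see which join operator the planner picks without opening the UI.

```sql
-- Assumption: the setting is applied per session; -1 disables size-based automatic broadcast.
SET spark.sql.autoBroadcastJoinThreshold = -1;

-- Print the physical plan for the query to see which join operator is chosen.
EXPLAIN FORMATTED
SELECT
  count(DISTINCT test_table.name) AS cnt
FROM
  test_table
WHERE
  test_table.inv_date >= '20230101'
  AND test_table.inv_date < '20230601'
  AND (
    test_table.name NOT IN (
      SELECT
        DISTINCT test_table.name AS name
      FROM
        test_table
      WHERE
        test_table.inv_date >= '20230101'
        AND test_table.inv_date < '20230501'
    )
  );
```

Running SET spark.sql.autoBroadcastJoinThreshold; on its own afterwards prints the current value, which is a quick way to confirm the setting actually took effect in the session.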