In pyspark 2.4, how to handle columns with the same name resulting of a self join?

262 views Asked by At

Using pyspark 2.4, I am doing a left join of a dataframe on itself.

df = df.alias("t1") \
    .join(df.alias("t2"),
          col(t1_anc_ref) == col(t2_anc_ref), "left")

The resulting structure of this join is the following:

root
 |-- anc_ref_1: string (nullable = true)
 |-- anc_ref_2: string (nullable = true)
 |-- anc_ref_1: string (nullable = true)
 |-- anc_ref_2: string (nullable = true)

I would like to be able to drop the penultimate column of this dataframe (anc_ref_1).

Using the column name is not possible, as there are duplicates. So instead of this, I select the column by index and then try to drop it:

col_to_drop = len(df.columns) - 2
df= df.drop(df[col_to_drop])

However, that gives me the following error:

pyspark.sql.utils.AnalysisException: "Reference 'anc_ref_1' is ambiguous, could be: t1.anc_ref_1, t2.anc_ref_1.;"

Question:

When I print the schema, there is no mention of t1 and t2 in column names. Yet it is mentionned in the stack trace. Why is that and can I use it to reference a column ?

I tried df.drop("t2.anc_ref_1") but it had no effect (no column dropped)

EDIT: Works well with df.drop(col("t2.anc_ref_1"))

How can I handle the duplicate column names ? I would like to rename/drop so that the result is:

root
 |-- anc_ref_1: string (nullable = true)
 |-- anc_ref_2: string (nullable = true)
 |-- anc_ref_1: string (nullable = true) -> dropped
 |-- anc_ref_2: string (nullable = true) -> renamed to anc_ref_3
1

There are 1 answers

0
wwnde On

Option1

drop the column by referring to the original source dataframe. Data

df= spark.createDataFrame([ ( 'Value1',           'Something'),
     ('Value2',     '1057873 1057887'),
     ('Value3', 'Something Something'),
     ('Value4',                None),
   ( 'Value5',             '13139'),
    ( 'Value6',     '1463451 1463485'),
   ( 'Value7',     'Not In Database'),
    ( 'Value8',     '1617275 16288') 
  
],( 'anc_ref_1',            'anc_ref'))
  
  df.show()

Code

df_as1 = df.alias("df_as1")
df_as2 = df.alias("df_as2")
df1 = df_as1.join(df_as2, df_as1.anc_ref == df_as2.anc_ref, "left").drop(df_as1.anc_ref_1)#.drop(df_as2.anc_ref)

df1.show()

Option 2 Use a string sequence to join and then select the join column

df_as1.join(df_as2, "anc_ref", "left").select('anc_ref',df_as1.anc_ref_1).show()