Using pyspark 2.4, I am doing a left join of a dataframe on itself.
df = df.alias("t1") \
.join(df.alias("t2"),
col(t1_anc_ref) == col(t2_anc_ref), "left")
The resulting structure of this join is the following:
root
|-- anc_ref_1: string (nullable = true)
|-- anc_ref_2: string (nullable = true)
|-- anc_ref_1: string (nullable = true)
|-- anc_ref_2: string (nullable = true)
I would like to be able to drop the penultimate column of this dataframe (anc_ref_1).
Using the column name is not possible, as there are duplicates. So instead of this, I select the column by index and then try to drop it:
col_to_drop = len(df.columns) - 2
df= df.drop(df[col_to_drop])
However, that gives me the following error:
pyspark.sql.utils.AnalysisException: "Reference 'anc_ref_1' is ambiguous, could be: t1.anc_ref_1, t2.anc_ref_1.;"
Question:
When I print the schema, there is no mention of t1 and t2 in column names. Yet it is mentionned in the stack trace. Why is that and can I use it to reference a column ?
I tried df.drop("t2.anc_ref_1")
but it had no effect (no column dropped)
EDIT: Works well with df.drop(col("t2.anc_ref_1"))
How can I handle the duplicate column names ? I would like to rename/drop so that the result is:
root
|-- anc_ref_1: string (nullable = true)
|-- anc_ref_2: string (nullable = true)
|-- anc_ref_1: string (nullable = true) -> dropped
|-- anc_ref_2: string (nullable = true) -> renamed to anc_ref_3
Option1
drop the column by referring to the original source dataframe. Data
Code
Option 2 Use a string sequence to join and then select the join column