Replace a DataFrame with its alias in select in PySpark

I am trying to learn about the DataFrame alias method in PySpark. Here is what I observe.

Suppose I have a sample dataframe

t1_df = spark.createDataFrame([['a'], ['b']], 'c1: string')

t1_df.show()
+---+
| c1|
+---+
|  a|
|  b|
+---+

Now I create an alias for it:

t2_df = t1_df.alias('df1')

If I select the second DataFrame's column from the first DataFrame, it works fine:

t1_df.select(t2_df.c1).show()

+---+
| c1|
+---+
|  a|
|  b|
+---+

However, if I try the same thing using the alias name, it doesn't work:

from pyspark.sql.functions import col

t1_df.select(col('df1.c1'))

AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] A column or function parameter with name `df1`.`c1` cannot be resolved. Did you mean one of the following? [`c1`].;
'Project ['df1.c1]
+- LogicalRDD [c1#3311], false

Why is that? How does an alias work? I don't have any specific purpose here; I am just curious and experimenting.

I am using Spark version 3.4.1.

There are 2 answers

Answer by Dhruv (accepted):

The df1 qualifier is only recognizable when the aliased DataFrame t2_df is part of the query. The alias itself doesn't create anything new; it is just recorded as a qualifier in the column's metadata.
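
To confirm, the qualified name resolves only against the aliased DataFrame. A minimal PySpark sketch, reusing the question's t1_df and t2_df:

from pyspark.sql.functions import col

# Works: t2_df's plan contains the SubqueryAlias node that carries
# the 'df1' qualifier, so 'df1.c1' can be resolved.
t2_df.select(col('df1.c1')).show()

# Fails: t1_df's plan has no 'df1' qualifier anywhere, so the same
# reference raises AnalysisException [UNRESOLVED_COLUMN.WITH_SUGGESTION].
# t1_df.select(col('df1.c1'))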

To see that qualifier, we can run the equivalent code in Scala and inspect the analyzed plan:

t2_df.select("c1").queryExecution.analyzed.prettyJson
Output

[ {
  "class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
  "num-children" : 1,
  "projectList" : [ [ {
    "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
    "num-children" : 0,
    "name" : "c1",
    "dataType" : {
      "type" : "array",
      "elementType" : "string",
      "containsNull" : true
    },
    "nullable" : true,
    "metadata" : { },
    "exprId" : {
      "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
      "id" : 6,
      "jvmId" : "3271baa8-0b0d-4e6c-bf89-f982cb40a636"
    },
    "qualifier" : "[df1]"         // This is the metadata added
  } ] ],
...
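
If you prefer to stay in PySpark, the same JSON can be reached through the DataFrame's underlying JVM object. Note that _jdf is a PySpark-internal attribute, not a public API, so this is only a sketch and may break across versions:

# Mirrors the Scala snippet above by reaching into the JVM Dataset;
# _jdf is internal, so use this for exploration only.
print(t2_df.select('c1')._jdf.queryExecution().analyzed().prettyJson())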
Answer by s.polam:

The exception message clearly states that the column df1.c1 does not exist in the t1_df DataFrame:

pyspark.errors.exceptions.captured.AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION] The column or function parameter with the name `df1`.`c1` cannot be resolved. Did you mean one of the following? [`c1`]
  • Resolving t2_df.c1 will result in Column<'c1'>, and since the c1 column does exist in the t1_df DataFrame, it executes successfully.
  • Resolving col('df1.c1') will result in Column<'df1.c1'>. However, this column does not exist in the t1_df DataFrame, leading to the exception AnalysisException: [UNRESOLVED_COLUMN.WITH_SUGGESTION].
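
You can inspect the two Column objects directly to see the difference (reprs as of Spark 3.4):

from pyspark.sql.functions import col

# Bound to t2_df's c1 attribute; prints Column<'c1'>
print(t2_df.c1)

# An unresolved, qualified reference; prints Column<'df1.c1'>
print(col('df1.c1'))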

You can compare the query plans; the two do not match:

t1_df.explain(True)

== Parsed Logical Plan ==
LogicalRDD [c1#8], false

== Analyzed Logical Plan ==
c1: string
LogicalRDD [c1#8], false

== Optimized Logical Plan ==
LogicalRDD [c1#8], false

== Physical Plan ==
*(1) Scan ExistingRDD[c1#8]

t2_df.explain(True)

== Parsed Logical Plan ==
SubqueryAlias df1
+- LogicalRDD [c1#8], false

== Analyzed Logical Plan ==
c1: string
SubqueryAlias df1
+- LogicalRDD [c1#8], false

== Optimized Logical Plan ==
LogicalRDD [c1#8], false

== Physical Plan ==
*(1) Scan ExistingRDD[c1#8]
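
As an aside, the place where an alias genuinely pays off is a self-join, where both sides share column names and the qualifier is what disambiguates them. A minimal sketch, reusing t1_df from the question:

from pyspark.sql.functions import col

# Alias both sides so 'a.c1' and 'b.c1' are unambiguous references
# to the two copies of the same underlying DataFrame.
joined = t1_df.alias('a').join(t1_df.alias('b'), col('a.c1') == col('b.c1'))

joined.select(col('a.c1'), col('b.c1')).show()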
