Methods missing from PySpark 2.4's pyspark.sql.functions but still working in the local environment


I'm using PySpark 2.4 and noticed that the pyspark.sql.functions module appears to be missing some functions such as trim and col: PyCharm flags them as unresolved references. However, tasks I have written using these functions run correctly in my local PySpark 2.4 environment and produce the expected results. Why is that?

Here is my environment setup:

from pyspark.sql import SparkSession

def create_env():
    spark = SparkSession.builder \
        .appName("HiveTest") \
        .master("local") \
        .config("spark.sql.warehouse.dir", "/user/hive/warehouse") \
        .config("spark.hadoop.hive.metastore.uris", "thrift://master:9083") \
        .config("spark.hadoop.hive.exec.scratchdir", "/user/hive/tmp") \
        .enableHiveSupport() \
        .getOrCreate()
    spark.sparkContext.setLogLevel("ERROR")
    return spark

And here is an excerpt of my SparkSQL code:

from pyspark.sql.functions import col, length, lit, substring, trim, when

# busi_date is defined elsewhere in the script
df = spark.table("ods.t_ctp20_department_d").select(
    trim(col("departmentid")).alias("branch_id"),
    trim(col("departmentid")).alias("branch_no"),
    trim(col("departmentname")).alias("branch_name"),
    when(trim(col("departmentid")) == 'FU', '00')
    .when(length(trim(col("departmentid"))) == 2, 'FU')
    .when(length(trim(col("departmentid"))) == 4, substring(trim(col("departmentid")), 1, 2))
    .when(length(trim(col("departmentid"))) == 6, substring(trim(col("departmentid")), 1, 4))
    .otherwise(substring(trim(col("departmentid")), 1, 6)).alias("up_branch_no"),
    lit('0').alias("branch_type"),
    lit('00').alias("data_source"),
    col("brokerid").alias("brokers_id"),
    lit(busi_date).alias("ds_date")
)
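For reference, the when/otherwise chain above derives a parent branch number from the department id's length. Here is a plain-Python sketch of the same rules (my reading of the logic, not part of the original script):

```python
def up_branch_no(departmentid: str) -> str:
    """Mirror of the when/otherwise chain: map a department id to its parent."""
    d = departmentid.strip()
    if d == 'FU':          # the root department points at '00'
        return '00'
    if len(d) == 2:        # 2-char ids hang directly under 'FU'
        return 'FU'
    if len(d) == 4:        # 4-char ids: parent is the first 2 chars
        return d[:2]
    if len(d) == 6:        # 6-char ids: parent is the first 4 chars
        return d[:4]
    return d[:6]           # anything longer: parent is the first 6 chars

print(up_branch_no("AB01"))  # prints AB
```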

I tried using the trim and col functions from the pyspark.sql.functions module in my PySpark 2.4 code. Surprisingly, even though PyCharm highlighted them as undefined, the code still executed successfully in the local PySpark 2.4 environment and produced the expected results.

I run the script either by executing "python3 xx.python" or through a remote interpreter in PyCharm. The remote interpreter is a virtual environment with only the pyspark 2.4 package installed.

The script itself runs fine in PyCharm, yet the IDE reports that the function is not defined wherever I access the pyspark 2.4 API.

I would like to understand the reason behind this. Is there any additional configuration required in PyCharm when using PySpark 2.4? Thank you for your assistance!


There are 2 answers

Answer by shay__:

This is because col, lit, trim and a number of other functions are bound dynamically: in PySpark 2.4, functions.py builds them in a loop at import time and injects them into the module namespace via globals(), rather than defining each one with an explicit def. This pattern goes back to very early versions of Spark and appears to exist to handle compatibility across versions. Because there is no static definition for the IDE to index, PyCharm reports the names as undefined even though they exist at runtime.
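To see why static analysis fails here, this minimal sketch reproduces the mechanism (it is an illustration of the pattern, not PySpark's actual code; the real functions dispatch to the JVM via py4j):

```python
# Functions are generated from a name -> docstring table and injected
# into the module namespace with globals(). A static analyzer never
# sees a `def col(...)`, so it cannot resolve the name -- but at
# runtime the attribute exists and is callable.
def _create_function(name, doc=""):
    def _(col):
        # Stand-in body: return a tagged string instead of a JVM call.
        return "%s(%s)" % (name, col)
    _.__name__ = name
    _.__doc__ = doc
    return _

_functions = {
    'col': 'Returns a Column based on the given column name.',
    'lit': 'Creates a Column of literal value.',
    'trim': 'Trims spaces from both ends of the string column.',
}

for _name, _doc in _functions.items():
    globals()[_name] = _create_function(_name, _doc)

print(col("departmentid"))  # prints col(departmentid)
```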

Answer by Simon Mau:

I ran the following commands in the interpreter's environment to get the IDE to support both versions of the API; afterwards, code completion worked against both the 2.4 and 3.2 sources:

pip install pyspark==2.4.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install pyspark==3.2.0 -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install pyspark==2.4.0 -i https://pypi.tuna.tsinghua.edu.cn/simple

It seems the uninstall of version 3.2 was not clean, which incidentally leaves the IDE able to resolve features from both versions. If anyone has encountered a similar issue, you can try this approach.