Pyspark: Serialized task exceeds max allowed. Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values

Question

Pyspark: Serialized task exceeds max allowed. Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values

46.4k views Asked by Wendy De Wit At 31 January 2019 at 10:47

I'm doing calculations on a cluster and at the end when I ask summary statistics on my Spark dataframe with df.describe().show() I get an error:

Serialized task 15:0 was 137500581 bytes, which exceeds max allowed: spark.rpc.message.maxSize (134217728 bytes). Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values

In my Spark configuration I already tried to increase the aforementioned parameter:

spark = (SparkSession
         .builder
         .appName("TV segmentation - dataprep for scoring")
         .config("spark.executor.memory", "25G")
         .config("spark.driver.memory", "40G")
         .config("spark.dynamicAllocation.enabled", "true")
         .config("spark.dynamicAllocation.maxExecutors", "12")
         .config("spark.driver.maxResultSize", "3g")
         .config("spark.kryoserializer.buffer.max.mb", "2047mb")
         .config("spark.rpc.message.maxSize", "1000mb")
         .getOrCreate())

I also tried to repartition my dataframe using:

dfscoring=dfscoring.repartition(100)

but still I keep on getting the same error.

My environment: Python 3.5, Anaconda 5.0, Spark 2

How can I avoid this error ?

Original Q&A

There are 6 answers

**Nadia Tomova** · Answer 1 · 2019-03-25T13:51:51+00:00

I had the same issue and it wasted a day of my life that I am never getting back. I am not sure why this is happening, but here is how I made it work for me.

Step 1: Make sure that PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set. Turned out that python in worker(2.6) had a different version than in driver(3.6). You should check if environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

I fixed it by simply switching my kernel from Python 3 Spark 2.2.0 to Python Spark 2.3.1 in Jupyter. You may have to set it up manually. Here is how to make sure your PySpark is set up correctly https://mortada.net/3-easy-steps-to-set-up-pyspark.html

STEP 2: If that doesn't work, try working around it: This kernel switch worked for DFs that I haven't added any columns to: spark_df -> panda_df -> back_to_spark_df .... but it didn't work on the DFs where I had added 5 extra columns. So what I tried and it worked was the following:

# 1. Select only the new columns: 

    df_write = df[['hotel_id','neg_prob','prob','ipw','auc','brier_score']]


# 2. Convert this DF into Spark DF:



     df_to_spark = spark.createDataFrame(df_write)
     df_to_spark = df_to_spark.repartition(100)
     df_to_spark.registerTempTable('df_to_spark')


# 3. Join it to the rest of your data:

    final = df_to_spark.join(data,'hotel_id')


# 4. Then write the final DF. 

    final.write.saveAsTable('schema_name.table_name',mode='overwrite')

Hope that helps!

**libin** · Answer 2 · 2019-10-08T03:04:47+00:00

i'm in same trouble, then i solve it. the cause is spark.rpc.message.maxSize if default set 128M, you can change it when launch a spark client, i'm work in pyspark and set the value to 1024, so i write like this:

pyspark --master yarn --conf spark.rpc.message.maxSize=1024

solve it.

**Fern** · Answer 3 · 2019-12-26T10:41:54+00:00

Fern On 26 December 2019 at 10:41

I had the same problem but using Watson studio. My solution was:

sc.stop()
configura=SparkConf().set('spark.rpc.message.maxSize','256')
sc=SparkContext.getOrCreate(conf=configura)
spark = SparkSession.builder.getOrCreate()

I hope it help someone...

**Abhijeet Kelkar** · Answer 4 · 2022-08-25T16:55:20+00:00

For those folks, who are looking for AWS Glue script pyspark based way of doing this. The below code snippet might be useful

from awsglue.context import GlueContext
from pyspark.context import SparkContext
from pyspark import SparkConf
myconfig=SparkConf().set('spark.rpc.message.maxSize','256') 
#SparkConf can be directly used with its .set  property
sc = SparkContext(conf=myconfig)

glueContext = GlueContext(sc)
..
..

**akshay** · Answer 5 · 2022-10-07T11:26:32+00:00

I had faced the same issue while converting the sparkDF to pandasDF. I am working on Azure-Databricks , first you need to check the memory set in the spark config using below -

spark.conf.get("spark.rpc.message.maxSize")

Then we can increase the memory-

spark.conf.set("spark.rpc.message.maxSize", "500")

**Daniel Sajo** · Answer 6 · 2024-03-08T10:38:24+00:00

Daniel Sajo On 08 March 2024 at 10:38

AWS Glue notebook solution: Run this cell magic before creating the spark and glue contexts. For me this is the only working way to set spark.rpc.message.maxSize in AWS Glue 4.0, as of March 2024.

%%configure -f
{
"conf": "spark.rpc.message.maxSize=512"
}

TechQA.

Pyspark: Serialized task exceeds max allowed. Consider increasing spark.rpc.message.maxSize or using broadcast variables for large values

There are 6 answers

Related Questions in DATAFRAME

Related Questions in PYSPARK

Related Questions in MESSAGE

Related Questions in RPC

Related Questions in MAX-SIZE

Popular Questions

Trending Questions