Error writing a DataFrame to Redshift using PySpark with boolean columns


In my script, PySpark's write method takes a DataFrame and writes it to Redshift. However, some DataFrames contain boolean columns, and writing those fails with an error stating that Redshift does not accept the bit data type.

My question is: why is a column that should be boolean being reported as bit?

The code:

spark = (
    SparkSession.builder.appName("data_quality")
    .config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.jars", "redshift-jdbc42-2.1.0.3.jar")
    .config("spark.hadoop.fs.s3a.access.key", "key")
    .config("spark.hadoop.fs.s3a.secret.key", "secret_key" )
    .config("spark.hadoop.fs.s3a.session.token","tokiem")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

df = spark.createDataFrame(
    [
        (1, False),  # create your data here, be consistent in the types.
        (2, True),
    ],
    ["id", 'column_type_bool']  # add your column names here
)
df.show()
df.dtypes
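
For reference, df.dtypes on this frame should evaluate to something like the following, which confirms that Spark itself sees the column as boolean, not bit:

[('id', 'bigint'), ('column_type_bool', 'boolean')]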

df.write \
  .format("jdbc") \
  .option("url", f"jdbc:redshift://{url_db}:5439/{db_name}") \
  .option("driver", "com.amazon.redshift.jdbc42.Driver") \
  .option("dbtable", f"{schema}.{tab}") \
  .option("user", user_db) \
  .option("password", pw) \
  .option("tempdir", "s3a://path") \
  .mode("overwrite") \
  .save()

The schema:

root
 |-- namecolumn: boolean (nullable = true)

The error:

Py4JJavaError: An error occurred while calling o113.save.
: com.amazon.redshift.util.RedshiftException: ERROR: Column "nametable.namecolumn" has unsupported type "bit".
    at com.amazon.redshift.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2601)
    at com.amazon.redshift.core.v3.QueryExecutorImpl.processResultsOnThread(QueryExecutorImpl.java:2269)
    at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1880)
    at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1872)
    at com.amazon.redshift.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:368)
    at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeInternal(RedshiftStatementImpl.java:514)
    at com.amazon.redshift.jdbc.RedshiftStatementImpl.execute(RedshiftStatementImpl.java:435)
    at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeWithFlags(RedshiftStatementImpl.java:376)
    at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeCachedSql(RedshiftStatementImpl.java:362)
    at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeWithFlags(RedshiftStatementImpl.java:339)
    at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeUpdate(RedshiftStatementImpl.java:297)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.executeStatement(JdbcUtils.scala:1026)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:912)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:80)

1 Answer

dineshkumar g answered:

The traceback shows the failure inside JdbcUtils.createTable, i.e. in the CREATE TABLE statement Spark issues before loading any rows. Spark's JDBC writer has no built-in Redshift dialect, so it maps Spark's BooleanType to the generic BIT type in that DDL, and Redshift does not support BIT. One workaround is to cast the boolean columns to strings (uppercased to TRUE/FALSE), so the table is created with a VARCHAR column instead:

from pyspark.sql.functions import col, upper

# "col2" stands for the boolean column ("column_type_bool" in the question).
df = df.withColumn("col2_str", col("col2").cast("string"))   # True/False -> "true"/"false"
df = df.withColumn("col2_redshift", upper(col("col2_str")))  # -> "TRUE"/"FALSE"
df = df.drop("col2", "col2_str").withColumnRenamed("col2_redshift", "col2")
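
Since the question's column is named column_type_bool rather than col2, a more general sketch (my generalization, not part of the original answer) applies the same cast to every boolean column in the frame:

from pyspark.sql.functions import col, upper

# Replace each boolean column with its uppercase string representation.
for name, dtype in df.dtypes:
    if dtype == "boolean":
        df = df.withColumn(name, upper(col(name).cast("string")))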

After this, try writing the DataFrame again.
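
If the column should stay a real boolean in Redshift instead of becoming a string, Spark's JDBC writer also supports a createTableColumnTypes option that overrides the column types in the generated CREATE TABLE. A minimal sketch, reusing the connection variables from the question:

df.write \
  .format("jdbc") \
  .option("url", f"jdbc:redshift://{url_db}:5439/{db_name}") \
  .option("driver", "com.amazon.redshift.jdbc42.Driver") \
  .option("dbtable", f"{schema}.{tab}") \
  .option("user", user_db) \
  .option("password", pw) \
  .option("createTableColumnTypes", "column_type_bool BOOLEAN") \
  .mode("overwrite") \
  .save()

With this override Spark emits a native Redshift BOOLEAN column in the DDL instead of BIT, so the values can stay booleans end to end.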