In my script, PySpark's write method takes a DataFrame and writes it to Redshift. However, some DataFrames contain boolean columns, and writing them fails with an error stating that Redshift does not accept the bit data type.
My question is: why does a column that should be boolean end up as bit?
The code:
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("data_quality")
    .config("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
    .config("spark.jars", "redshift-jdbc42-2.1.0.3.jar")
    .config("spark.hadoop.fs.s3a.access.key", "key")
    .config("spark.hadoop.fs.s3a.secret.key", "secret_key")
    .config("spark.hadoop.fs.s3a.session.token", "token")
    .config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
    .getOrCreate()
)

df = spark.createDataFrame(
    [
        (1, False),  # create your data here; be consistent in the types
        (2, True),
    ],
    ["id", "column_type_bool"],  # add your column names here
)
df.show()
print(df.dtypes)

df.write \
    .format("jdbc") \
    .option("url", f"jdbc:redshift://{url_db}:5439/{db_name}") \
    .option("driver", "com.amazon.redshift.jdbc42.Driver") \
    .option("dbtable", f"{schema}.{tab}") \
    .option("user", user_db) \
    .option("password", pw) \
    .option("tempdir", "s3a://path") \
    .mode("overwrite") \
    .save()
The table schema:
root
 |-- namecolumn: boolean (nullable = true)
The error:
Py4JJavaError: An error occurred while calling o113.save.
: com.amazon.redshift.util.RedshiftException: ERROR: Column "nametable.namecolumn" has unsupported type "bit".
at com.amazon.redshift.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2601)
at com.amazon.redshift.core.v3.QueryExecutorImpl.processResultsOnThread(QueryExecutorImpl.java:2269)
at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1880)
at com.amazon.redshift.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:1872)
at com.amazon.redshift.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:368)
at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeInternal(RedshiftStatementImpl.java:514)
at com.amazon.redshift.jdbc.RedshiftStatementImpl.execute(RedshiftStatementImpl.java:435)
at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeWithFlags(RedshiftStatementImpl.java:376)
at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeCachedSql(RedshiftStatementImpl.java:362)
at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeWithFlags(RedshiftStatementImpl.java:339)
at com.amazon.redshift.jdbc.RedshiftStatementImpl.executeUpdate(RedshiftStatementImpl.java:297)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.executeStatement(JdbcUtils.scala:1026)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.createTable(JdbcUtils.scala:912)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:80)
This error occurs when Spark creates the target table: Spark's generic JDBC dialect maps BooleanType to the JDBC BIT type in the generated CREATE TABLE statement, and Redshift has no BIT type (its boolean type is BOOLEAN). One workaround is to cast the boolean columns to strings before writing (optionally upper-casing the values to TRUE/FALSE), so the column is created as a character type instead of BIT:
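A minimal sketch of that cast, using the column_type_bool column from the sample DataFrame above:

from pyspark.sql import functions as F

# Cast the boolean column to string so the generated CREATE TABLE uses a
# character type instead of BIT; upper() rewrites "true"/"false" as "TRUE"/"FALSE".
df = df.withColumn("column_type_bool", F.upper(F.col("column_type_bool").cast("string")))
print(df.dtypes)  # [('id', 'bigint'), ('column_type_bool', 'string')]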
After this, try writing the DataFrame again.
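Alternatively, if the Redshift column should stay a real boolean, Spark's JDBC writer accepts a createTableColumnTypes option that overrides the column type it puts in the generated CREATE TABLE statement. A sketch under the same assumptions as the snippets above (same placeholder connection variables and example column name; not verified against a live cluster):

(
    df.write.format("jdbc")
    .option("url", f"jdbc:redshift://{url_db}:5439/{db_name}")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .option("dbtable", f"{schema}.{tab}")
    .option("user", user_db)
    .option("password", pw)
    # Override Spark's default BooleanType -> BIT mapping in the CREATE TABLE DDL
    .option("createTableColumnTypes", "column_type_bool BOOLEAN")
    .mode("overwrite")
    .save()
)

With this option the column stays boolean in the DataFrame and Spark emits BOOLEAN instead of BIT when it creates the table.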