I have a DataFrame in Spark 3 with columns of the TDigest type from the isarn project. TDigest is a Spark UDT and serializes to Parquet without any problem, but I can't figure out how to save it into an external Hive table (other than writing the Parquet files into the table location and repartitioning, which colleagues say can be slower than df.write.insertInto(tableName)).
What I did:
Here is the Scala code for my case.
// imports assumed for the Spark 3 build of isarn-sketches-spark
import scala.util.Random.{nextInt, nextGaussian}
import org.isarnproject.sketches.spark.tdigest._
import spark.implicits._

val data = spark.createDataFrame(Vector.fill(1000) {
  (nextInt(10), nextGaussian)
})
val udf1 = TDigestAggregator.udf[Double](compression = 0.2, maxDiscrete = 25)
val agg = data.agg(udf1($"_1").as("td1"), udf1($"_2").as("td2"))
agg.printSchema()
// root
// |-- td1: tdigest (nullable = true)
// |-- td2: tdigest (nullable = true)
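For reference, each tdigest column is a Spark UDT whose underlying Catalyst sqlType appears to be exactly the struct used in the table DDL below. A minimal sketch to check that, using only Spark's public UserDefinedType API:

import org.apache.spark.sql.types.UserDefinedType

agg.schema.fields.foreach { f =>
  f.dataType match {
    case udt: UserDefinedType[_] =>
      // e.g. td1 -> struct<compression:double,maxDiscrete:int,cent:array<double>,mass:array<double>>
      println(s"${f.name} -> ${udt.sqlType.catalogString}")
    case dt =>
      println(s"${f.name} -> ${dt.catalogString}")
  }
}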
So when I call agg.write.insertInto("tableName") I get this exception:
- Cannot write 'td1': struct<compression:double,maxDiscrete:int,cent:array<double>,mass:array<double>> is incompatible with struct<compression:double,maxDiscrete:int,cent:array<double>,mass:array<double>>
The target table, tableName, was created with this DDL:
CREATE EXTERNAL TABLE tableName (
  td1 struct<compression:double,maxDiscrete:int,cent:array<double>,mass:array<double>>,
  td2 struct<compression:double,maxDiscrete:int,cent:array<double>,mass:array<double>>
)
STORED AS PARQUET
LOCATION 'someLocation'
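Comparing the two sides of the write makes the mismatch visible (same table name as in the DDL above; output abbreviated):

spark.table("tableName").printSchema()
// root
//  |-- td1: struct (nullable = true)
//  |    |-- compression: double (nullable = true)
//  ...
//  |-- td2: struct (nullable = true)
//  ...

agg.printSchema()
// root
//  |-- td1: tdigest (nullable = true)
//  |-- td2: tdigest (nullable = true)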
The root cause of this error appears to be the type-compatibility check in org.apache.spark.sql.types.DataType (around line 466 in Spark's source): the DataFrame column is the tdigest UDT while the table column is a plain struct, and the check seems not to unwrap the UDT to its sqlType, which is why the message prints the same struct on both sides. Is this a bug, or is there a way to handle it?
Workaround:
write the output as Parquet files directly into the table location (sketched below).
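A minimal sketch of that workaround, assuming 'someLocation' is the same path used in the DDL (save mode and repartition count are assumptions, not requirements):

agg
  .repartition(1)            // optional: control the number of output files
  .write
  .mode("append")            // or "overwrite", depending on the use case
  .parquet("someLocation")

// If reading the table back in the same Spark session, cached metadata may need a refresh:
// spark.catalog.refreshTable("tableName")

Since the table is external and unpartitioned, files written into its location should be picked up by Hive readers without any further DDL.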