I have a DataFrame in Spark 3 with columns of the TDigest type from the isarn project. TDigest is a Spark UDT and serializes to Parquet without any problem, but I can't figure out how to save it into an external Hive table (other than writing the Parquet files into the table location and repartitioning, which colleagues say can be slower than df.write.insertInto(tableName)).
What I did:
Here is the Scala code for my case.
// imports assumed for the Spark 3 build of isarn-sketches-spark
import scala.util.Random.{nextInt, nextGaussian}
import org.isarnproject.sketches.spark.tdigest._
import spark.implicits._

val data = spark.createDataFrame(Vector.fill(1000) {
  (nextInt(10), nextGaussian)
})
val udf1 = TDigestAggregator.udf[Double](compression = 0.2, maxDiscrete = 25)
val agg = data.agg(udf1($"_1").as("td1"), udf1($"_2").as("td2"))
agg.printSchema()
// root
// |-- td1: tdigest (nullable = true)
// |-- td2: tdigest (nullable = true)
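For reference, each tdigest column is a Spark UDT whose underlying Catalyst sqlType appears to be exactly the struct used in the table DDL below. A minimal sketch to check that, using only Spark's public UserDefinedType API:

import org.apache.spark.sql.types.UserDefinedType

agg.schema.fields.foreach { f =>
  f.dataType match {
    case udt: UserDefinedType[_] =>
      // e.g. td1 -> struct<compression:double,maxDiscrete:int,cent:array<double>,mass:array<double>>
      println(s"${f.name} -> ${udt.sqlType.catalogString}")
    case dt =>
      println(s"${f.name} -> ${dt.catalogString}")
  }
}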
So when I call agg.write.insertInto("tableName") I get this exception:
- Cannot write 'td1': struct<compression:double,maxDiscrete:int,cent:array<double>,mass:array<double>> is incompatible with struct<compression:double,maxDiscrete:int,cent:array<double>,mass:array<double>>
The target table, tableName, was created with this DDL:
CREATE EXTERNAL TABLE tableName (
  td1 struct<compression:double,maxDiscrete:int,cent:array<double>,mass:array<double>>,
  td2 struct<compression:double,maxDiscrete:int,cent:array<double>,mass:array<double>>
)
STORED AS PARQUET
LOCATION 'someLocation'
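Comparing the two sides of the write makes the mismatch visible (same table name as in the DDL above; output abbreviated):

spark.table("tableName").printSchema()
// root
//  |-- td1: struct (nullable = true)
//  |    |-- compression: double (nullable = true)
//  ...
//  |-- td2: struct (nullable = true)
//  ...

agg.printSchema()
// root
//  |-- td1: tdigest (nullable = true)
//  |-- td2: tdigest (nullable = true)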
The root cause of this error appears to be the type-compatibility check in org.apache.spark.sql.types.DataType (around line 466 in Spark's source): the DataFrame column is the tdigest UDT while the table column is a plain struct, and the check seems not to unwrap the UDT to its sqlType, which is why the message prints the same struct on both sides. Is this a bug, or is there a way to handle it?
Workaround:
write the output as Parquet files directly into the table location (sketched below).
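A minimal sketch of that workaround, assuming 'someLocation' is the same path used in the DDL (save mode and repartition count are assumptions, not requirements):

agg
  .repartition(1)            // optional: control the number of output files
  .write
  .mode("append")            // or "overwrite", depending on the use case
  .parquet("someLocation")

// If reading the table back in the same Spark session, cached metadata may need a refresh:
// spark.catalog.refreshTable("tableName")

Since the table is external and unpartitioned, files written into its location should be picked up by Hive readers without any further DDL.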