convert spark.sql.DataFrame to Array[Array[Double]]


I'm working in Spark and, in order to use the Matrix class of the Jama library, I need to convert the content of a spark.sql.DataFrame to a 2D array, i.e., Array[Array[Double]].

While I've found quite a few solutions for converting a single column of a dataframe to an array, I don't understand how to

  1. transform an entire dataframe into a 2D array (that is, an array of arrays);
  2. cast its content from long to Double while doing so.

The reason for that is that I need to load the content of a dataframe into a Jama matrix, which requires a 2D array of Doubles as input:

val matrix_transport = new Matrix(df_transport)

<console>:83: error: type mismatch;
 found   : org.apache.spark.sql.DataFrame
    (which expands to)  org.apache.spark.sql.Dataset[org.apache.spark.sql.Row]
 required: Array[Array[Double]]
       val matrix_transport = new Matrix(df_transport)

EDIT: for completeness, the df schema is:

df_transport.printSchema

root
 |-- 1_51501_19962: long (nullable = true)
 |-- 1_51501_26708: long (nullable = true)
 |-- 1_51501_36708: long (nullable = true)
 |-- 1_51501_6708: long (nullable = true)
...

with 165 columns of identical type long.
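EDIT 2: as a sketch of the casting part (point 2 above), I think the cast can be done on the DataFrame side before collecting anything (df_double is just a name I made up here), but that still leaves point 1 open:

import org.apache.spark.sql.functions.col

// Cast all 165 long columns to double up front (sketch, not verified on the real data)
val df_double = df_transport.select(df_transport.columns.map(c => col(c).cast("double")): _*)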


1 Answer

Ryan Widmaier (BEST ANSWER)

Here is rough code to do it. That said, I don't think Spark provides any guarantee about the order in which it returns rows, so building the matrix from data that is distributed across the cluster may run into ordering issues.

import org.apache.spark.sql.functions.{array, col}
import spark.implicits._  // spark is your SparkSession; already in scope in spark-shell

val df = Seq(
    (10L, 11L, 12L),
    (13L, 14L, 15L),
    (16L, 17L, 18L)
).toDF("c1", "c2", "c3")

// Group all columns into a single array column
val rowDF = df.select(array(df.columns.map(col): _*) as "row")

// Pull the data back to the driver and convert each Row to an Array[Long]
val mat = rowDF.collect.map(_.getSeq[Long](0).toArray)

// Cast the Long values to Double
val matDouble = mat.map(_.map(_.toDouble))
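
From there, assuming the Jama jar is on your classpath, matDouble should be usable directly with the Matrix constructor (a sketch of the final step, not something I've run against your 165-column data):

import Jama.Matrix

// Array[Array[Double]] maps to double[][], which is what the Jama constructor expects
val matrix_transport = new Matrix(matDouble)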