Apache Spark ORC read performance when reading a large number of small files


When reading a large number of ORC files from a directory on HDFS, Spark doesn't launch any tasks for quite a while, and I don't see any tasks running during that period. I'm using the command below to read the ORC files, together with these spark.sql configs.

What is Spark doing under the hood when spark.read.orc is issued?

spark.read.schema(schema1).orc("hdfs://test1").filter("date >= 20181001")
"spark.sql.orc.enabled": "true",
"spark.sql.orc.filterPushdown": "true"
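Before any tasks appear, the driver is typically doing two things: listing all the leaf files under the path, and (if no schema was supplied) reading ORC footers to infer one. With many small files both steps can take a long time. A minimal PySpark sketch of how this might be set up (the path, column names, and schema are made up for illustration; supplying the schema avoids the inference pass, and the parallel-discovery threshold lets Spark list files with a distributed job instead of serially on the driver):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, StringType, StructField, StructType

spark = (
    SparkSession.builder
    .appName("orc-small-files")
    # Push ORC predicates (e.g. date >= 20181001) down into the reader.
    .config("spark.sql.orc.filterPushdown", "true")
    # Once a path has more than this many entries, list leaf files with a
    # distributed Spark job rather than sequentially on the driver.
    .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
    .getOrCreate()
)

# Hypothetical schema for illustration; passing it explicitly skips the
# driver-side schema-inference pass over the ORC footers.
schema1 = StructType([
    StructField("date", IntegerType()),
    StructField("value", StringType()),
])

df = (
    spark.read.schema(schema1)
    .orc("hdfs://test1")          # directory containing many small ORC files
    .filter("date >= 20181001")
)
```

Even with an explicit schema, the file-listing step still scales with the number of files, which is why the delay before the first task grows with small-file count.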

I also tried running a Hive query on the same dataset instead of reading the ORC files directly, but I was not able to push the filter predicate down. Where should I set the configs below? "hive.optimize.ppd": "true", "hive.optimize.ppd.storage": "true"
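Those two are Hive settings rather than Spark SQL ones, so they belong either in hive-site.xml (a permanent default for Hive and for Spark sessions with Hive support) or as per-session SET statements. A hedged sketch of the session-level approach, assuming the query runs through a Hive-enabled SparkSession (the table name is hypothetical):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-ppd")
    .enableHiveSupport()   # requires Hive support on the classpath
    .getOrCreate()
)

# Per-session overrides; putting the same properties in hive-site.xml
# makes them the default for every session.
spark.sql("SET hive.optimize.ppd=true")
spark.sql("SET hive.optimize.ppd.storage=true")

# Hypothetical table name for illustration.
df = spark.sql("SELECT * FROM test1_table WHERE date >= 20181001")
```

The same SET statements work in the Hive CLI or a beeline session if the query is run through HiveServer2 instead of Spark.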

What is the best way to read ORC files from HDFS, and which parameters should be tuned?
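For the small-files case specifically, one common approach is to let Spark pack many small files into each input partition and, longer term, to compact the directory into fewer, larger files. A sketch under those assumptions (the option names are standard Spark SQL settings, but the values and paths here are illustrative, not recommendations):

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("orc-tuning")
    # Target bytes per input partition, and the estimated cost charged for
    # each additional file opened; together they control how many small
    # files get packed into one partition.
    .config("spark.sql.files.maxPartitionBytes", str(128 * 1024 * 1024))
    .config("spark.sql.files.openCostInBytes", str(4 * 1024 * 1024))
    .getOrCreate()
)

df = spark.read.orc("hdfs://test1")

# Compacting once into fewer, larger ORC files is usually the most
# effective long-term fix; the output path and partition count are
# illustrative.
df.coalesce(16).write.mode("overwrite").orc("hdfs://test1_compacted")
```

Compaction addresses the root cause: both the driver-side file listing and per-file open overhead scale with file count, so fewer larger files help every subsequent read.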


There are 0 answers