Environment
I am using Spark v2.4.4 via the Python API.
Problem
According to the Spark documentation, I can force Spark to download all of the Hive jars needed to interact with my Hive metastore by setting the following configuration:
spark.sql.hive.metastore.version=${my_version}
spark.sql.hive.metastore.jars=maven
However, when I run the following Python code, no jar files are downloaded from Maven.
from pyspark.sql import SparkSession
from pyspark import SparkConf

conf = (
    SparkConf()
    .setAppName("myapp")
    .set("spark.sql.hive.metastore.version", "2.3.3")
    .set("spark.sql.hive.metastore.jars", "maven")
)

spark = (
    SparkSession
    .builder
    .config(conf=conf)
    .enableHiveSupport()
    .getOrCreate()
)
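To rule out getOrCreate() silently reusing an existing session (which would discard the settings above), the configuration can be read back from the session. This is a sanity check of my own, not something the documentation calls for:

# If an earlier SparkSession (e.g. from a pyspark shell or notebook) already
# exists, getOrCreate() returns it and the configs above are silently ignored.
print(spark.conf.get("spark.sql.hive.metastore.version"))  # expect "2.3.3"
print(spark.conf.get("spark.sql.hive.metastore.jars"))     # expect "maven"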
How do I know that no jar files are downloaded?
- I have configured logLevel=INFO as the default by setting
  log4j.logger.org.apache.spark.api.python.PythonGatewayServer=INFO
  in $SPARK_HOME/conf/log4j.properties. I can see no logging that suggests Spark is interacting with Maven, even though, according to the Spark source, I should see an INFO-level log when the jars are downloaded. (A runtime cross-check follows this list.)
- Even if my logging were somehow broken, the SparkSession object simply builds too quickly to be pulling large jars from Maven: it returns in under 5 seconds. If I instead manually add the Maven coordinates of the Hive metastore to spark.jars.packages, the download takes minutes.
- I have deleted the ~/.ivy2 and ~/.m2 directories to rule out cached downloads from previous runs.
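As a runtime cross-check that does not depend on log4j.properties, the log level can also be raised from Python. This assumes the download activity is routed through Spark's normal logging, which I have not verified:

# Raise driver log verbosity for the current session; any metastore-jar
# download activity logged at INFO should then appear on the console.
spark.sparkContext.setLogLevel("INFO")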
Other tests
- I have also tried the same code on a Spark 3.0.0 cluster, and it doesn't work there either.
Can anyone spot what I'm doing wrong? Or is this option just broken?
For anyone else trying to solve this: building the SparkSession is not enough on its own; the metastore client (and therefore the jar download) is only initialized once Hive is actually accessed, for example via

spark.catalog.listDatabases()
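A minimal end-to-end version of that check, as a sketch (spark.sql("SHOW DATABASES") is assumed here to be an equivalent trigger):

from pyspark.sql import SparkSession

# Build the session with the metastore settings from the question.
spark = (
    SparkSession.builder
    .appName("myapp")
    .config("spark.sql.hive.metastore.version", "2.3.3")
    .config("spark.sql.hive.metastore.jars", "maven")
    .enableHiveSupport()
    .getOrCreate()
)

# Building the session does not touch the metastore. The Hive client (and
# with it the Maven download) is only created on first metastore access:
spark.catalog.listDatabases()          # forces client initialization
# spark.sql("SHOW DATABASES").show()   # assumed equivalent trigger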