Spark: parallelize HDFS URLs with data locality awareness


I have a list of HDFS zip file URLs, and I want to open each file inside an RDD map function instead of using the binaryFiles function.

Initially, I tried something like this:

    def unzip(hdfs_url):
        # read and unpack the HDFS file using an HDFS Python client
        ...

    rdd = spark.sparkContext.parallelize(list_of_hdfs_urls, 16)  # split the URL list into 16 partitions
    mapped = rdd.map(lambda url: unzip(url))
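
For concreteness, the unzip stub above could be filled in roughly as below. This is only a minimal sketch: the choice of pyarrow as the HDFS client, and the assumption that libhdfs plus the Hadoop configuration are available on every executor, are mine and not part of the original setup.

    import io
    import zipfile
    import pyarrow.fs as pafs

    def unzip(hdfs_url):
        # resolve the filesystem and path from a full hdfs:// URL on the executor
        fs, path = pafs.FileSystem.from_uri(hdfs_url)
        with fs.open_input_stream(path) as stream:
            payload = stream.read()
        # unpack in memory and return member name -> uncompressed bytes
        with zipfile.ZipFile(io.BytesIO(payload)) as archive:
            return {name: archive.read(name) for name in archive.namelist()}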

But later I realized that this wouldn't give data locality, even though it runs in parallel across the cluster.

Is there any way to run the map function for a file URL x on the node where HDFS file x is located? How can I make Spark aware of this locality?
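
As a first step, I can at least find out which datanodes hold each file by asking the NameNode for its block locations. Here is a rough sketch using the WebHDFS REST API and its GETFILEBLOCKLOCATIONS operation (supported in newer Hadoop releases); the NameNode address and the exact JSON field names are assumptions and may differ on other clusters:

    import requests

    def block_hosts(namenode_http, path):
        # ask the NameNode (WebHDFS) which datanodes store the blocks of `path`
        url = "http://{}/webhdfs/v1{}".format(namenode_http, path)
        resp = requests.get(url, params={"op": "GETFILEBLOCKLOCATIONS"})
        resp.raise_for_status()
        blocks = resp.json()["BlockLocations"]["BlockLocation"]
        # every host that stores at least one block of the file
        return sorted({host for block in blocks for host in block["hosts"]})

    # hypothetical usage: block_hosts("namenode:9870", "/data/zips/part-0001.zip")

Knowing the hosts per file does not by itself make parallelize schedule the task there, which is exactly the part I am asking about.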

I want to read the zip files this way to get better performance in PySpark, since it avoids serializing and deserializing the file contents between the Python and Java processes on each executor.
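
For comparison, the binaryFiles route I am trying to avoid would look roughly like this (the path is just a placeholder); Spark schedules it with locality, but the raw bytes are read in the JVM and then shipped to the Python workers:

    # baseline for comparison: locality-aware, but file bytes cross the JVM -> Python boundary
    pairs = spark.sparkContext.binaryFiles("hdfs:///data/zips/*.zip")  # RDD of (path, file_bytes)
    sizes = pairs.map(lambda kv: (kv[0], len(kv[1])))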

