I see there are hdfs3, snakebite, and some others. Which one is the best supported and most comprehensive?
What's the best module for interacting with HDFS with Python3?
28.5k views. Asked by Farhat. There are 5 answers.
On
I have tried snakebite, hdfs3 and hdfs.
Snakebite supports only downloads (no uploads), so it was a no-go for me.
Of these three, only hdfs3 supports an HA setup, so it was my first choice. However, I didn't manage to make it work with multihomed networks using datanode hostnames (the problem is described here: https://rainerpeter.wordpress.com/2014/02/12/connect-to-hdfs-running-in-ec2-using-public-ip-addresses/).
So I ended up using hdfs (2.0.16), as it supports uploads. I had to add a workaround using bash to support HA.
PS. There's an interesting article comparing Python libraries for interacting with the Hadoop File System at http://wesmckinney.com/blog/python-hdfs-interfaces/
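To illustrate the kind of HA workaround mentioned above, here is a minimal sketch in Python rather than bash. It is not the code I actually used: the helper names (`first_active_namenode`, `make_hdfs_probe`), the namenode URLs, the user name, and the file paths are all hypothetical, and it assumes a standby namenode makes the WebHDFS status call fail. The idea is simply to try each namenode in turn and keep the first client that answers.

```python
from typing import Callable, Iterable, Tuple, TypeVar

T = TypeVar("T")


def first_active_namenode(urls: Iterable[str], probe: Callable[[str], T]) -> Tuple[str, T]:
    """Return (url, probe result) for the first namenode URL whose probe succeeds."""
    errors = []
    for url in urls:
        try:
            return url, probe(url)
        except Exception as exc:  # assumed: standby or unreachable namenode raises
            errors.append(f"{url}: {exc}")
    raise ConnectionError("no active namenode found: " + "; ".join(errors))


def make_hdfs_probe(user: str) -> Callable[[str], object]:
    """Build a probe backed by the `hdfs` package (pip install hdfs)."""
    # Imported lazily so first_active_namenode itself has no third-party dependency.
    from hdfs import InsecureClient

    def probe(url: str):
        client = InsecureClient(url, user=user)
        client.status("/")  # fails if this namenode cannot serve requests
        return client

    return probe


# Example usage (placeholder hosts, user, and paths):
# url, client = first_active_namenode(
#     ["http://namenode1:50070", "http://namenode2:50070"],
#     make_hdfs_probe("hadoop"),
# )
# client.upload("/data/remote.csv", "local.csv")  # hdfs_path first, then local_path
```

Passing the probe in as a function keeps the failover logic independent of the client library, so the same loop could wrap a different client if needed.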
On
pyarrow, the Python implementation of Apache Arrow, has a well-maintained and documented HDFS client: https://arrow.apache.org/docs/python/filesystems.html
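For a rough idea of what using it looks like, here is a small sketch based on the `pyarrow.fs.HadoopFileSystem` interface. The host, port, user, and paths are placeholders, the helper names `connect` and `upload_file` are made up for this example, and it assumes pyarrow can find a working libhdfs and Hadoop configuration on your machine.

```python
import shutil


def connect(host: str, port: int = 8020, user=None):
    """Connect to HDFS through pyarrow (requires pyarrow plus libhdfs)."""
    # Imported lazily so upload_file below works with any pyarrow-style filesystem.
    from pyarrow import fs as pafs

    return pafs.HadoopFileSystem(host, port, user=user)


def upload_file(fs, local_path: str, remote_path: str) -> None:
    """Stream a local file to remote_path on a pyarrow-style filesystem."""
    with open(local_path, "rb") as src, fs.open_output_stream(remote_path) as dst:
        shutil.copyfileobj(src, dst)  # copies in chunks, so large files are fine


# Example usage (placeholder host and paths):
# hdfs = connect("namenode.example.com", 8020, user="hadoop")
# upload_file(hdfs, "local.csv", "/data/remote.csv")
# with hdfs.open_input_stream("/data/remote.csv") as f:
#     print(f.read(100))
```

Because `upload_file` only relies on `open_output_stream`, it should also work with other pyarrow filesystems such as `LocalFileSystem`, which is handy for trying things out without a cluster.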
On
I found pyhdfs-client really good for large files. (A file that took 1 hour with WebHDFS was loaded in 18 minutes with it.)
pip install pyhdfs-client
The only downside is that it's new and its interface is not as clean as those of other HDFS clients. Documentation is missing; however, you can check usage here: https://pypi.org/project/pyhdfs-client/
As far as I know, there are not as many possibilities as one may think. But I'd suggest the official Python package
hdfs 2.0.12, which can be downloaded from the website or from the terminal by running:
Some of the features: