I am using the following Python code to upload a file to a remote HDFS from my local system using pyhdfs:
from pyhdfs import HdfsClient
client = HdfsClient(hosts='1.1.1.1',user_name='root')
client.mkdirs('/jarvis')
client.copy_from_local('/my/local/file', '/hdfs/path')
I am using Python 3.5. Hadoop is running on the default port 50070, and 1.1.1.1 is my remote Hadoop URL.
Creating dir "jarvis" is working fine, but copying a file is not working. I am getting the following error
Traceback (most recent call last):
  File "test_hdfs_upload.py", line 14, in <module>
    client.copy_from_local('/tmp/data.json','/test.json')
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyhdfs.py", line 753, in copy_from_local
    self.create(dest, f, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/pyhdfs.py", line 426, in create
    metadata_response.headers['location'], data=data, **self._requests_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 99, in put
    return request('put', url, data=data, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/api.py", line 44, in request
    return session.request(method=method, url=url, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 383, in request
    resp = self.send(prep, **send_kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/sessions.py", line 486, in send
    r = adapter.send(request, **kwargs)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages/requests/adapters.py", line 378, in send
    raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='ip-1-1-1-1', port=50075): Max retries exceeded with url: /webhdfs/v1/test.json?op=CREATE&user.name=root&namenoderpcaddress=ip-1-1-1-1:9000&overwrite=false (Caused by : [Errno 8] nodename nor servname provided, or not known)
First, check whether WebHDFS is enabled for your HDFS cluster. The PyHDFS library talks to HDFS over WebHDFS, so WebHDFS must be enabled in the HDFS configuration. To enable it, modify hdfs-site.xml as follows:
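<!-- add inside the <configuration> element of hdfs-site.xml -->
<property>
    <name>dfs.webhdfs.enabled</name>
    <value>true</value>
</property>

Restart the HDFS daemons after changing the configuration so the setting takes effect.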
Also, when the copy_from_local() API call is made through PyHDFS, the HDFS namenode picks a datanode from the cluster to receive the data, and it may return only the domain name (hostname) associated with that node. The client then attempts an HTTP connection to that domain to perform the operation. This is where it fails, because that domain name cannot be resolved by your host.
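You can confirm the resolution failure with a quick check from your machine. This is just a minimal sketch, using the hostname shown in your traceback:

import socket

# Hostname taken from the error in the traceback; replace it with whatever your error reports.
datanode_host = 'ip-1-1-1-1'

try:
    print(socket.gethostbyname(datanode_host))
except socket.gaierror as e:
    # Same "nodename nor servname provided, or not known" failure that requests surfaces.
    print('cannot resolve %s: %s' % (datanode_host, e))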
To resolve the domains, you will need to add the appropriate mappings to your /etc/hosts file. For instance, if you have an HDFS cluster with a namenode and 2 datanodes, with the following IP addresses and hostnames:
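(the hostnames and IP addresses below are placeholders for illustration; use your cluster's actual values)

namenode:    hadoop-namenode    192.168.0.1
datanode 1:  hadoop-datanode-1  192.168.0.2
datanode 2:  hadoop-datanode-2  192.168.0.3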
you will need to update your /etc/hosts file as follows:
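192.168.0.1    hadoop-namenode
192.168.0.2    hadoop-datanode-1
192.168.0.3    hadoop-datanode-2

In your case, the entries must include the hostname reported in the error (ip-1-1-1-1) mapped to the IP address at which that datanode is reachable from your machine.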
This will enable domain name resolution from your host to your HDFS cluster, and you will then be able to make WebHDFS API calls through PyHDFS.