HdfsCLI: API and Command Line Interface for HDFS
HdfsCLI provides a Python API and command-line interface for interacting with Hadoop HDFS via the WebHDFS (and HttpFS) REST API. It supports both secure and insecure clusters and offers Python 3 bindings for common HDFS operations. Optional extensions add support for Avro files, Pandas DataFrames, and Kerberos authentication. The most recent release at the time of writing is 2.7.3 (October 12, 2023).
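Under the hood, every client call maps onto a WebHDFS v1 REST request. As a rough sketch of what the library wraps (the host, port, and user below are illustrative, and the helper function is not part of HdfsCLI's API):

```python
from urllib.parse import urlencode

def webhdfs_url(namenode, hdfs_path, op, **params):
    """Build the WebHDFS v1 REST URL corresponding to an HDFS operation."""
    query = urlencode({'op': op, **params})
    return f"{namenode}/webhdfs/v1{hdfs_path}?{query}"

# A directory listing as user 'guest' (illustrative host/port):
url = webhdfs_url('http://localhost:50070', '/user/guest', 'LISTSTATUS',
                  **{'user.name': 'guest'})
print(url)  # http://localhost:50070/webhdfs/v1/user/guest?op=LISTSTATUS&user.name=guest
```

HdfsCLI issues these requests for you and handles redirects to datanodes, but knowing the URL shape helps when debugging connectivity with `curl`.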
Warnings
- breaking HdfsCLI versions 2.x and above have dropped official support for Python 2.x; Python 3.7+ is required.
- gotcha By default, `client.write()` will raise an `HdfsError` if trying to write to an existing path. To overwrite an existing file, you must explicitly set `overwrite=True`.
- gotcha Deleting a non-empty directory without `recursive=True` will raise an `HdfsError`. This is a safety mechanism.
- gotcha Using `Client.from_alias()` relies on a configuration file (default: `~/.hdfscli.cfg`) which defines cluster connection details. Without proper configuration, this method will fail.
- gotcha The `KerberosClient` requires the `hdfs[kerberos]` extra to be installed and proper Kerberos configuration on the client machine and HDFS cluster. Misconfiguration often leads to authentication errors.
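For `Client.from_alias()`, a minimal `~/.hdfscli.cfg` might look like the following (the alias name, URL, and user are illustrative; alias sections also accept other keyword arguments for the chosen client class):

```ini
[global]
default.alias = dev

[dev.alias]
url = http://localhost:50070
user = guest
client = InsecureClient
```

With this in place, `Client.from_alias('dev')` (or `Client.from_alias()`, since a default is set) constructs the configured client.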
Install
- pip install hdfs
- pip install hdfs[avro,dataframe,kerberos]
Imports
- InsecureClient
from hdfs.client import InsecureClient
- Client
from hdfs.client import Client
- TokenClient
from hdfs.client import TokenClient
- KerberosClient
from hdfs.ext.kerberos import KerberosClient
Quickstart
import os

from hdfs.client import InsecureClient

# Hadoop 2.x serves WebHDFS on port 50070 by default; Hadoop 3.x uses 9870.
HDFS_NAMENODE_URL = os.environ.get('HDFS_NAMENODE_URL', 'http://localhost:50070')
HDFS_USER = os.environ.get('HDFS_USER', 'guest')  # Or a specific HDFS user

try:
    client = InsecureClient(HDFS_NAMENODE_URL, user=HDFS_USER)
    print(f"Connected to HDFS at {HDFS_NAMENODE_URL} as user {HDFS_USER}")

    # Example: create (or overwrite) a file
    hdfs_path = '/user/temp/my_test_file.txt'
    with client.write(hdfs_path, encoding='utf-8', overwrite=True) as writer:
        writer.write('Hello, HdfsCLI world!')
    print(f"Successfully wrote to {hdfs_path}")

    # Example: list the contents of the parent directory
    parent_dir = os.path.dirname(hdfs_path) or '/'  # handle root edge case
    print(f"Contents of {parent_dir}:")
    for item in client.list(parent_dir):
        print(f"- {item}")

    # Example: read the file back
    with client.read(hdfs_path, encoding='utf-8') as reader:
        read_data = reader.read()
    print(f"Read from {hdfs_path}: {read_data}")

    # Example: delete the file
    client.delete(hdfs_path)
    print(f"Successfully deleted {hdfs_path}")
except Exception as e:
    print(f"An error occurred: {e}")
    print("Ensure HDFS is running and HDFS_NAMENODE_URL/HDFS_USER are set correctly.")