HdfsCLI: API and Command Line Interface for HDFS

2.7.3 · active · verified Thu Apr 09

HdfsCLI provides a Python API and command-line interface for interacting with Hadoop HDFS via the WebHDFS (and HttpFS) API. It supports both secure and insecure clusters, offering Python 3 bindings for common HDFS operations. The library includes optional extensions for handling Avro files, Pandas DataFrames, and Kerberos authentication. The current version, 2.7.3, was released on October 12, 2023, indicating active maintenance.

Warnings

Install
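HdfsCLI is published on PyPI under the package name `hdfs`. The optional Avro, Pandas, and Kerberos integrations are packaged as extras; the extra names below follow the package's documented conventions.

```shell
# Base install from PyPI.
pip install hdfs

# With optional extras (Avro files, Pandas DataFrames, Kerberos auth):
pip install 'hdfs[avro,dataframe,kerberos]'
```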

Imports

Quickstart

This quickstart demonstrates how to connect to an HDFS NameNode using `InsecureClient`, write a small file, list the directory's contents, read the file back, and delete it. The NameNode URL and user are taken from environment variables for flexibility. Ensure your HDFS cluster is running and reachable at the specified URL.

import os
from hdfs.client import InsecureClient

# WebHDFS listens on port 50070 by default on Hadoop 2.x; Hadoop 3.x uses 9870.
HDFS_NAMENODE_URL = os.environ.get('HDFS_NAMENODE_URL', 'http://localhost:50070')
HDFS_USER = os.environ.get('HDFS_USER', 'guest')  # or a specific HDFS user

try:
    client = InsecureClient(HDFS_NAMENODE_URL, user=HDFS_USER)
    print(f"Connected to HDFS at {HDFS_NAMENODE_URL} as user {HDFS_USER}")

    # Example: Create a file
    hdfs_path = '/user/temp/my_test_file.txt'
    data = 'Hello, HdfsCLI world!'
    # Passing encoding makes the writer accept str directly.
    with client.write(hdfs_path, encoding='utf-8', overwrite=True) as writer:
        writer.write(data)
    print(f"Successfully wrote to {hdfs_path}")

    # Example: List contents of a directory
    parent_dir = os.path.dirname(hdfs_path) or '/'  # fall back to root for top-level paths
    print(f"Contents of {parent_dir}:")
    for item in client.list(parent_dir):
        print(f"- {item}")
    
    # Example: Read the file back
    with client.read(hdfs_path, encoding='utf-8') as reader:
        read_data = reader.read()
    print(f"Read from {hdfs_path}: {read_data}")

    # Example: Delete the file
    client.delete(hdfs_path)
    print(f"Successfully deleted {hdfs_path}")

except Exception as e:
    print(f"An error occurred: {e}")
    print("Please ensure HDFS is running and HDFS_NAMENODE_URL/HDFS_USER are correctly configured.")
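Beyond the Python API, installing the package also provides the `hdfscli` command-line tool, which reads cluster aliases from a `~/.hdfscli.cfg` file. The sketch below assumes an alias named `dev` pointing at the same NameNode used above; adjust the URL and user for your cluster.

```shell
# ~/.hdfscli.cfg -- define an alias for your cluster:
#   [global]
#   default.alias = dev
#
#   [dev.alias]
#   url = http://localhost:50070
#   user = guest

# Upload a local file, then download it back, using the alias:
hdfscli upload --alias=dev local_file.txt /user/temp/
hdfscli download --alias=dev /user/temp/local_file.txt copy.txt

# Start a Python shell with a preconfigured client bound to the alias:
hdfscli interactive --alias=dev
```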
