Snakebite-py3: Pure Python HDFS Client
snakebite-py3 is a Python library that provides a pure Python client for the Hadoop Distributed File System (HDFS). It communicates directly with the HDFS NameNode using protobuf messages and implements the Hadoop RPC protocol, offering a native Python alternative to calling Java-based `hadoop fs` commands. This fork, maintained by the Internet Archive, specifically targets Python 3 compatibility. The current version is 3.0.6, released in February 2025.
Warnings
- breaking The `snakebite-py3` library is a Python 3 fork of the original `snakebite`, which was Python 2 only. Projects migrating from `snakebite` must switch to `snakebite-py3` and ensure their codebase is Python 3 compatible. Attempting to use the original `snakebite` in Python 3 environments will result in errors.
- gotcha Many methods of `snakebite.client.Client` (e.g., `ls`, `mkdir`, `delete`) return generators. The HDFS operation runs only when the generator is consumed, for example by iterating over it in a `for` loop or passing it to `list()`. If the generator is never consumed, the operation is never performed.
- gotcha Unlike the standard Hadoop client, `snakebite-py3` disables CRC (Cyclic Redundancy Check) for data transfers by default to improve performance. This means data integrity is not verified during transfer unless explicitly enabled.
- gotcha `snakebite-py3` has primarily been tested with specific Hadoop distributions like CDH5 and supports Hadoop 2.2.0+ (protocol version 9). Compatibility with newer Hadoop versions or different distributions might vary and may require specifying the `hadoop_version` parameter in the `Client` constructor.
- gotcha The `Client` constructor parameter `use_trash` defaults to `False` and is usually left that way in examples, so deleted files are removed permanently instead of being moved to the HDFS trash. Pass `use_trash=True` if deletions should go to the trash.
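The generator and CRC warnings above can be illustrated with two small helpers. This is a minimal sketch under stated assumptions, not part of snakebite-py3: `ensure_dir` and `read_verified` are hypothetical names, `client` is assumed to behave like a `snakebite.client.Client` instance, `mkdir` is assumed to yield dicts with a `result` key, and `cat` is assumed to yield one generator of byte chunks per matched file.

```python
def ensure_dir(client, path):
    """Create `path` (with parents) and report whether every mkdir succeeded.

    `client` is assumed to behave like snakebite.client.Client.
    """
    # mkdir() returns a generator; nothing is sent to the NameNode until
    # the generator is consumed, so list() is what triggers the operation.
    results = list(client.mkdir([path], create_parents=True))
    # Assumption: each result is a dict such as {'path': ..., 'result': True}.
    return bool(results) and all(r.get('result') for r in results)


def read_verified(client, path):
    """Read one HDFS file into memory with CRC checking enabled."""
    chunks = []
    # Assumption: cat() yields one inner generator per matched file, and each
    # inner generator yields the file's content in byte chunks.
    for file_chunks in client.cat([path], check_crc=True):
        chunks.extend(file_chunks)
    return b''.join(chunks)
```

Both helpers take the client as an argument, so they can be exercised with a stub object before being pointed at a real cluster. `read_verified` holds the whole file in memory, which is only reasonable for small files.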
Install
pip install snakebite-py3
Imports
- Client
from snakebite.client import Client
Quickstart
import os

from snakebite.client import Client

# Configure the HDFS NameNode host and port.
# The default HDFS RPC port is 8020.
host = os.environ.get('HDFS_NAMENODE_HOST', 'localhost')
port = int(os.environ.get('HDFS_NAMENODE_PORT', '8020'))

try:
    # Initialize the HDFS client.
    # With use_trash=False, deletions bypass the HDFS trash and are permanent.
    # Pass hadoop_version explicitly if the cluster does not use the default (9).
    client = Client(host, port, use_trash=False)
    print(f"Connected to HDFS NameNode at {host}:{port}")

    # Example: list the contents of the root directory.
    # ls() returns a generator; iterating over it performs the listing.
    print("Listing /:")
    for item in client.ls(['/']):
        print(item)

    # Example: create a directory if it does not exist yet.
    # test() is used for the existence check because ls() raises an error
    # for a missing path instead of returning an empty listing.
    test_dir = '/user/test_snakebite_py3'
    if not client.test(test_dir, exists=True):
        print(f"Creating directory {test_dir}")
        # list() consumes the generator so the mkdir is actually executed.
        list(client.mkdir([test_dir], create_parents=True))
    else:
        print(f"Directory {test_dir} already exists.")
except Exception as e:
    print(f"An error occurred: {e}")
    print("Please ensure your HDFS NameNode is running and accessible at the specified host and port.")
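The quickstart leaves its test directory behind. A hedged sketch of cleaning it up follows; `remove_dir` is a hypothetical helper name, `client` is assumed to behave like a `snakebite.client.Client` instance, and `delete` is assumed to yield dicts with a `result` key. With `use_trash=False`, as in the quickstart, the deletion is permanent.

```python
def remove_dir(client, path):
    """Recursively delete `path`; return False if it does not exist.

    `client` is assumed to behave like snakebite.client.Client.
    """
    # Check existence first so a missing path is reported as False
    # instead of surfacing as an exception from delete().
    if not client.test(path, exists=True):
        return False
    # delete() returns a generator; list() consumes it so the delete runs.
    results = list(client.delete([path], recurse=True))
    # Assumption: each result is a dict such as {'path': ..., 'result': True}.
    return bool(results) and all(r.get('result') for r in results)
```

Calling `remove_dir(client, '/user/test_snakebite_py3')` after the quickstart would remove the directory it created.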