{"id":5484,"library":"snakebite-py3","title":"Snakebite-py3: Pure Python HDFS Client","description":"snakebite-py3 is a Python library that provides a pure Python client for the Hadoop Distributed File System (HDFS). It communicates directly with the HDFS NameNode using protobuf messages and implements the Hadoop RPC protocol, offering a native Python alternative to calling Java-based `hadoop fs` commands. This fork, maintained by the Internet Archive, specifically targets Python 3 compatibility. The current version is 3.0.6, released in February 2025.","status":"active","version":"3.0.6","language":"en","source_language":"en","source_url":"https://github.com/internetarchive/snakebite-py3","tags":["HDFS","Hadoop","client","filesystem","big data","protobuf"],"install":[{"cmd":"pip install snakebite-py3","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Used for communication with the HDFS NameNode via RPC protocol.","package":"protobuf","optional":false},{"reason":"Required for Kerberos/SASL authentication, if enabled.","package":"pyasn1","optional":true},{"reason":"Required for Kerberos/SASL authentication, if enabled.","package":"pykerberos","optional":true}],"imports":[{"note":"The primary client class is located within the `snakebite.client` submodule. 
Directly importing from `snakebite` will fail.","wrong":"from snakebite import Client","symbol":"Client","correct":"from snakebite.client import Client"},{"note":"The package name on PyPI is `snakebite-py3`, but the internal import path remains `snakebite` for compatibility with the original project's API.","wrong":"from snakebite_py3.client import Client","symbol":"Client","correct":"from snakebite.client import Client"}],"quickstart":{"code":"import os\nfrom snakebite.client import Client\n\n# Configure HDFS NameNode host and port\n# Default HDFS RPC port is 8020\nhost = os.environ.get('HDFS_NAMENODE_HOST', 'localhost')\nport = int(os.environ.get('HDFS_NAMENODE_PORT', '8020'))\n\ntry:\n    # Initialize the HDFS client\n    # use_trash=False (the default) makes deletions permanent;\n    # pass use_trash=True to move deleted files to the HDFS trash instead\n    # Pass hadoop_version explicitly if the cluster does not use the default (9)\n    client = Client(host, port, use_trash=False)\n\n    print(f\"Connected to HDFS NameNode at {host}:{port}\")\n\n    # Example: List contents of the root directory\n    # ls() returns a generator; iterating over it triggers the RPC call\n    print(\"Listing /:\")\n    for item in client.ls(['/']):\n        print(item)\n\n    # Example: Create a directory\n    test_dir = '/user/test_snakebite_py3'\n    # test() returns a bool; ls() would raise on a missing path\n    # and yield nothing for an existing but empty directory\n    if not client.test(test_dir, exists=True):\n        print(f\"Creating directory {test_dir}\")\n        # mkdir() is lazy as well, so consume the generator with list()\n        list(client.mkdir([test_dir], create_parents=True))\n    else:\n        print(f\"Directory {test_dir} already exists.\")\n\nexcept Exception as e:\n    print(f\"An error occurred: {e}\")\n    print(\"Please ensure your HDFS NameNode is running and accessible at the specified host and port.\")\n","lang":"python","description":"This quickstart demonstrates how to establish a connection to an HDFS NameNode and perform basic file system operations like listing directories and creating a new directory. It uses environment variables for host and port for flexibility, defaulting to `localhost:8020`. 
Ensure your HDFS cluster is running and accessible from where you execute this code."},"warnings":[{"fix":"Install `snakebite-py3` and update import statements to `from snakebite.client import Client`. Review code for any Python 2-specific constructs.","message":"The `snakebite-py3` library is a Python 3 fork of the original `snakebite`, which was Python 2 only. Projects migrating from `snakebite` must switch to `snakebite-py3` and ensure their codebase is Python 3 compatible. Attempting to use the original `snakebite` in Python 3 environments will result in errors.","severity":"breaking","affected_versions":"< 3.0.0 (original snakebite)"},{"fix":"Always iterate over the returned generator or wrap it in `list()` to ensure the HDFS command executes: `list(client.mkdir(['/new_dir']))`.","message":"Many `Client` methods in `snakebite.client` (e.g., `ls`, `mkdir`, `delete`) return generators. The HDFS operation is executed only when the generator is consumed (e.g., by iterating over it with a `for` loop or passing it to `list()`). If the generator is never consumed, the operation is not performed.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For operations that should verify checksums (e.g., `cat`), pass `check_crc=True` as an argument to the method: `for chunk in client.cat(['/path/to/file'], check_crc=True): ...`","message":"Unlike the standard Hadoop client, `snakebite-py3` disables CRC (Cyclic Redundancy Check) verification for data transfers by default to improve performance. This means data integrity is not verified during transfer unless explicitly enabled.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If encountering connection issues or unexpected behavior, try explicitly setting the `hadoop_version` parameter: `client = Client(host, port, hadoop_version=some_version)`.","message":"`snakebite-py3` has primarily been tested with specific Hadoop distributions like CDH5 and supports Hadoop 2.2.0+ (protocol version 9). 
Compatibility with newer Hadoop versions or different distributions might vary and may require specifying the `hadoop_version` parameter in the `Client` constructor.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For safer deletion, explicitly set `use_trash=True` when initializing the client: `client = Client(host, port, use_trash=True)`.","message":"The `Client` constructor parameter `use_trash` defaults to `False` and is often left that way in examples, so deletions are permanent rather than moved to the HDFS trash.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-13T00:00:00.000Z","next_check":"2026-07-12T00:00:00.000Z"}