Python Hive Server 2 Client Driver
pyHS2 is a Python client driver designed for connecting to Hive Server 2. The project's last stable release was 0.6.0 in November 2014. It is no longer actively maintained, with the developer ceasing support in early 2016 and recommending alternative libraries.
Common errors
- ModuleNotFoundError: No module named 'cloudera'
  cause: This error typically occurs when running `pyhs2` with Python 3.x. `pyhs2` is not Python 3 compatible.
  fix: Use Python 2.7 for your project, or switch to a Python 3-compatible Hive client such as `PyHive` or `impyla`.
- thrift.transport.TTransport.TTransportException: Could not connect to localhost:10000
  cause: The Python client could not establish a network connection to the specified Hive Server 2 host and port. Likely reasons are that the server is not running, the IP address or hostname is wrong, a firewall is blocking traffic, or the port is incorrect.
  fix: Verify that Hive Server 2 is running and listening on the specified host and port (e.g., `localhost:10000`). Check firewall rules, network connectivity, and the `hive-site.xml` configuration for the correct HiveServer2 endpoint.
- fatal error: sasl/sasl.h: No such file or directory
  cause: During `pip install pyhs2`, the `sasl` dependency fails to compile because the system is missing the development headers for the Cyrus SASL library.
  fix: Install the system-level SASL development package. On Red Hat/CentOS/Fedora, run `sudo yum install cyrus-sasl-devel`. On Debian/Ubuntu, run `sudo apt-get update && sudo apt-get install libsasl2-dev`. Then retry `pip install pyhs2`.
- pyhs2.error.Pyhs2Exception: 'Error while processing statement: FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask'
  cause: This error often indicates a problem with the Hive query itself, with the permissions of the user executing the query, or with the underlying Hadoop/MapReduce job that Hive attempts to run.
  fix: First, run the exact same query directly in the Hive shell to determine whether the problem lies with `pyhs2` or with Hive and the query. If the query works in the Hive shell, check which user is passed to `pyhs2.connect()` and ensure that user has appropriate read/write access to the relevant tables and databases in Hive.
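Before digging into Thrift or SASL settings for a `Could not connect` error, it can help to confirm that the HiveServer2 port accepts TCP connections at all. The following is a minimal sketch using only the standard library. The helper name `can_reach` and the 3-second timeout are illustrative assumptions, not part of `pyhs2`.

```python
import socket


def can_reach(host, port, timeout=3.0):
    """Return True if a plain TCP connection to host:port succeeds."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except (socket.error, socket.timeout):
        # Covers refused connections, unresolved hostnames, and timeouts.
        return False
    sock.close()
    return True


if __name__ == "__main__":
    # localhost:10000 is the endpoint named in the error message above.
    print(can_reach("localhost", 10000))
```

A `True` result only shows that something is listening on the port. Authentication or Thrift protocol mismatches can still cause the `pyhs2` connection itself to fail.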
Warnings
- breaking `pyhs2` does NOT support Python 3.x. It is designed for Python 2.x, and attempts to use it with Python 3.x will result in `ModuleNotFoundError` or other incompatibilities due to underlying dependencies like `sasl` and `cStringIO`.
- deprecated The `pyhs2` library is no longer maintained. The last release was in 2014, and the developer officially stated that maintenance ceased in January 2016, recommending alternatives.
- gotcha Installing `pyhs2` often fails due to missing system-level development headers for `cyrus-sasl`. The `sasl` Python package, a dependency, requires these headers to compile.
- gotcha Connection issues (`TTransport.TTransportException: Could not connect`) are frequently caused by incorrect host, port, authentication mechanism, user, or password. Hive Server 2 typically runs on port 10000.
- gotcha Fetching large result sets can appear to hang or be inefficient. Calling `fetchone()` in a loop while checking `hasMoreRows` is recommended over simply iterating `cur.fetch()`, because it lets you process rows incrementally and stop early.
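The `fetchone()` and `hasMoreRows` pattern from the last warning could be wrapped in a small generator, sketched below. It assumes that `hasMoreRows` is a boolean attribute on the cursor and that `fetchone()` returns `None` once the result set is exhausted. `iter_rows` and `FakeCursor` are hypothetical names introduced for illustration, and `FakeCursor` is only a stand-in so the example runs without a Hive server.

```python
def iter_rows(cur):
    """Yield rows one at a time from a pyhs2-style cursor.

    Assumes cur.hasMoreRows is a boolean attribute and that
    cur.fetchone() returns None when no rows remain.
    """
    while cur.hasMoreRows:
        row = cur.fetchone()
        if row is None:
            break
        yield row


class FakeCursor(object):
    """Hypothetical stand-in for a pyhs2 cursor, used only for this demo."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.hasMoreRows = True

    def fetchone(self):
        if not self._rows:
            self.hasMoreRows = False
            return None
        return self._rows.pop(0)


if __name__ == "__main__":
    for row in iter_rows(FakeCursor([[1, "a"], [2, "b"]])):
        print(row)
```

With a real cursor, the same loop would be written as `for row in iter_rows(cur):` after `cur.execute(...)`, which allows breaking out early instead of waiting for every row.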
Install
- pip install pyhs2
- sudo yum install cyrus-sasl-devel  # For Red Hat/CentOS/Fedora
  sudo apt-get update && sudo apt-get install libsasl2-dev  # For Debian/Ubuntu
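The two most common installation blockers described on this page, a Python 3 interpreter and missing SASL headers, can be checked before running `pip`. The sketch below is one possible preflight script. The header paths are assumptions about typical Linux layouts and may differ on your system, and `preflight` is a hypothetical helper name.

```python
import os
import sys

# Assumed header locations; adjust for your distribution.
SASL_HEADER_CANDIDATES = (
    "/usr/include/sasl/sasl.h",
    "/usr/local/include/sasl/sasl.h",
)


def preflight(version_info=sys.version_info,
              header_paths=SASL_HEADER_CANDIDATES,
              exists=os.path.exists):
    """Return a list of human-readable problems; empty means none found."""
    problems = []
    if version_info[0] != 2:
        problems.append(
            "pyhs2 requires Python 2.x, found %d.%d"
            % (version_info[0], version_info[1])
        )
    if not any(exists(path) for path in header_paths):
        problems.append(
            "sasl/sasl.h not found; install cyrus-sasl-devel or libsasl2-dev"
        )
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

An empty result does not guarantee that the build succeeds. It only rules out the two failure modes listed here.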
Imports
- connect
from pyhs2.connections import connect
import pyhs2
conn = pyhs2.connect(...)
- Pyhs2Exception
from pyhs2.error import Pyhs2Exception
Quickstart
from __future__ import print_function  # pyhs2 runs on Python 2 only

import os

import pyhs2
from pyhs2.error import Pyhs2Exception

# Connection settings come from the environment, with local defaults.
hive_host = os.environ.get('HIVE_HOST', 'localhost')
hive_port = int(os.environ.get('HIVE_PORT', '10000'))
hive_user = os.environ.get('HIVE_USER', 'hive')
hive_password = os.environ.get('HIVE_PASSWORD', '')
hive_database = os.environ.get('HIVE_DATABASE', 'default')

try:
    with pyhs2.connect(
        host=hive_host,
        port=hive_port,
        authMechanism="PLAIN",  # or "KERBEROS", "LDAP", "NOSASL"
        user=hive_user,
        password=hive_password,
        database=hive_database
    ) as conn:
        print("Successfully connected to Hive Server 2.")
        with conn.cursor() as cur:
            # Show databases
            print("Databases: {0}".format(cur.getDatabases()))
            # Execute a query
            cur.execute("SELECT * FROM some_table LIMIT 5")
            # Return column info
            print("Schema: {0}".format(cur.getSchema()))
            # Fetch table results
            print("Query Results:")
            for row in cur.fetch():
                print(row)
except Pyhs2Exception as e:
    # Errors reported by pyhs2 or by Hive for the statement
    print("pyhs2 error: {0}".format(e))
except Exception as e:
    # Anything else, such as Thrift transport failures
    print("An unexpected error occurred: {0}".format(e))