PyHive: Python Interface for Hive and Presto
PyHive provides Python DB-API and SQLAlchemy interfaces for data warehouses, primarily Apache Hive and Presto. It lets Python applications connect to these systems, run queries, and fetch results. The current version is 0.7.0. Releases are sporadic, with significant gaps between both major and minor versions.
Warnings
- breaking Major protocol and Thrift binding updates in PyHive versions can cause incompatibility with older HiveServer2 versions. Specifically, v0.2.0 changed to Hive protocol V6 (requiring Hive 0.13+), and v0.5.0 updated Thrift bindings to V11. Ensure your PyHive version matches the expected protocol/Thrift version of your HiveServer2.
- breaking PyHive v0.2.0 introduced changes to data return types: rows are now returned as tuples instead of lists, and binary data is returned as byte strings instead of Unicode strings. This can break existing code that assumes specific data types.
- gotcha PyHive's core installation (`pip install pyhive`) does not include the dependencies needed for Hive, Presto, Kerberos, or SQLAlchemy integration. Install them through the optional extras.
- gotcha Older PyHive versions explicitly dropped support for specific SQLAlchemy versions (e.g., v0.5.0 dropped SQLAlchemy 0.6, v0.5.1 dropped SQLAlchemy 0.7). Newer PyHive versions generally aim for compatibility, but verify support before pairing an older PyHive release with a very old or very new SQLAlchemy version.
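Code written against the pre-0.2.0 return types (lists and Unicode strings) can be shielded from the tuple and byte-string change with a small normalizing step. The sketch below is illustrative only: `normalize_row` is a hypothetical helper, not part of PyHive, and it assumes that any binary values actually hold UTF-8 text. Truly binary columns should be left as bytes.

```python
def normalize_row(row, encoding="utf-8"):
    """Return a row as a list, decoding bytes values to str.

    Hypothetical helper (not part of PyHive). Assumes binary values
    contain text in the given encoding.
    """
    return [
        value.decode(encoding) if isinstance(value, (bytes, bytearray)) else value
        for value in row
    ]


# Rows shaped like those returned since v0.2.0: tuples, with bytes for binary data.
sample_rows = [(1, b"alpha"), (2, None)]
normalized = [normalize_row(row) for row in sample_rows]
print(normalized)
```

Converting at the boundary keeps the rest of the application independent of which PyHive version produced the rows.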
Install
- pip install pyhive[hive]
- pip install pyhive[presto]
- pip install pyhive[hive,presto,hive_kerberos,sqlalchemy]
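Because the base package installs without these extras, a missing dependency usually surfaces only at import or connect time. A minimal sketch of an up-front check follows. The mapping from extras to module names (`thrift`, `sasl`, `thrift_sasl`, `requests`, `sqlalchemy`) is an assumption for illustration and may differ between PyHive versions, and `missing_modules` is a hypothetical helper.

```python
import importlib.util

# Assumed mapping of extras to the top-level modules they provide.
# Adjust to match the PyHive version actually installed.
EXTRA_MODULES = {
    "hive": ("thrift", "sasl", "thrift_sasl"),
    "presto": ("requests",),
    "sqlalchemy": ("sqlalchemy",),
}


def missing_modules(module_names):
    """Return the top-level module names that cannot be found."""
    return [
        name for name in module_names
        if importlib.util.find_spec(name) is None
    ]


for extra, modules in EXTRA_MODULES.items():
    missing = missing_modules(modules)
    status = "ok" if not missing else "missing: " + ", ".join(missing)
    print(f"pyhive[{extra}]: {status}")
```

The check only looks modules up and does not import them, so it is safe to run before any connection is attempted.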
Imports
- connect (Hive)
from pyhive.hive import connect
- connect (Presto)
from pyhive.presto import connect
- HiveConnection
from pyhive.hive import Connection as HiveConnection
- PrestoConnection
from pyhive.presto import Connection as PrestoConnection
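Both modules expose a function named `connect`, so importing the two side by side causes a name clash. One possible way around this, sketched below, is to resolve the function lazily by backend name. This also means only the extra for the backend in use has to be installed. `get_connect` is a hypothetical helper, not a PyHive API.

```python
import importlib

SUPPORTED_BACKENDS = ("hive", "presto")


def get_connect(backend):
    """Return the `connect` function of pyhive.<backend>.

    Hypothetical helper: the module is imported only when requested,
    so an unused backend's dependencies are never loaded.
    """
    if backend not in SUPPORTED_BACKENDS:
        raise ValueError(
            f"Unsupported backend {backend!r}; expected one of {SUPPORTED_BACKENDS}"
        )
    module = importlib.import_module(f"pyhive.{backend}")
    return module.connect


# Usage (requires PyHive and the matching extra to be installed):
# connect = get_connect("hive")
# connection = connect(host="localhost", port=10000)
```

Plain module imports (`from pyhive import hive, presto`) followed by `hive.connect(...)` and `presto.connect(...)` avoid the clash as well, at the cost of loading both backends.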
Quickstart
import os
from pyhive import hive

# Example for Hive connection
# Ensure HiveServer2 is running and accessible
# Replace with your actual host, port, username, database
host = os.environ.get('HIVE_HOST', 'localhost')
port = int(os.environ.get('HIVE_PORT', 10000))
username = os.environ.get('HIVE_USERNAME', 'anonymous')
database = os.environ.get('HIVE_DATABASE', 'default')

connection = None
cursor = None
try:
    connection = hive.connect(host=host, port=port, username=username, database=database)
    cursor = connection.cursor()

    # Execute a query
    cursor.execute('SELECT 1 + 1')

    # Fetch results
    result = cursor.fetchone()
    print(f"Query result: {result}")

    cursor.execute('SHOW TABLES')
    tables = cursor.fetchall()
    print("Available tables:")
    for table in tables:
        print(f"  {table[0]}")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Close the cursor and connection even if a query failed
    if cursor:
        cursor.close()
    if connection:
        connection.close()
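For large result sets, `fetchall()` loads every row into memory at once. The sketch below shows one alternative, under the assumption that the cursor follows the standard DB-API `fetchmany` contract: read rows in batches and let `contextlib.closing` handle cleanup. `iter_rows` is a hypothetical helper, and `sqlite3` stands in for HiveServer2 purely so the example runs without a server.

```python
import sqlite3
from contextlib import closing


def iter_rows(cursor, batch_size=1000):
    """Yield rows from an already-executed DB-API cursor, one batch at a time."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield from batch


# sqlite3 is a stand-in here; with PyHive the connection would come
# from hive.connect(...) as in the quickstart.
with closing(sqlite3.connect(":memory:")) as connection:
    with closing(connection.cursor()) as cursor:
        cursor.execute(
            "SELECT 1 AS n UNION ALL SELECT 2 UNION ALL SELECT 3 ORDER BY n"
        )
        values = [row[0] for row in iter_rows(cursor, batch_size=2)]

print(values)
```

`closing` calls `close()` on exit even when an exception is raised, which replaces the explicit `finally` block for simple cases.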