Apache Airflow Apache Hive Provider
The `apache-airflow-providers-apache-hive` package provides Apache Airflow operators, hooks, and sensors for interacting with Apache Hive. It supports both HiveServer2 connections (via `HiveServer2Hook`) and direct Hive CLI execution (via `HiveCliHook`). Currently at version 9.4.2, it follows the Apache Airflow providers release cycle and is versioned and released independently of Airflow core.
Common errors
- ModuleNotFoundError: No module named 'pyhive'
  cause: `HiveServer2Hook` (used for HiveServer2 connections) requires the `pyhive` library, which may not be installed with the provider package by default.
  fix: Install `pyhive` alongside the provider (`pip install pyhive`), and add the `[kerberos]` extra (`pip install 'apache-airflow-providers-apache-hive[kerberos]'`) if Kerberos authentication is used.
- airflow.exceptions.AirflowException: Could not find `hive` command in the PATH. Please ensure Hive CLI is installed and configured.
  cause: `HiveOperator` (which runs HQL through a `hive_cli_conn_id` connection) cannot locate the `hive` command-line interface on the Airflow worker.
  fix: Install the Apache Hive client utilities on the worker machine and ensure the `hive` executable is on the system PATH. Alternatively, query HiveServer2 directly through `HiveServer2Hook` and `pyhive`.
- pyhive.exc.OperationalError: TTransportException: Could not connect to ...
  cause: `HiveServer2Hook` failed to establish a connection to HiveServer2: incorrect host/port, network issues, or HiveServer2 not running.
  fix: Verify the Hive connection details (host, port, schema) in the Airflow UI. Ensure HiveServer2 is running and reachable from the Airflow worker; check firewall rules and network connectivity. Enable debug logging for `pyhive` for more detailed connection errors.
- sqlalchemy.exc.DBAPIError: (pyhive.exc.OperationalError) TTransportException: GSS-API (or Kerberos) authentication failed
  cause: Kerberos authentication failed when `HiveServer2Hook` connected to HiveServer2, usually due to a misconfigured keytab, principal, or client environment.
  fix: Ensure the Kerberos keytab is valid and readable, the principal matches the HiveServer2 service principal, and `kinit` can successfully obtain a ticket. Configure the Airflow connection for Kerberos authentication with the correct principal and keytab, and install `apache-airflow-providers-apache-hive[kerberos]`.
Warnings
- breaking Airflow 1.x `contrib` operators/hooks were moved to provider packages in Airflow 2.x. Direct imports from `airflow.contrib` will fail.
- gotcha `HiveOperator` executes HQL through the Hive CLI and takes a `hive_cli_conn_id`, while `HiveServer2Hook` connects to HiveServer2 and takes a `hiveserver2_conn_id`. They use different underlying mechanisms and require different connection configurations.
- gotcha Missing required underlying Python libraries for HiveServer2 connections (e.g., `pyhive`, `thrift-sasl`).
- gotcha Complexities with Kerberos authentication for Hive connections.
Install
- pip install apache-airflow-providers-apache-hive
- pip install 'apache-airflow-providers-apache-hive[jdbc,kerberos,presto,s3,samba,sasl]'
- pip install 'apache-airflow-providers-apache-hive[kerberos]'
Imports
- HiveOperator
from airflow.contrib.operators.hive_operator import HiveOperator
from airflow.providers.apache.hive.operators.hive import HiveOperator
- HiveServer2Hook
from airflow.contrib.hooks.hive_hooks import HiveServer2Hook
from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook
- HiveCliHook
from airflow.contrib.hooks.hive_hooks import HiveCliHook
from airflow.providers.apache.hive.hooks.hive import HiveCliHook
- HivePartitionSensor
from airflow.contrib.sensors.hive_partition_sensor import HivePartitionSensor
from airflow.providers.apache.hive.sensors.hive_partition import HivePartitionSensor
Quickstart
from __future__ import annotations
import pendulum
from airflow.models.dag import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator
with DAG(
    dag_id='hive_example_dag',
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    schedule=None,
    tags=['hive', 'example'],
) as dag:
    # HiveOperator submits the HQL through the Hive CLI on the worker,
    # using a 'hive_cli_conn_id' connection (not a HiveServer2 connection).
    run_hive_query = HiveOperator(
        task_id='run_hive_query',
        hive_cli_conn_id='hive_cli_default',
        hql='''
        CREATE TABLE IF NOT EXISTS my_test_table (
            id INT,
            name STRING
        );
        INSERT INTO TABLE my_test_table VALUES (1, 'Alice');
        SELECT COUNT(*) FROM my_test_table;
        ''',
        # schema='default'  # Optional: target Hive database/schema
    )
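The `hql` field of `HiveOperator` is one of its Jinja-templated fields, so run-time values such as `{{ ds }}` are rendered before execution. Below is a stdlib-only sketch of that substitution idea, using `string.Template` as a stand-in for Airflow's actual Jinja rendering; the partition scheme on `my_test_table` is hypothetical:

```python
from string import Template

# Stand-in for a templated HQL statement; in a real DAG, Airflow's Jinja
# engine would render a placeholder like {{ ds }} at task run time.
hql = Template("ALTER TABLE my_test_table ADD IF NOT EXISTS PARTITION (ds='$ds');")

# Airflow supplies the logical date automatically; here we substitute manually.
rendered = hql.substitute(ds="2023-01-01")
print(rendered)
```

Keeping dates and partition values as template placeholders (rather than hard-coding them) is what makes backfills and scheduled reruns produce the correct per-run statements.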