Apache Airflow Apache Hive Provider

9.4.2 verified Thu Apr 16 auth: no python

The `apache-airflow-providers-apache-hive` package provides Apache Airflow operators, hooks, and sensors for interacting with Apache Hive. It supports both HiveServer2 connections (via `HiveServer2Hook`) and direct Hive CLI execution (via `HiveCliHook`). Currently at version 9.4.2, it follows the Apache Airflow providers release cycle, in which provider packages are versioned and released independently of Airflow core.

pip install apache-airflow-providers-apache-hive
error ModuleNotFoundError: No module named 'pyhive'
cause `HiveServer2Hook` (used for HiveServer2 connections) imports the `pyhive` library, which is missing from the Python environment that runs the task.
fix
Install `pyhive` into the same environment as the provider: `pip install pyhive`, or `pip install apache-airflow-providers-apache-hive[kerberos]` if using Kerberos.
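A quick pre-flight check can confirm that the required modules are importable in the environment that actually runs the task. The sketch below uses only the standard library; the function name `missing_modules` and the module list are illustrative assumptions, not part of the provider.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed module list for HiveServer2 access; adjust it to your setup.
    missing = missing_modules(["pyhive", "thrift"])
    print("missing:", ", ".join(missing) if missing else "none")
```

Running this with the same interpreter the Airflow worker uses helps rule out the case where the package was installed into a different virtual environment.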
error airflow.exceptions.AirflowException: Could not find `hive` command in the PATH. Please ensure Hive CLI is installed and configured.
cause `HiveOperator` runs HQL through `HiveCliHook` (configured with `hive_cli_conn_id`), which cannot locate the `hive` command-line interface on the Airflow worker.
fix
Install the Apache Hive client utilities on the Airflow worker machine and ensure the `hive` executable is on the system's `PATH`. Alternatively, run the query over HiveServer2 with `HiveServer2Hook` (`hiveserver2_conn_id`), which uses `pyhive` instead of the local CLI.
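To see what the worker can actually resolve, a small standard-library check is often enough. This is a minimal sketch: the helper name `find_hive_cli` is hypothetical, and `beeline` is included on the assumption that the connection may be configured to use Beeline instead of the `hive` binary.

```python
import shutil


def find_hive_cli(candidates=("hive", "beeline")):
    """Map each candidate executable name to its absolute path, or None."""
    return {name: shutil.which(name) for name in candidates}


if __name__ == "__main__":
    for name, path in find_hive_cli().items():
        print(f"{name}: {path or 'not found on PATH'}")
```

The check is only meaningful when run as the same user and with the same environment as the Airflow worker process, since `PATH` can differ between login shells and services.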
error pyhive.exc.OperationalError: TTransportException: Could not connect to ...
cause `HiveServer2Hook` failed to establish a connection to HiveServer2. Common causes are an incorrect host or port, network issues, or a HiveServer2 instance that is down or unreachable.
fix
Verify the Hive connection details (host, port, schema) in the Airflow UI. Ensure HiveServer2 is running and accessible from the Airflow worker. Check firewall rules and network connectivity. Enable debug logging for pyhive for more detailed connection errors.
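Before digging into `pyhive` logs, it can help to separate plain network problems from Hive-level ones. The sketch below only tests whether a TCP connection opens; it does not speak the Thrift protocol. The helper name `can_connect` and the host are hypothetical, and port 10000 is assumed as the common HiveServer2 default.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical host; replace with the host from the Airflow connection.
    print(can_connect("hive.example.internal", 10000))
```

A `False` result points at DNS, firewall, or service availability; a `True` result suggests the problem lies in authentication or transport settings instead.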
error sqlalchemy.exc.DBAPIError: (pyhive.exc.OperationalError) TTransportException: GSS-API (or Kerberos) authentication failed
cause Kerberos authentication failed when `HiveServer2Hook` tried to connect to HiveServer2. This is often due to a misconfigured keytab, principal, or client environment.
fix
Ensure the Kerberos keytab is valid and readable by the worker, the principal matches the service principal, and `kinit` can successfully obtain a ticket. Verify that the Hive connection in the Airflow UI has 'Auth Mechanism' set to 'Kerberos' and that 'Principal' and 'Keytab Path' are correct. Install `apache-airflow-providers-apache-hive[kerberos]`.
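Whether a usable ticket exists on the worker can be checked programmatically. The sketch below assumes MIT Kerberos client tools, where `klist -s` prints nothing and exits with status 0 only when the credentials cache is readable and unexpired; the helper name `has_valid_ticket` is hypothetical.

```python
import shutil
import subprocess


def has_valid_ticket(klist="klist"):
    """Return True if `klist -s` reports a readable, unexpired ticket cache."""
    if shutil.which(klist) is None:
        # Kerberos client tools are not installed or not on PATH.
        return False
    result = subprocess.run([klist, "-s"], check=False)
    return result.returncode == 0


if __name__ == "__main__":
    print("valid ticket" if has_valid_ticket() else "no valid ticket")
```

If this returns `False` on the worker right after `kinit`, the ticket cache is probably being written for a different user or to a different cache location than the one the Airflow process reads.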
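breaking Airflow 1.x core and `contrib` operators/hooks were moved to provider packages in Airflow 2.x. The Hive classes previously lived in modules such as `airflow.operators.hive_operator` and `airflow.hooks.hive_hooks`; imports from the legacy paths are deprecated and fail once the compatibility shims are removed.
fix Update legacy imports such as `airflow.operators.hive_operator` to `airflow.providers.apache.hive.operators.hive`, and use the corresponding `airflow.providers.apache.hive.hooks` and `airflow.providers.apache.hive.sensors` paths for hooks and sensors.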
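For a codebase with many DAG files, the path update can be scripted. The following is a rough sketch rather than a complete migration tool: the helper name `rewrite_imports` is hypothetical, and the legacy module paths in the mapping are assumed from Airflow 1.10 and should be checked against the imports your DAGs actually use.

```python
import re

# Assumed mapping from Airflow 1.10-era module paths to provider paths.
OLD_TO_NEW = {
    "airflow.operators.hive_operator": "airflow.providers.apache.hive.operators.hive",
    "airflow.hooks.hive_hooks": "airflow.providers.apache.hive.hooks.hive",
    "airflow.sensors.hive_partition_sensor": "airflow.providers.apache.hive.sensors.hive_partition",
}

# Match any legacy path as a whole dotted name, not as part of a longer one.
_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(old) for old in OLD_TO_NEW) + r")\b"
)


def rewrite_imports(source):
    """Replace known legacy Hive module paths in a piece of DAG source code."""
    return _PATTERN.sub(lambda match: OLD_TO_NEW[match.group(1)], source)


if __name__ == "__main__":
    print(rewrite_imports("from airflow.operators.hive_operator import HiveOperator"))
```

Class names such as `HiveOperator` are left untouched; only the module path changes, so a review of the resulting diff is still advisable.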
gotcha `HiveOperator` and HiveServer2 access are separate paths. `HiveOperator` takes `hive_cli_conn_id` and runs through `HiveCliHook`; HiveServer2 queries go through `HiveServer2Hook` with `hiveserver2_conn_id`. They use different underlying mechanisms and require different connection configurations.
fix For `HiveOperator`, make sure the `hive` command-line tool is installed and configured on the Airflow worker. For `HiveServer2Hook`, install a DBAPI driver such as `pyhive` and set up a HiveServer2 connection in the Airflow UI.
gotcha Missing required underlying Python libraries for HiveServer2 connections (e.g., `pyhive`, `thrift-sasl`).
fix Install the necessary extras or packages: `pip install apache-airflow-providers-apache-hive[kerberos]` or manually `pip install pyhive thrift-sasl`. The specific requirements depend on the connection type (e.g., Kerberos, LDAP).
gotcha Kerberos authentication for Hive connections depends on client-side setup on every Airflow worker, not only on the Airflow connection.
fix Ensure the Kerberos client (`kinit`) is properly configured on the Airflow worker, keytabs are accessible, and the `KRB5_KTNAME` environment variable is set if using a non-default keytab path. Also install `apache-airflow-providers-apache-hive[kerberos]` and configure the Hive connection in the Airflow UI with 'Auth Mechanism' set to 'Kerberos'.
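Connections can also be supplied as URIs through `AIRFLOW_CONN_<CONN_ID>` environment variables, which keeps Kerberos settings out of the UI. The sketch below builds such a URI with the standard library. The helper name, the host, and the extras keys `auth_mechanism` and `kerberos_service_name` are assumptions here; confirm the keys against the provider's connection documentation for your version.

```python
from urllib.parse import quote, urlencode


def hiveserver2_conn_uri(host, port=10000, schema="default", login=None, extras=None):
    """Build an Airflow-style connection URI for a HiveServer2 connection."""
    userinfo = f"{quote(login, safe='')}@" if login else ""
    # Query parameters of a connection URI are read by Airflow as extras.
    query = f"?{urlencode(extras)}" if extras else ""
    return f"hiveserver2://{userinfo}{host}:{port}/{quote(schema, safe='')}{query}"


if __name__ == "__main__":
    uri = hiveserver2_conn_uri(
        "hive.example.internal",  # hypothetical host
        extras={"auth_mechanism": "KERBEROS", "kerberos_service_name": "hive"},
    )
    # Export under the upper-cased connection ID to define the connection.
    print(f"AIRFLOW_CONN_HIVESERVER2_DEFAULT={uri}")
```

Percent-encoding the login and schema avoids broken URIs when a principal contains characters such as `/` or `@`.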
pip install apache-airflow-providers-apache-hive[jdbc,kerberos,presto,s3,samba,sasl]
pip install apache-airflow-providers-apache-hive[kerberos]

This quickstart demonstrates a basic DAG that uses `HiveOperator` to execute HQL (Hive Query Language). It uses `hive_cli_conn_id='hive_cli_default'`, which relies on the `hive` CLI being available in the Airflow worker's environment. To query HiveServer2 instead, configure a HiveServer2 connection in the Airflow UI (e.g., `hiveserver2_default`) and use `HiveServer2Hook` with `hiveserver2_conn_id`, ensuring `pyhive` is installed.

from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.providers.apache.hive.operators.hive import HiveOperator


with DAG(
    dag_id='hive_example_dag',
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
    schedule=None,
    tags=['hive', 'example'],
) as dag:
    # Run HQL through the Hive CLI; the `hive` executable must be on the worker's PATH.
    run_hive_query = HiveOperator(
        task_id='run_hive_query',
        hive_cli_conn_id='hive_cli_default',  # ID of the Hive CLI connection defined in Airflow
        hql='''
            CREATE TABLE IF NOT EXISTS my_test_table (
                id INT,
                name STRING
            );
            INSERT INTO TABLE my_test_table VALUES (1, 'Alice');
            SELECT COUNT(*) FROM my_test_table;
        ''',
        # schema='default' # Optional: Specify the target schema
    )
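For the HiveServer2 route, a query can be issued from Python code through `HiveServer2Hook`. The following is a minimal sketch under a few assumptions: a connection named `hiveserver2_default` exists, `pyhive` is installed, and the helper names `build_count_query` and `fetch_row_count` are hypothetical. The provider import is deferred so the module still loads where the provider is absent.

```python
import re


def build_count_query(table):
    """Return a COUNT(*) query for a plain or schema-qualified table name."""
    if not re.fullmatch(r"[A-Za-z_]\w*(\.[A-Za-z_]\w*)?", table):
        raise ValueError(f"unexpected table name: {table!r}")
    return f"SELECT COUNT(*) FROM {table}"


def fetch_row_count(table, conn_id="hiveserver2_default"):
    """Run the count query over HiveServer2 and return the count as an int."""
    # Imported lazily so this module loads even without the provider installed.
    from airflow.providers.apache.hive.hooks.hive import HiveServer2Hook

    hook = HiveServer2Hook(hiveserver2_conn_id=conn_id)
    rows = hook.get_records(build_count_query(table))
    return int(rows[0][0])
```

`fetch_row_count("my_test_table")` could then be called from a `PythonOperator` or a `@task`-decorated function. Validating the table name before interpolating it keeps the sketch from building arbitrary HQL out of untrusted input.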