DataHub Airflow Plugin
1.5.0.6 · verified Sun Apr 12 · auth: no · python
The `acryl-datahub-airflow-plugin` library provides an integration for Apache Airflow to automatically capture metadata, lineage, and run information from DAGs and tasks and send it to DataHub. It supports automatic column-level lineage extraction from various SQL operators, Airflow DAG and task properties, ownership, tags, and task run statuses. The plugin requires Airflow 2.7+ and Python 3.10+ and is currently at version 1.5.0.6, with a frequent release cadence tied to the broader DataHub project.
pip install 'acryl-datahub-airflow-plugin[airflow2]'
Common errors
error ImportError: cannot import name 'CustomAssertionInfoClass' from 'datahub.metadata.schema_classes'
cause The installed versions of `acryl-datahub-airflow-plugin` and `acryl-datahub` (the package that provides the `datahub` module) do not match, so classes the plugin expects are missing or renamed.
fix
Upgrade `acryl-datahub-airflow-plugin` and `acryl-datahub` together to compatible versions. Refer to the official documentation for the version compatibility matrix.
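One quick way to look for this kind of mismatch is to print the installed versions of both distributions side by side. The sketch below uses only the standard library. The helper name `installed_versions` is hypothetical, and treating differing version strings as a warning sign is an assumption, not a rule taken from the compatibility matrix.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {distribution name: installed version, or None if not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    versions = installed_versions(["acryl-datahub", "acryl-datahub-airflow-plugin"])
    for name, ver in versions.items():
        print(f"{name}: {ver or 'not installed'}")
    # Assumption: differing version strings hint that the two may be out of step.
    if len(set(versions.values())) > 1:
        print("Versions differ; check the compatibility matrix.")
```

Run it with the same interpreter that Airflow uses, since a scheduler or worker may have a different environment from your shell.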
error ModuleNotFoundError: No module named 'airflow.providers.common.compat.openlineage.utils'
cause This error arises because the 'airflow-openlineage' package has been deprecated in Airflow 2.10.2, and the plugin needs to support the official 'apache-airflow-providers-openlineage' package.
fix
Upgrade to the official 'apache-airflow-providers-openlineage' package and ensure that 'acryl-datahub-airflow-plugin' is updated to a version that supports this package.
error ModuleNotFoundError: No module named 'datahub.metadata._schema_classes'
cause This error indicates a version mismatch between the DataHub CLI and server, leading to missing modules.
fix
Ensure that the DataHub CLI version matches the server version. Review the release notes for any breaking changes and update the ingestion recipe syntax if necessary.
error ModuleNotFoundError: No module named 'plugins'
cause This error occurs when Airflow cannot find the 'plugins' module due to incorrect import statements or misconfigured plugin paths.
fix
Ensure that the 'plugins' directory is correctly set up and that import statements are correctly referencing the modules within the 'plugins' directory.
error ModuleNotFoundError: No module named 'airflow.providers.common.compat.openlineage.utils' OR ModuleNotFoundError: No module named 'airflow.models.mappedoperator'
cause This error typically occurs due to an incompatibility between the `acryl-datahub-airflow-plugin` version and your Apache Airflow environment's specific version or its OpenLineage provider dependencies, especially when Airflow internal modules are deprecated or moved.
fix
Install the plugin with the extra that matches your Airflow version. For Airflow 2.7+, use `pip install 'acryl-datahub-airflow-plugin[airflow2]'`, or install `apache-airflow-providers-openlineage` separately and then run `pip install acryl-datahub-airflow-plugin`. For Airflow 3.1+, use `pip install 'acryl-datahub-airflow-plugin[airflow3]'`.
Warnings
breaking Python 3.9 support has been dropped; all `acryl-datahub` modules, including the Airflow plugin, now require Python 3.10 or later. Upgrade Python before upgrading the plugin.
fix Upgrade your Python environment to 3.10 or newer.
breaking The `acryl-datahub-airflow-plugin` has dropped support for Airflow versions less than 2.7. Users on older Airflow versions must upgrade Airflow or pin to an older plugin version.
fix Upgrade Airflow to version 2.7+ (or 3.1+ for Airflow 3.x).
breaking The v1 plugin (enabled with `DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN=true`) has been removed, and the v2 plugin is now the default. Users who explicitly set `DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN=true` must migrate to the v2 plugin or pin to an older plugin version.
fix Remove the `DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN=true` setting from your configuration; the v2 plugin is used by default.
breaking As of version 1.5.0.6, the DataHub Airflow plugin is not compatible with Airflow 3.2+: it still uses the deprecated import `airflow.models.mappedoperator`, which raises `ModuleNotFoundError` on those versions.
fix Downgrade Airflow to 3.1.x or await a plugin update that addresses Airflow 3.2+ compatibility.
gotcha Airflow 3.0.6 pins `pydantic==2.11.7`, which contains a bug preventing the DataHub plugin from importing correctly. This issue is resolved in Airflow 3.1.0+ (which uses `pydantic>=2.11.8`).
fix Upgrade Airflow to 3.1.0 or later, or manually upgrade `pydantic` to `>=2.11.8` (may cause dependency conflicts).
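The version constraints in the warnings above can be condensed into a small lookup. This is a sketch, not part of the plugin: `parse_version` and `recommended_extra` are hypothetical helpers, and treating all of Airflow 3.0.x as unsupported is an assumption based on the `airflow3` extra being documented only for 3.1+.

```python
def parse_version(text):
    """Turn '3.1.0' into (3, 1, 0), padding with zeros; pre-release suffixes are not handled."""
    parts = [int(part) for part in text.split(".")[:3]]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def recommended_extra(airflow_version):
    """Map an Airflow version string to (pip extra or None, explanatory note)."""
    v = parse_version(airflow_version)
    if v < (2, 7):
        return None, "Airflow < 2.7 is no longer supported; upgrade Airflow or pin an older plugin."
    if v < (3, 0):
        return "airflow2", "Supported."
    if v < (3, 1):
        # Assumption: 3.0.x is treated as unsupported because the extra targets 3.1+.
        return None, "The airflow3 extra targets 3.1+; 3.0.6 also pins the buggy pydantic 2.11.7."
    if v < (3, 2):
        return "airflow3", "Supported."
    return None, "Airflow 3.2+ is not compatible yet; stay on 3.1.x."


if __name__ == "__main__":
    for candidate in ["2.6.3", "2.10.2", "3.0.6", "3.1.0", "3.2.0"]:
        print(candidate, recommended_extra(candidate))
```

A `None` extra means the combination should be avoided rather than installed without an extra.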
gotcha The 'kill switch' to disable the plugin differs between Airflow versions. For Airflow 2.x, use an Airflow Variable `datahub_airflow_plugin_disable_listener` set to `true`. For Airflow 3.x, use the environment variable `AIRFLOW_VAR_DATAHUB_AIRFLOW_PLUGIN_DISABLE_LISTENER=true`.
fix Use the appropriate method (Airflow Variable for 2.x, environment variable for 3.x) to disable the plugin.
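The two kill-switch names are related: Airflow resolves a Variable named `key` from an environment variable called `AIRFLOW_VAR_` followed by the upper-cased key. The sketch below shows that mapping. The helpers `airflow_var_env_name` and `disable_via_env` are hypothetical, and the Airflow 2.x route is left as a comment because it needs a working Airflow installation.

```python
import os

KILL_SWITCH = "datahub_airflow_plugin_disable_listener"


def airflow_var_env_name(key):
    """Return the environment variable name Airflow checks for the Variable `key`."""
    return "AIRFLOW_VAR_" + key.upper()


def disable_via_env(environ=os.environ):
    """Set the kill switch as an environment variable (the Airflow 3.x route)."""
    environ[airflow_var_env_name(KILL_SWITCH)] = "true"


# Airflow 2.x route, run where Airflow is installed and its metadata DB is reachable:
#   from airflow.models import Variable
#   Variable.set("datahub_airflow_plugin_disable_listener", "true")

if __name__ == "__main__":
    print(airflow_var_env_name(KILL_SWITCH))
```

Setting the variable from Python only affects the current process and its children. In practice it belongs in the environment of the scheduler and workers before they start.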
gotcha Errors like 'Unable to emit metadata to DataHub GMS' often stem from incorrect URL encoding in the Airflow DataHub connection string, specifically for the `/api/gms` path. Airflow may not correctly interpret unencoded slashes in connection hosts.
fix URL-encode the slashes in the host portion of your Airflow connection string. For example, a GMS endpoint at `http://datahub-gms:8080/api/gms` becomes `datahub-rest://datahub-gms:8080%2Fapi%2Fgms`; if your host *includes* the path, write it as `my-datahub-host.net%2Fapi%2Fgms`.
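The encoding can be produced with the standard library rather than by hand. This is a minimal sketch: the helper name `datahub_rest_conn_uri` is hypothetical, and it only covers the host-and-path part, not credentials or extra parameters.

```python
from urllib.parse import quote


def datahub_rest_conn_uri(host_and_path):
    """Build a datahub-rest:// URI with the slashes in `host_and_path` percent-encoded."""
    # safe=":" keeps the port separator readable; "/" becomes %2F.
    return "datahub-rest://" + quote(host_and_path, safe=":")


print(datahub_rest_conn_uri("datahub-gms:8080/api/gms"))
# prints datahub-rest://datahub-gms:8080%2Fapi%2Fgms
```

The resulting string can be used wherever Airflow accepts a connection URI, for example with the `--conn-uri` option of `airflow connections add`.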
Install
pip install 'apache-airflow-providers-openlineage>=1.0.0'
pip install acryl-datahub-airflow-plugin
pip install 'acryl-datahub-airflow-plugin[airflow3]'
pip install 'acryl-datahub-airflow-plugin[airflow3,datahub-kafka]'
Imports
- prepare_lineage, apply_lineage
from datahub_airflow_plugin.operators.lineage import prepare_lineage, apply_lineage
Quickstart
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Ensure the DataHub REST connection is configured in the Airflow UI or via the CLI:
# airflow connections add --conn-type 'datahub-rest' 'datahub_rest_default' --conn-host 'http://datahub-gms:8080' --conn-password "$DATAHUB_AUTH_TOKEN"

with DAG(
    dag_id='datahub_example_dag',
    start_date=datetime(2023, 1, 1),
    schedule=None,  # `schedule` supersedes `schedule_interval` (Airflow 2.4+)
    catchup=False,
    tags=['datahub', 'lineage', 'example'],
) as dag:
    start_task = BashOperator(
        task_id='start_task',
        bash_command='echo "Starting DAG"',
    )

    # A supported SQL operator (e.g., PostgresOperator, BigQueryInsertJobOperator)
    # would get lineage automatically; a BashOperator stands in here for brevity.
    # The plugin automatically extracts lineage based on OpenLineage events.
    process_data_task = BashOperator(
        task_id='process_data_task',
        bash_command='echo "Processing data..." && sleep 5',
        # For manual (table-level) lineage, set inlets/outlets, e.g. with
        # `from datahub_airflow_plugin.entities import Dataset`:
        # inlets=[Dataset(platform='postgres', name='mydb.public.source_table')],
        # outlets=[Dataset(platform='postgres', name='mydb.public.target_table')],
    )

    end_task = BashOperator(
        task_id='end_task',
        bash_command='echo "DAG finished"',
    )

    start_task >> process_data_task >> end_task
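The platform and name pairs in the commented-out inlets and outlets identify DataHub datasets. As an illustration of what such a pair refers to, the sketch below builds a dataset URN string by hand. The helper `dataset_urn` is hypothetical and the `PROD` default environment is an assumption; in real pipelines the plugin and the DataHub SDK construct URNs for you.

```python
def dataset_urn(platform, name, env="PROD"):
    """Format a DataHub dataset URN from a platform, a dataset name, and an environment."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


print(dataset_urn("postgres", "mydb.public.source_table"))
# prints urn:li:dataset:(urn:li:dataPlatform:postgres,mydb.public.source_table,PROD)
```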