{"id":4409,"library":"acryl-datahub-airflow-plugin","title":"DataHub Airflow Plugin","description":"The `acryl-datahub-airflow-plugin` library provides an integration for Apache Airflow to automatically capture metadata, lineage, and run information from DAGs and tasks and send it to DataHub. It supports automatic column-level lineage extraction from various SQL operators, Airflow DAG and task properties, ownership, tags, and task run statuses. The plugin requires Airflow 2.7+ and Python 3.10+ and is currently at version 1.5.0.6, with a frequent release cadence tied to the broader DataHub project.","status":"active","version":"1.5.0.6","language":"en","source_language":"en","source_url":"https://github.com/datahub-project/datahub","tags":["Data Governance","Airflow","ETL","Metadata","Lineage","Data Catalog"],"install":[{"cmd":"pip install 'acryl-datahub-airflow-plugin[airflow2]'","lang":"bash","label":"For Airflow 2.x (2.7+) with Legacy OpenLineage"},{"cmd":"pip install 'apache-airflow-providers-openlineage>=1.0.0'\npip install acryl-datahub-airflow-plugin","lang":"bash","label":"For Airflow 2.7+ with native OpenLineage provider"},{"cmd":"pip install 'acryl-datahub-airflow-plugin[airflow3]'","lang":"bash","label":"For Airflow 3.x (3.1+) with native OpenLineage provider"},{"cmd":"pip install 'acryl-datahub-airflow-plugin[airflow3,datahub-kafka]'","lang":"bash","label":"For Airflow 3.x with Kafka emitter support"}],"dependencies":[{"reason":"Core DataHub SDK and REST emitter, installed via extra 'sql-parser,datahub-rest'.","package":"acryl-datahub","optional":false},{"reason":"Required for data validation. Minimum version >=2.4.0.","package":"pydantic","optional":false},{"reason":"Required for Airflow integration. 
Supports versions 2.7+ and 3.1+.","package":"apache-airflow","optional":false},{"reason":"Legacy OpenLineage package for Airflow 2.x, installed via `[airflow2]` extra.","package":"openlineage-airflow","optional":true},{"reason":"Native OpenLineage provider for Airflow 3.x or 2.7+, installed via `[airflow3]` extra.","package":"apache-airflow-providers-openlineage","optional":true}],"imports":[{"note":"These decorators are used for custom Airflow operators to ensure lineage is captured during `pre_execute` and `post_execute` methods. The plugin generally functions without direct imports in standard DAGs.","symbol":"prepare_lineage, apply_lineage","correct":"from datahub_airflow_plugin.operators.lineage import prepare_lineage, apply_lineage"}],"quickstart":{"code":"from datetime import datetime\n\nfrom airflow import DAG\nfrom airflow.operators.bash import BashOperator\n\n# Ensure DataHub REST connection is configured in Airflow UI or via CLI:\n# airflow connections add --conn-type 'datahub-rest' 'datahub_rest_default' --conn-host 'http://datahub-gms:8080' --conn-password \"$DATAHUB_AUTH_TOKEN\"\n\nwith DAG(\n    dag_id='datahub_example_dag',\n    start_date=datetime(2023, 1, 1),\n    schedule=None,  # `schedule` replaces the deprecated `schedule_interval` (Airflow 2.4+)\n    catchup=False,\n    tags=['datahub', 'lineage', 'example'],\n) as dag:\n    start_task = BashOperator(\n        task_id='start_task',\n        bash_command='echo \"Starting DAG\"',\n    )\n\n    # Example of a task that would automatically get lineage if it's a supported SQL operator\n    # (e.g., PostgresOperator, BigQueryInsertJobOperator, etc., not shown here for brevity).\n    # The plugin automatically extracts lineage based on OpenLineage events.\n    process_data_task = BashOperator(\n        task_id='process_data_task',\n        bash_command='echo \"Processing data...\" && sleep 5',\n        # For manual lineage, you can use inlets/outlets attributes 
(table-level only),\n        # using Dataset from datahub_airflow_plugin.entities:\n        # inlets=[Dataset(platform='postgres', name='mydb.public.source_table')],\n        # outlets=[Dataset(platform='postgres', name='mydb.public.target_table')],\n    )\n\n    end_task = BashOperator(\n        task_id='end_task',\n        bash_command='echo \"DAG finished\"',\n    )\n\n    start_task >> process_data_task >> end_task","lang":"python","description":"This quickstart demonstrates a basic Airflow DAG. Once the `acryl-datahub-airflow-plugin` is installed and a 'DataHub REST Server' connection named `datahub_rest_default` is configured in Airflow, the plugin automatically extracts metadata and lineage for supported operators (like SQL operators or those using native Airflow Datasets/Assets) without explicit Python imports in the DAG code. For custom operators, you might need to use `inlets` and `outlets` or `prepare_lineage`/`apply_lineage` decorators. Ensure your DataHub GMS host is accessible from Airflow."},"warnings":[{"fix":"Upgrade your Python environment to 3.10 or newer.","message":"Python 3.9 support has been dropped; all `acryl-datahub` modules, including the Airflow plugin, now require Python 3.10 or later. Upgrade Python before upgrading the plugin.","severity":"breaking","affected_versions":"<=1.3.x"},{"fix":"Upgrade Airflow to version 2.7+ (or 3.1+ for Airflow 3.x).","message":"The `acryl-datahub-airflow-plugin` has dropped support for Airflow versions earlier than 2.7. Users on older Airflow versions must upgrade Airflow or pin to an older plugin version.","severity":"breaking","affected_versions":"<=1.3.x"},{"fix":"Remove the `DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN=true` setting. The v2 plugin is now the default.","message":"The v1 plugin (`DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN=true`) has been removed. The v2 plugin is now the default. 
Users explicitly setting `DATAHUB_AIRFLOW_PLUGIN_USE_V1_PLUGIN=true` must migrate to the v2 plugin or pin to an older plugin version.","severity":"breaking","affected_versions":">=1.4.0.0"},{"fix":"Downgrade Airflow to 3.1.x or await a plugin update that addresses Airflow 3.2+ compatibility.","message":"The current DataHub Airflow plugin is not compatible with Airflow 3.2+: it imports `airflow.models.mappedoperator`, a deprecated module path, which raises `ModuleNotFoundError`.","severity":"breaking","affected_versions":">=1.5.0.0"},{"fix":"Upgrade Airflow to 3.1.0 or later, or manually upgrade `pydantic` to `>=2.11.8` (may cause dependency conflicts).","message":"Airflow 3.0.6 pins `pydantic==2.11.7`, which contains a bug preventing the DataHub plugin from importing correctly. This issue is resolved in Airflow 3.1.0+ (which uses `pydantic>=2.11.8`).","severity":"gotcha","affected_versions":"All versions with Airflow 3.0.6"},{"fix":"Use the appropriate method (Airflow Variable for 2.x, environment variable for 3.x) to disable the plugin.","message":"The 'kill switch' to disable the plugin differs between Airflow versions. For Airflow 2.x, set the Airflow Variable `datahub_airflow_plugin_disable_listener` to `true`. For Airflow 3.x, set the environment variable `AIRFLOW_VAR_DATAHUB_AIRFLOW_PLUGIN_DISABLE_LISTENER=true`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"URL-encode any slashes in the host portion of your Airflow connection URI. For example, a GMS endpoint of `http://datahub-gms:8080/api/gms` becomes `datahub-rest://datahub-gms:8080%2Fapi%2Fgms`; a host that includes the path is written as `my-datahub-host.net%2Fapi%2Fgms`.","message":"Errors like 'Unable to emit metadata to DataHub GMS' often stem from incorrect URL encoding in the Airflow DataHub connection string, specifically for the `/api/gms` path. 
Airflow may not correctly interpret unencoded slashes in connection hosts.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}