Apache Airflow Druid Provider

4.5.2 · active · verified Thu Apr 16

The Apache Airflow Druid Provider enables Airflow to interact with Apache Druid, a high-performance, real-time analytics database. It includes hooks and operators for executing Druid queries and loading data. The current version is 4.5.2, and Airflow providers generally follow Airflow's release cadence, with updates for new features and bug fixes.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to create an Airflow DAG that uses the `DruidOperator` to execute both SQL and native JSON queries against a Druid cluster. It assumes a Druid connection named `druid_default` is configured in Airflow. The examples show querying a 'wikipedia' datasource.

import os
from datetime import datetime

from airflow.models.dag import DAG
from airflow.providers.apache.druid.operators.druid import DruidOperator

# Ensure you have an Airflow connection named 'druid_default'
# Or set one up: airflow connections add --conn-id druid_default --conn-type Druid --conn-host localhost --conn-port 8082
# For local testing, Druid usually runs on localhost:8082 by default
DRUID_CONN_ID = os.environ.get('AIRFLOW_DRUID_CONN_ID', 'druid_default')

with DAG(
    dag_id='druid_example_dag',
    start_date=datetime(2023, 1, 1),
    schedule=None,
    catchup=False,
    tags=['druid', 'example'],
) as dag:
    # Example of a simple Druid SQL query
    run_druid_sql_query = DruidOperator(
        task_id='run_druid_sql_query',
        druid_conn_id=DRUID_CONN_ID,
        sql_query="SELECT COUNT(*) FROM wikipedia WHERE __time >= CURRENT_TIMESTAMP - INTERVAL '1' DAY",
    )

    # Example of a more complex Druid JSON query
    # This assumes a 'wikipedia' datasource exists
    run_druid_json_query = DruidOperator(
        task_id='run_druid_json_query',
        druid_conn_id=DRUID_CONN_ID,
        json_query={
            "queryType": "timeseries",
            "dataSource": "wikipedia",
            "granularity": "day",
            "intervals": ["2016-06-01T00:00:00.000Z/2016-06-02T00:00:00.000Z"],
            "aggregations": [
                {"type": "count", "name": "total_events"}
            ]
        },
    )

    run_druid_sql_query >> run_druid_json_query

view raw JSON →