Databricks Provider for Apache Airflow

7.12.0 · active · verified Sun Mar 29

The `apache-airflow-providers-databricks` package provides operators, hooks, and sensors to interact with Databricks, a unified data and analytics platform. It allows users to orchestrate Databricks notebooks, jobs, and SQL queries from Apache Airflow. The provider is actively maintained by the Apache Airflow community and is currently on version 7.12.0, with regular updates to support new Databricks features and Airflow versions.

Install
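The provider is distributed on PyPI as `apache-airflow-providers-databricks` and can be installed with pip; pinning to the version documented above is optional:

```shell
# Install the Databricks provider into an existing Airflow environment
pip install apache-airflow-providers-databricks

# Or pin to the version documented on this page
pip install "apache-airflow-providers-databricks==7.12.0"
```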

Imports
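The quickstart below needs only `DatabricksRunNowOperator`; a few other commonly used classes from the package are shown for orientation (the sensor module path assumes a recent provider release):

```python
# Operators for triggering and submitting Databricks runs
from airflow.providers.databricks.operators.databricks import (
    DatabricksRunNowOperator,
    DatabricksSubmitRunOperator,
)

# Low-level hook wrapping the Databricks REST API
from airflow.providers.databricks.hooks.databricks import DatabricksHook
```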

Quickstart

This quickstart demonstrates how to use the `DatabricksRunNowOperator` to trigger an existing Databricks job from Airflow. It requires an Airflow connection named `databricks_default` to be configured with your Databricks workspace URL and a Personal Access Token (PAT). Replace `YOUR_DATABRICKS_JOB_ID` with the ID of an existing Databricks job.

from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

# Configure your Databricks connection in Airflow UI (Admin -> Connections)
# Conn Id: 'databricks_default'
# Conn Type: 'Databricks'
# Host: Your Databricks workspace URL (e.g., https://adb-xxxxxxxx.xx.azuredatabricks.net
#       on Azure, or https://dbc-xxxxxxxx-xxxx.cloud.databricks.com on AWS)
# Password: Your Databricks Personal Access Token (PAT) (recommended over username/password)
# See https://airflow.apache.org/docs/apache-airflow-providers-databricks/stable/connections/databricks.html

with DAG(
    dag_id="databricks_run_notebook_example",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
    tags=["databricks", "example"],
    doc_md="""
    ### Databricks Run Notebook Example DAG
    This DAG demonstrates how to use `DatabricksRunNowOperator` to execute
    an existing Databricks job (which might run a notebook or JAR).

    **Prerequisites:**
    1. An Airflow connection named `databricks_default` configured with your Databricks workspace details and PAT.
    2. An existing Databricks job. Replace `YOUR_DATABRICKS_JOB_ID` with your actual Job ID.
    """,
) as dag:
    run_databricks_job = DatabricksRunNowOperator(
        task_id="run_existing_databricks_job",
        databricks_conn_id="databricks_default",
        job_id="YOUR_DATABRICKS_JOB_ID", # Replace with your Databricks Job ID
        # You can pass optional parameters through to the run, matching the
        # job's task type, e.g.:
        # notebook_params={"input_param": "value_from_airflow"},
        # or python_params, jar_params, spark_submit_params, depending on job type
    )
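If you do not have a pre-existing Databricks job, the provider also ships `DatabricksSubmitRunOperator`, which submits a one-time run against an ephemeral cluster. A minimal sketch, assuming the same `databricks_default` connection; the cluster spec and notebook path are illustrative placeholders you must adjust to your workspace:

```python
from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="databricks_submit_run_example",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
    tags=["databricks", "example"],
) as dag:
    submit_run = DatabricksSubmitRunOperator(
        task_id="submit_one_time_run",
        databricks_conn_id="databricks_default",
        # Placeholder cluster spec: pick a spark_version and node_type_id
        # that exist in your workspace (node types differ per cloud).
        new_cluster={
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
        },
        # Placeholder notebook path in your workspace
        notebook_task={"notebook_path": "/Users/your.name@example.com/my_notebook"},
    )
```

Unlike `DatabricksRunNowOperator`, no Job ID is required: the run specification is defined entirely in the DAG and mapped onto the Databricks Runs Submit API.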