Databricks Provider for Apache Airflow

The `apache-airflow-providers-databricks` package provides operators, hooks, and sensors to interact with Databricks, a unified data and analytics platform. It allows users to orchestrate Databricks notebooks, jobs, and SQL queries from Apache Airflow. The provider is actively maintained by the Apache Airflow community and is currently on version 7.12.0, with regular updates to support new Databricks features and Airflow versions.

pip install apache-airflow-providers-databricks
error ModuleNotFoundError: No module named 'airflow.providers.databricks'
cause The 'apache-airflow-providers-databricks' package is not installed in the Airflow environment.
fix
Install the package using 'pip install apache-airflow-providers-databricks' and restart the Airflow services.
error ModuleNotFoundError: No module named 'airflow.providers.databricks.operators.databricks_sql'
cause The 'apache-airflow-providers-databricks' package is not installed or not accessible in the Airflow environment.
fix
Ensure the package is installed using 'pip install apache-airflow-providers-databricks' and restart the Airflow services.
error Databricks connection type not showing in Airflow
cause The 'apache-airflow-providers-databricks' package is not installed in the Airflow webserver container.
fix
Install the package in the webserver container using 'pip install apache-airflow-providers-databricks' and restart the Airflow services.
error Invalid Access Token: 403 Forbidden Error
cause This error typically indicates an issue with Databricks authentication, such as an incorrect, expired, or improperly configured Personal Access Token (PAT) or Service Principal credentials in the Airflow connection.
fix
Verify that the Databricks connection in Airflow (usually databricks_default) has the correct host URL (e.g., https://your-workspace.cloud.databricks.com), the Login field is set to 'token', and the Password field contains a valid, unexpired Databricks Personal Access Token. For some configurations, the token might also need to be specified in the 'Extra' field as {"token": "your_pat"}. If using service principals, ensure the client ID and secret are correctly configured.
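The same connection can also be supplied declaratively through an `AIRFLOW_CONN_<CONN_ID>` environment variable using Airflow's JSON connection format (Airflow 2.3+). The sketch below only illustrates the shape of that definition; the workspace URL and token are placeholders, and in a real deployment the variable would normally be set in the container or host environment rather than from Python code.

import json
import os

# Placeholder values -- substitute your workspace URL and a valid PAT.
os.environ["AIRFLOW_CONN_DATABRICKS_DEFAULT"] = json.dumps(
    {
        "conn_type": "databricks",
        "host": "https://your-workspace.cloud.databricks.com",
        "login": "token",
        "password": "YOUR_DATABRICKS_PAT",          # the PAT goes in the Password field
        "extra": {"token": "YOUR_DATABRICKS_PAT"},  # some setups also expect it here
    }
)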
error Databricks job cancelled before completion (Airflow task timeout)
cause The Airflow task or operator (e.g., `DatabricksRunNowOperator`) has a timeout configured that is shorter than the actual execution time of the Databricks job, causing Airflow to cancel the job prematurely.
fix
Increase the `timeout_seconds` parameter on `DatabricksSubmitRunOperator` (for jobs triggered with `DatabricksRunNowOperator`, the timeout is part of the Databricks job definition itself) to a value greater than the expected Databricks job runtime. Also check the `execution_timeout` parameter at the task or DAG level in Airflow, which can cancel the task independently of any Databricks-side setting.
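As an illustration, the sketch below raises the Airflow-side `execution_timeout` on a `DatabricksRunNowOperator`; the job ID and the four-hour budget are placeholders.

from datetime import timedelta

from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

run_existing_job = DatabricksRunNowOperator(
    task_id="run_existing_job",
    databricks_conn_id="databricks_default",
    job_id="YOUR_DATABRICKS_JOB_ID",  # placeholder job ID
    # Airflow-side limit; must exceed the expected Databricks job runtime,
    # otherwise Airflow cancels the run prematurely.
    execution_timeout=timedelta(hours=4),
)
# For one-time runs created with DatabricksSubmitRunOperator, the analogous
# Databricks-side limit is its timeout_seconds parameter.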
breaking Minimum Airflow version requirement for the provider package increases with new releases. For example, provider version 3.0.0 required Airflow 2.2.0+, and provider 7.8.0+ requires Airflow 2.11.0+. Installing a newer provider version on an older Airflow installation may lead to Airflow core being auto-upgraded by pip, causing dependency conflicts.
fix Always check the provider's `__init__.py` or documentation for the minimum supported Airflow version before upgrading. Ensure your Airflow environment meets or exceeds this requirement.
deprecated The `astro-provider-databricks` (Astronomer's Databricks provider) has been deprecated. Its features were integrated into the official `apache-airflow-providers-databricks` package from version 6.8.0. Using the deprecated provider may lead to missing features, bugs, or security vulnerabilities.
fix Migrate your DAGs to use `from airflow.providers.databricks.<module> import ...` import paths. Uninstall `astro-provider-databricks` and install `apache-airflow-providers-databricks`.
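In most DAGs the migration amounts to swapping import paths. A sketch of the before/after, assuming provider 6.8.0+ is installed (the deprecated `astro_databricks` module path is shown from memory and should be verified against the package you are removing):

# Before (deprecated astro-provider-databricks):
# from astro_databricks import DatabricksWorkflowTaskGroup, DatabricksNotebookOperator

# After (official apache-airflow-providers-databricks, 6.8.0+):
from airflow.providers.databricks.operators.databricks import DatabricksNotebookOperator
from airflow.providers.databricks.operators.databricks_workflow import (
    DatabricksWorkflowTaskGroup,
)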
breaking The behavior of `DatabricksSqlHook.run()` changed in provider version 4.0. Previously, it returned a tuple of `(cursor description, results)`. In 4.x and later, it returns only `results`, to conform with other `DBApiHook` implementations. Custom hooks or TaskFlow code relying on the old behavior will break.
fix If you relied on the cursor description, retrieve it via `hook.last_description` after the `run` method completes. Adapt your code to expect only the results array from `run()`.
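A minimal sketch of the new contract, assuming a configured `databricks_default` connection and a SQL warehouse named `my_warehouse` (both placeholders):

from airflow.providers.databricks.hooks.databricks_sql import DatabricksSqlHook

hook = DatabricksSqlHook(
    databricks_conn_id="databricks_default",
    sql_endpoint_name="my_warehouse",  # placeholder warehouse name
)

# Provider 4.x+: run() returns only the results produced by the handler.
rows = hook.run("SELECT 1 AS ok", handler=lambda cursor: cursor.fetchall())

# The cursor description is no longer part of the return value;
# read it from the hook after run() completes.
column_metadata = hook.last_description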
gotcha Authentication using Databricks username and password is discouraged and not supported for some operators like `DatabricksSqlOperator`. This method is less secure and flexible than using Personal Access Tokens (PATs) or OAuth with Service Principals.
fix Use a Databricks Personal Access Token (PAT) for authentication, preferably a PAT belonging to a Databricks Service Principal. Configure it in the Airflow connection's 'Password' field or 'Extra' JSON as `{"token": "YOUR_PAT"}`.
gotcha When configuring a Databricks connection in Airflow, ensure the 'Host' field contains the full Databricks workspace URL (e.g., `https://adb-xxxxxxxx.xx.databricks.com/`). Missing or incorrect host can lead to connection errors.
fix Verify the 'Host' in your Airflow Databricks connection is the complete workspace URL, copied directly from your Databricks instance.
breaking Building certain Python packages (e.g., `lz4`, `numpy`, `psycopg2`) that include C extensions requires a C compiler and other build tools. In `alpine`-based Python images, these tools are often missing by default, leading to `command 'gcc' failed: No such file or directory` or similar build errors during installation.
fix For `alpine`-based Docker images, install the `build-base` package: `apk add build-base`. For Debian/Ubuntu, install `build-essential`: `apt-get update && apt-get install -y build-essential`. Always include these commands in your Dockerfile before `pip install` commands that might build C extensions.
| python | os / libc | status | wheel | install | import | disk |
|--------|---------------|-------------|-------|---------|--------|------|
| 3.10 | alpine (musl) | build_error | - | - | - | - |
| 3.10 | alpine (musl) | - | - | - | - | - |
| 3.10 | slim (glibc) | sdist | | 36.9s | 4.46s | 554M |
| 3.10 | slim (glibc) | - | - | | 3.96s | 543M |
| 3.11 | alpine (musl) | build_error | - | - | - | - |
| 3.11 | alpine (musl) | - | - | - | - | - |
| 3.11 | slim (glibc) | sdist | | 35.4s | 6.98s | 598M |
| 3.11 | slim (glibc) | - | - | | 6.04s | 586M |
| 3.12 | alpine (musl) | build_error | - | - | - | - |
| 3.12 | alpine (musl) | - | - | - | - | - |
| 3.12 | slim (glibc) | sdist | | 31.6s | 7.13s | 579M |
| 3.12 | slim (glibc) | - | - | | 6.60s | 568M |
| 3.13 | alpine (musl) | build_error | - | - | - | - |
| 3.13 | alpine (musl) | - | - | - | - | - |
| 3.13 | slim (glibc) | sdist | | 30.3s | 6.78s | 581M |
| 3.13 | slim (glibc) | - | - | | 6.51s | 569M |
| 3.9 | alpine (musl) | build_error | - | 0.1s | - | - |
| 3.9 | alpine (musl) | - | - | - | - | - |
| 3.9 | slim (glibc) | timeout | - | - | - | - |
| 3.9 | slim (glibc) | - | - | - | - | - |

This quickstart demonstrates how to use the `DatabricksRunNowOperator` to trigger an existing Databricks job from Airflow. It requires an Airflow connection named `databricks_default` to be configured with your Databricks workspace URL and a Personal Access Token (PAT). Replace `YOUR_DATABRICKS_JOB_ID` with the ID of an existing Databricks job.

from __future__ import annotations

import os
import pendulum

from airflow.models.dag import DAG
from airflow.providers.databricks.operators.databricks import DatabricksRunNowOperator

# Configure your Databricks connection in Airflow UI (Admin -> Connections)
# Conn Id: 'databricks_default'
# Conn Type: 'Databricks'
# Host: Your Databricks workspace URL (e.g., https://adb-xxxxxxxx.xx.databricks.com)
# Password: Your Databricks Personal Access Token (PAT) (recommended over username/password)
# See https://airflow.apache.org/docs/apache-airflow-providers-databricks/stable/connections/databricks.html

with DAG(
    dag_id="databricks_run_notebook_example",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
    tags=["databricks", "example"],
    doc_md="""
    ### Databricks Run Notebook Example DAG
    This DAG demonstrates how to use `DatabricksRunNowOperator` to execute
    an existing Databricks job (which might run a notebook or JAR).

    **Prerequisites:**
    1. An Airflow connection named `databricks_default` configured with your Databricks workspace details and PAT.
    2. An existing Databricks job. Replace `YOUR_DATABRICKS_JOB_ID` with your actual Job ID.
    """,
) as dag:
    run_databricks_job = DatabricksRunNowOperator(
        task_id="run_existing_databricks_job",
        databricks_conn_id="databricks_default",
        job_id="YOUR_DATABRICKS_JOB_ID", # Replace with your Databricks Job ID
        # You can pass optional parameters to the job run
        # notebook_params={"input_param": "value_from_airflow"},
        # or python_params, jar_params, spark_submit_params, etc., depending on the job's task type
    )
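
Beyond triggering an existing job, `DatabricksSubmitRunOperator` can create a one-time run on an ephemeral cluster. The sketch below is illustrative only: the Spark version, node type, and notebook path are placeholders that must be replaced with values valid for your workspace.

from __future__ import annotations

import pendulum

from airflow.models.dag import DAG
from airflow.providers.databricks.operators.databricks import DatabricksSubmitRunOperator

with DAG(
    dag_id="databricks_submit_run_example",
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    schedule=None,
    catchup=False,
    tags=["databricks", "example"],
) as dag:
    submit_notebook_run = DatabricksSubmitRunOperator(
        task_id="submit_one_time_notebook_run",
        databricks_conn_id="databricks_default",
        # Placeholder cluster spec -- adjust spark_version, node_type_id, and size
        # to match your cloud and workspace.
        new_cluster={
            "spark_version": "13.3.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 1,
        },
        notebook_task={"notebook_path": "/Shared/example_notebook"},  # placeholder path
        timeout_seconds=3600,  # cancel the Databricks run if it exceeds one hour
    )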