{"id":6581,"library":"dagster-databricks","title":"Dagster Databricks Integration","description":"The `dagster-databricks` library provides Dagster ops and resources for interacting with Databricks. This includes the `DatabricksClientResource` for connecting to Databricks workspaces and op factories (`create_databricks_submit_run_op`, `create_databricks_run_now_op`) for running Databricks jobs (notebooks, JARs, Python scripts, wheels). As a Dagster library, its version (currently 0.29.0) is released in lockstep with the core `dagster` library (1.13.0), typically on a weekly or bi-weekly cadence.","status":"active","version":"0.29.0","language":"en","source_language":"en","source_url":"https://github.com/dagster-io/dagster/tree/master/python_modules/libraries/dagster-databricks","tags":["data orchestration","etl","databricks","spark","notebooks","dagster","mlops"],"install":[{"cmd":"pip install dagster-databricks","lang":"bash","label":"Install dagster-databricks"}],"dependencies":[{"reason":"Core Dagster framework is required as this is an integration library.","package":"dagster","optional":false},{"reason":"Provides the underlying client for interacting with the Databricks API.","package":"databricks-sdk","optional":false}],"imports":[{"symbol":"DatabricksClientResource","correct":"from dagster_databricks import DatabricksClientResource"},{"symbol":"create_databricks_submit_run_op","correct":"from dagster_databricks import create_databricks_submit_run_op"},{"symbol":"create_databricks_run_now_op","correct":"from dagster_databricks import create_databricks_run_now_op"},{"note":"Less commonly imported directly by users, primarily used within resources or for advanced programmatic access.","symbol":"DatabricksClient","correct":"from dagster_databricks import DatabricksClient"}],"quickstart":{"code":"from dagster import Definitions, EnvVar, job\nfrom dagster_databricks import DatabricksClientResource, create_databricks_submit_run_op\n\n# Payload for the Databricks Jobs Runs Submit API.\n# Replace the notebook path and cluster settings with your own.\nDATABRICKS_JOB_CONFIGURATION = {\n    \"run_name\": \"dagster_quickstart_run\",\n    \"notebook_task\": {\n        \"notebook_path\": \"/Users/your_user@example.com/my_dagster_notebook\"\n    },\n    \"new_cluster\": {\n        \"spark_version\": \"12.2.x-scala2.12\",\n        \"node_type_id\": \"i3.xlarge\",\n        \"num_workers\": 1\n    }\n}\n\n# Build an op that submits the run and polls it until completion\nrun_databricks_notebook = create_databricks_submit_run_op(\n    databricks_job_configuration=DATABRICKS_JOB_CONFIGURATION\n)\n\n@job\ndef databricks_example_job():\n    run_databricks_notebook()\n\n# The op requires a Databricks client resource under the key 'databricks'.\n# DATABRICKS_HOST should be 'https://<workspace-url>.cloud.databricks.com';\n# DATABRICKS_TOKEN is a personal access token.\ndefs = Definitions(\n    jobs=[databricks_example_job],\n    resources={\n        \"databricks\": DatabricksClientResource(\n            host=EnvVar(\"DATABRICKS_HOST\"),\n            token=EnvVar(\"DATABRICKS_TOKEN\")\n        )\n    }\n)\n\n# To run this example:\n# 1. Set DATABRICKS_HOST and DATABRICKS_TOKEN environment variables.\n# 2. Ensure '/Users/your_user@example.com/my_dagster_notebook' exists in your Databricks workspace.\n# 3. Save this code as a Python file (e.g., `databricks_defs.py`).\n# 4. Run `dagster dev -f databricks_defs.py` and launch the job from the Dagster UI.","lang":"python","description":"This quickstart demonstrates how to define a Dagster job that uses `dagster-databricks` to submit and monitor a one-off Databricks notebook run. It configures the `DatabricksClientResource` from environment variables via `EnvVar`, and uses `create_databricks_submit_run_op` with a Databricks Jobs Runs Submit API payload describing the notebook task and cluster configuration. 
To make this runnable, you must set the `DATABRICKS_HOST` and `DATABRICKS_TOKEN` environment variables and provide a valid `notebook_path` that exists in your Databricks workspace."},"warnings":[{"fix":"Always upgrade `dagster` and all `dagster-*` libraries simultaneously to the compatible versions listed in the Dagster release notes or dependency manifests.","message":"The `dagster-databricks` library versions are tied to the core `dagster` library versions. For example, `dagster-databricks==0.29.0` is compatible with `dagster==1.13.0`. Upgrading one without the other can lead to import errors or runtime issues due to API changes.","severity":"breaking","affected_versions":"<0.29.0"},{"fix":"Ensure `DATABRICKS_HOST` is the full workspace URL (e.g., `https://dbc-xxxx.cloud.databricks.com`) and `DATABRICKS_TOKEN` is a valid Databricks personal access token with sufficient permissions (e.g., 'Can Manage' on the relevant jobs and clusters).","message":"Databricks authentication requires correctly configuring `DATABRICKS_HOST` and `DATABRICKS_TOKEN` (or equivalent secrets). Incorrect host URLs (e.g., missing the `https://` scheme or the workspace domain) or invalid personal access tokens are common setup errors.","severity":"gotcha","affected_versions":"All"},{"fix":"Migrate PySpark workloads from `databricks_pyspark_step_launcher` to Dagster Pipes using `PipesDatabricksClient`. Refer to the official Dagster Pipes documentation for migration guidance.","message":"The `databricks_pyspark_step_launcher` is superseded by Dagster Pipes (`PipesDatabricksClient`) as the recommended way to run Dagster-orchestrated code on Databricks. If you are upgrading from older `dagster-databricks` versions, your PySpark job definitions will need to be updated.","severity":"deprecated","affected_versions":">=0.21.0"},{"fix":"Consult the official Databricks Jobs API documentation for the exact schema required for job tasks and cluster configurations, and ensure the job configuration you pass to the op mirrors that structure precisely.","message":"The job configuration passed to `dagster-databricks` ops (e.g., the `notebook_task`, `new_cluster`, `spark_jar_task`, or `python_wheel_task` blocks) must conform to the Databricks Jobs API specification. Small discrepancies in key names or structure can lead to job submission failures.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}