{"id":1987,"library":"dbt-spark","title":"dbt-spark","description":"dbt-spark is the Apache Spark adapter plugin for dbt (data build tool), enabling data analysts and engineers to transform data in Apache Spark using SQL. It leverages Spark's distributed computing capabilities for efficient data transformation. The current version is 1.10.1, and it typically releases new versions in alignment with `dbt-core`'s major and minor releases.","status":"active","version":"1.10.1","language":"en","source_language":"en","source_url":"https://github.com/dbt-labs/dbt-adapters/tree/main/dbt-spark","tags":["dbt","spark","etl","data transformation","adapter","analytics engineering"],"install":[{"cmd":"pip install dbt-core dbt-spark","lang":"bash","label":"Base Installation"},{"cmd":"pip install \"dbt-spark[ODBC]\"","lang":"bash","label":"For ODBC connections"},{"cmd":"pip install \"dbt-spark[PyHive]\"","lang":"bash","label":"For Thrift or HTTP connections"}],"dependencies":[{"reason":"Essential for dbt functionality; minor versions of dbt-spark and dbt-core should match for compatibility.","package":"dbt-core","optional":false},{"reason":"Required for connecting to Spark via ODBC driver.","package":"pyodbc","optional":true},{"reason":"Required for connecting to Spark via Thrift or HTTP methods.","package":"PyHive","optional":true},{"reason":"Often used for session connections or when running Spark locally.","package":"pyspark","optional":true}],"imports":[{"note":"dbt adapters are loaded by dbt-core internally; users interact with them through dbt commands and profile configurations.","symbol":"dbt-spark","correct":"dbt-spark is used primarily via the dbt CLI and configuration files (profiles.yml), not via direct Python import statements in user projects."}],"quickstart":{"code":"import os\n\n# This quickstart demonstrates configuring dbt-spark with a local Spark Thrift server.\n# First, ensure you have Docker installed and the dbt-spark local environment set up.\n# From the 
dbt-adapters/dbt-spark directory, run:\n# docker-compose up -d\n\n# Create a profiles.yml file in ~/.dbt/ or in your dbt project root\nprofiles_content = '''\nspark_local_dev:\n  target: dev\n  outputs:\n    dev:\n      type: spark\n      method: thrift\n      host: 127.0.0.1\n      port: 10000\n      user: dbt\n      schema: analytics\n      connect_retries: 5\n      connect_timeout: 60\n      retry_all: true\n'''\n\n# dbt looks for profiles.yml in ~/.dbt/ by default,\n# or in the dbt project directory itself.\n# Note: this overwrites any existing ~/.dbt/profiles.yml.\nprofile_path = os.path.expanduser('~/.dbt/profiles.yml')  # default location\n# Or, for a quick test in a temporary project directory:\n# profile_path = 'dbt_project/profiles.yml'\n\n# Ensure the ~/.dbt/ directory exists\nos.makedirs(os.path.dirname(profile_path), exist_ok=True)\n\nwith open(profile_path, 'w') as f:\n    f.write(profiles_content)\n\nprint(f\"profiles.yml created at {profile_path}\")\nprint(\"Next, initialize a dbt project: dbt init my_spark_project\")\nprint(\"Select 'spark_local_dev' as your profile when prompted.\")\nprint(\"Then, create a model, e.g., models/my_model.sql:\")\nprint(\"---\\nSELECT 1 AS id, 'hello dbt-spark' AS message\\n---\")\nprint(\"Run your dbt models: dbt run --profile spark_local_dev\")\n","lang":"python","description":"This quickstart outlines the `profiles.yml` configuration for connecting dbt to a local Spark Thrift server, often set up via docker-compose (as demonstrated in the dbt-spark repository README). It also suggests a basic SQL model for validation."},"warnings":[{"fix":"Always install `dbt-core` and `dbt-spark` with matching minor versions: `pip install dbt-core==X.Y.Z dbt-spark==X.Y.Z`.","message":"The minor versions of `dbt-spark` and `dbt-core` must match for correct dependency resolution and functionality (e.g., `dbt-spark==1.9.x` requires `dbt-core==1.9.x`). 
Mixing versions can lead to errors.","severity":"breaking","affected_versions":"All versions"},{"fix":"Migrate your project from `dbt-spark` to `dbt-databricks`. Install `dbt-databricks` and update your `profiles.yml` configuration.","message":"For Databricks users, the `dbt-databricks` adapter is now the recommended choice over `dbt-spark`, offering easier setup, Unity Catalog support, and better defaults. Migration is advised.","severity":"deprecated","affected_versions":"All versions when using Databricks"},{"fix":"Explicitly set `incremental_strategy: 'merge'` or `incremental_strategy: 'append'` in your incremental models to ensure consistent behavior across adapters.","message":"The default `incremental_strategy` for `dbt-spark` is `append`, whereas for the `dbt-databricks` adapter, it defaults to `merge`. This can lead to different behavior in incremental models if migrating or using both adapters.","severity":"gotcha","affected_versions":"All versions, especially when considering migration to dbt-databricks"},{"fix":"Manually create the database in Spark (e.g., `CREATE DATABASE your_schema_name;`) before running dbt. Also, ensure the `default` namespace exists when using Thrift.","message":"When connecting to a Spark Thrift server, ensure the target `schema` (database) specified in `profiles.yml` already exists in Spark. If it doesn't, dbt will raise a 'Cannot set database in spark!' runtime error.","severity":"gotcha","affected_versions":"All versions using Thrift connections"},{"fix":"Consider using dedicated schemas for dbt models with fewer tables or exploring adapter-specific workarounds (if available) for `list_relations_without_caching` limitations.","message":"Using `dbt-spark` with a schema containing a large number of tables (e.g., thousands) can lead to extremely slow `dbt run` parsing times. 
This is due to Spark's lack of an information schema layer, forcing dbt to 'discover' all tables.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}