dbt-spark
dbt-spark is the Apache Spark adapter plugin for dbt (data build tool), enabling data analysts and engineers to transform data in Apache Spark using SQL. It leverages Spark's distributed computing capabilities for efficient data transformation. The current version is 1.10.1, and new versions are typically released in alignment with `dbt-core`'s major and minor releases.
Warnings
- breaking The minor versions of `dbt-spark` and `dbt-core` must match for correct dependency resolution and functionality (e.g., `dbt-spark==1.9.x` requires `dbt-core==1.9.x`). Mixing versions can lead to errors.
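  The matching rule above can be sketched as a small version check (illustrative only; the `minor_compatible` helper is not part of dbt):

  ```python
  def minor_compatible(core_version: str, adapter_version: str) -> bool:
      # dbt requires the adapter and core to share the same major.minor
      # (e.g. dbt-spark 1.9.x pairs with dbt-core 1.9.x)
      return core_version.split(".")[:2] == adapter_version.split(".")[:2]

  print(minor_compatible("1.9.4", "1.9.0"))   # True: both are 1.9.x
  print(minor_compatible("1.10.1", "1.9.0"))  # False: 1.10 vs 1.9
  ```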
- deprecated For Databricks users, the `dbt-databricks` adapter is now the recommended choice over `dbt-spark`, offering easier setup, Unity Catalog support, and better defaults. Migration is advised.
- gotcha The default `incremental_strategy` for `dbt-spark` is `append`, whereas for the `dbt-databricks` adapter, it defaults to `merge`. This can lead to different behavior in incremental models if migrating or using both adapters.
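  To avoid depending on adapter-specific defaults, the strategy can be pinned explicitly in the model config. A minimal sketch (`my_incremental_model` is a hypothetical model; on `dbt-spark`, `merge` additionally requires a file format that supports it, such as Delta):

  ```sql
  -- models/my_incremental_model.sql (hypothetical example model)
  {{ config(
      materialized='incremental',
      incremental_strategy='merge',  -- pinned explicitly, not left to the adapter default
      file_format='delta',           -- merge on dbt-spark needs a merge-capable format
      unique_key='id'
  ) }}

  select 1 as id, 'hello dbt-spark' as message
  ```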
- gotcha When connecting to a Spark Thrift server, ensure the target `schema` (database) specified in `profiles.yml` already exists in Spark. If it doesn't, dbt will raise a 'Cannot set database in spark!' runtime error.
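  One way to avoid this error is to create the target schema up front, e.g. with `spark-sql` or beeline connected to the Thrift server (sketch; `analytics` is the schema name used in the quickstart profile in this document):

  ```sql
  -- Run once against the Thrift server before the first dbt invocation
  CREATE SCHEMA IF NOT EXISTS analytics;
  ```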
- gotcha Using `dbt-spark` with a schema containing a large number of tables (e.g., thousands) can lead to extremely slow `dbt run` parsing times. This is due to Spark's lack of an information schema layer, forcing dbt to 'discover' all tables.
Install
- pip install dbt-core dbt-spark
- pip install "dbt-spark[ODBC]"
- pip install "dbt-spark[PyHive]"
Imports
- dbt-spark
dbt-spark is used primarily via the dbt CLI and configuration files (profiles.yml), not via direct Python import statements in user projects.
Quickstart
import os
# This quickstart demonstrates configuring dbt-spark with a local Spark Thrift server.
# First, ensure you have Docker installed and the dbt-spark local environment set up.
# From the dbt-adapters/dbt-spark directory, run:
# docker-compose up -d
# Create a profiles.yml file in your dbt project's ~/.dbt/ directory or project root
profiles_content = '''
spark_local_dev:
  target: dev
  outputs:
    dev:
      type: spark
      method: thrift
      host: 127.0.0.1
      port: 10000
      user: dbt
      schema: analytics
      connect_retries: 5
      connect_timeout: 60
      retry_all: true
'''
# For demonstration, we'll write it to a temporary location
# In a real scenario, this goes to ~/.dbt/profiles.yml
# or in your dbt project folder directly.
profile_path = os.path.expanduser('~/.dbt/profiles.yml') # For a real setup
# Or for a quick test in a temporary project directory:
# profile_path = 'dbt_project/profiles.yml'
# Ensure the directory exists if writing to ~/.dbt/
os.makedirs(os.path.dirname(profile_path), exist_ok=True)
with open(profile_path, 'w') as f:
    f.write(profiles_content)
print(f"profiles.yml written to {profile_path}")
print("Next, initialize a dbt project: dbt init my_spark_project")
print("Select 'spark_local_dev' as your profile when prompted.")
print("Then, create a model, e.g., models/my_model.sql:")
print("---\nSELECT 1 AS id, 'hello dbt-spark' AS message\n---")
print("Run your dbt models: dbt run --profile spark_local_dev")