# dbt-glue adapter for AWS Glue
dbt-glue is a dbt adapter that enables data analysts and engineers to transform data using AWS Glue's Spark engine and interactive sessions. It supports various file formats, including Apache Iceberg, Delta Lake, and Apache Hudi, allowing users to build and manage data pipelines in an AWS data lake environment. The library is actively maintained with frequent updates, aligning with `dbt-core` releases. It is currently at version 1.10.19 and requires Python >=3.9.
## Common errors

**`AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table my_table. StorageDescriptor#InputFormat cannot be null for table: my_table`**

- **Cause:** This error often occurs when dbt tries to access an Iceberg table on Glue without the necessary `glue_catalog.` prefix, or when the Iceberg Spark extensions are not properly configured in the dbt profile.
- **Fix:** Ensure your `profiles.yml` includes the correct `conf` for the Iceberg Spark extensions. For custom queries or tests, you may need to explicitly reference tables as `glue_catalog.your_database.your_table`. You can also set `--conf spark.sql.defaultCatalog=glue_catalog` in your profile.
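As a sketch, a singular dbt test that queries an Iceberg table directly could use the fully qualified name (the database, table, and file names below are placeholders, not part of dbt-glue's API):

```sql
-- tests/assert_no_null_ids.sql (hypothetical example)
SELECT id
FROM glue_catalog.your_database.your_table
WHERE id IS NULL
```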
**`NameError: name 'null' is not defined`**

- **Cause:** When building seeds, especially CSVs containing nested JSON strings or complex data types, Spark (used by Glue) may misinterpret `null` values or the data format.
- **Fix:** Simplify seed data, especially for testing. Avoid complex structures such as nested JSON within CSV seeds where possible. Ensure the generated Spark code handles `null` values correctly, or adjust the seed data to a simpler format for initial ingestion.
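One way to sidestep the problem, sketched below with Python's standard `csv` module, is to keep any JSON payload as a single quoted CSV column so it is loaded as an opaque string rather than interpreted as nested structure. The column names and data here are illustrative only:

```python
import csv
import io

# A seed CSV where the JSON payload is one quoted string column; embedded
# double quotes are escaped by doubling them, per the CSV convention.
seed_csv = '''id,payload
1,"{""key"": ""value""}"
2,"{""key"": null}"
'''

rows = list(csv.DictReader(io.StringIO(seed_csv)))
print(rows[1]["payload"])  # the JSON, including its null, stays a plain string
```

Parsing the payload can then happen downstream in a model, after the seed has been ingested as text.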
**Connection timeout during Glue interactive session provisioning (`dbt run`)**

- **Cause:** The AWS Glue interactive session took too long to provision, often due to high demand, incorrect worker configuration, or network issues.
- **Fix:** Increase `session_provisioning_timeout_in_seconds` in your `profiles.yml`. Review your `workers` and `worker_type` settings to ensure they are appropriate for your workload and available in your AWS region. Check AWS Glue service quotas and status.
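For instance, a `profiles.yml` excerpt with a longer provisioning timeout might look like the following (the values are illustrative; tune them to your workload and region):

```yaml
dev:
  type: glue
  workers: 3
  worker_type: G.1X
  session_provisioning_timeout_in_seconds: 300
```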
## Warnings
- **breaking** Beginning with dbt Core v1.8, installing a dbt adapter no longer automatically installs `dbt-core`. You must explicitly install both `dbt-core` and `dbt-glue` to avoid missing dependencies or version conflicts.
- **gotcha** Python model support and Amazon S3 Tables support are currently experimental. They may have limitations or breaking changes in future versions; Python models require AWS Glue 4.0+ and the Iceberg file format.
- **gotcha** When working with Iceberg tables on AWS Glue, especially in dbt tests or certain queries, you might need to explicitly prefix table names with `glue_catalog.` (e.g., `glue_catalog.your_database.your_table`) in custom SQL or specific configurations if the adapter's macros do not handle it automatically.
- **deprecated** dbt Core v1.10 introduces deprecation warnings for several patterns, including duplicate keys in the same YAML file, unexpected Jinja blocks, and the `--models` / `--model` / `-m` CLI flags (which were renamed to `--select` / `-s` in v0.21).
- **gotcha** Incorrect IAM permissions for the Glue interactive session role can lead to `AccessDeniedException` errors, preventing dbt-glue from accessing S3 buckets or the Glue Data Catalog.
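As a rough illustration, the session role's policy needs statements along these lines. This is only a sketch, not the complete permission set dbt-glue requires (consult the AWS Glue and dbt-glue documentation for the authoritative list), and the bucket name is a placeholder:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["glue:GetDatabase", "glue:GetTable", "glue:CreateTable", "glue:UpdateTable"],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::your-s3-bucket",
        "arn:aws:s3:::your-s3-bucket/*"
      ]
    }
  ]
}
```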
## Install

```shell
pip install dbt-core dbt-glue
```
## Imports

- `dbt.config`

  ```python
  def model(dbt, spark):
      dbt.config(materialized='table', file_format='iceberg')
  ```

- `dbt.ref`

  ```python
  source_df = dbt.ref("my_sql_model")
  ```

- `dbt.source`

  ```python
  raw_data = dbt.source('my_source', 'my_table')
  ```
## Quickstart
Example `profiles.yml` for dbt-glue (typically placed at `~/.dbt/profiles.yml`):

```yaml
dbt_glue_project:
  target: dev
  outputs:
    dev:
      type: glue
      query-comment: dbt-glue-example
      role_arn: "{{ env_var('DBT_ROLE_ARN', 'arn:aws:iam::123456789012:role/GlueInteractiveSessionRole') }}"
      region: "{{ env_var('AWS_REGION', 'us-east-1') }}"
      workers: 5
      worker_type: G.1X
      schema: dbt_glue_demo_schema
      database: dbt_glue_demo_db
      session_provisioning_timeout_in_seconds: 120
      location: "{{ env_var('DBT_S3_LOCATION', 's3://your-s3-bucket/dbt-glue/') }}"
      glue_version: "4.0"
      conf: "--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager --conf spark.sql.catalog.glue_catalog.lock.table=DbtGlueLockTable --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
```
Example `dbt_project.yml` (assuming project name `dbt_glue_project`):

```yaml
name: 'dbt_glue_project'
version: '1.0.0'
config-version: 2
profile: 'dbt_glue_project'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory that stores compiled SQL files
clean-targets:
  - "target"
  - "dbt_packages"

models:
  dbt_glue_project:
    +materialized: table
```
Example SQL model (`models/my_first_model.sql`):

```sql
{{ config(materialized='table', file_format='parquet') }}
SELECT
    1 AS id,
    'dbt-glue' AS name
```
Example Python model (`models/my_python_model.py`); requires AWS Glue 4.0+ and the Iceberg file format:

```python
def model(dbt, spark):
    dbt.config(
        materialized='incremental',
        file_format='iceberg',
        unique_key=['id'],
        incremental_strategy='merge'
    )
    if dbt.is_incremental():
        # On incremental runs, append one record after the current max id
        max_id_query = f"SELECT coalesce(max(id), 0) FROM {dbt.this}"
        max_id = spark.sql(max_id_query).collect()[0][0]
        return spark.createDataFrame([(max_id + 1, 'new_incremental_record')]) \
            .toDF("id", "name")
    else:
        # First run: seed the table with initial rows
        return spark.createDataFrame([(1, 'initial_record'), (2, 'another_initial')]) \
            .toDF("id", "name")
```
To run:

1. Ensure AWS credentials and the `DBT_ROLE_ARN` and `DBT_S3_LOCATION` environment variables are set.
2. Create the project structure: `~/.dbt/profiles.yml`, `dbt_project.yml`, `models/my_first_model.sql`, `models/my_python_model.py`.
3. Run `dbt debug` to verify the connection.
4. Run `dbt run`.