dbt-glue adapter for AWS Glue

1.10.19 · active · verified Thu Apr 16

dbt-glue is a dbt adapter that enables data analysts and engineers to transform data using AWS Glue's Spark engine and interactive sessions. It supports open table formats including Apache Iceberg, Delta Lake, and Apache Hudi (selected via the model-level `file_format` config), letting users build and manage data pipelines in an AWS data lake environment. The library is actively maintained, with releases tracking `dbt-core`; it is currently at version 1.10.19 and requires Python >=3.9.

Install

pip install dbt-glue

Imports

dbt-glue is not imported directly in user code: dbt Core loads the adapter automatically when a profile sets `type: glue`.

Quickstart

To get started with `dbt-glue`, you'll need to configure your `profiles.yml` to specify connection details for AWS Glue interactive sessions, including the IAM role, region, worker types, and S3 location. SQL models define transformations, and experimental Python models allow for more complex logic using PySpark DataFrames. Ensure `DBT_ROLE_ARN` and `DBT_S3_LOCATION` environment variables are set for authentication and storage paths respectively.

# profiles.yml example for dbt-glue
profiles_yml_content = """
dbt_glue_project:
  target: dev
  outputs:
    dev:
      type: glue
      query-comment: dbt-glue-example
      role_arn: "{{ env_var('DBT_ROLE_ARN', 'arn:aws:iam::123456789012:role/GlueInteractiveSessionRole') }}"
      region: "{{ env_var('AWS_REGION', 'us-east-1') }}"
      workers: 5
      worker_type: G.1X
      schema: dbt_glue_demo_schema
      database: dbt_glue_demo_db
      session_provisioning_timeout_in_seconds: 120
      location: "{{ env_var('DBT_S3_LOCATION', 's3://your-s3-bucket/dbt-glue/') }}"
      glue_version: "4.0"
      conf: >-
        --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog
        --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog
        --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO
        --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.glue.DynamoLockManager
        --conf spark.sql.catalog.glue_catalog.lock.table=DbtGlueLockTable
        --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
"""
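The `conf` value above is one long space-joined string of `--conf key=value` pairs. If you generate `profiles.yml` programmatically, it can be less error-prone to assemble that string from a dict; a sketch, reusing the same Iceberg settings (variable names here are illustrative):

```python
# Build the Spark `--conf` string for the profile from a dict instead of
# maintaining one long line. Keys/values mirror the Iceberg settings above.
iceberg_conf = {
    "spark.sql.catalog.glue_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.glue_catalog.catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
    "spark.sql.catalog.glue_catalog.io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    "spark.sql.catalog.glue_catalog.lock-impl": "org.apache.iceberg.aws.glue.DynamoLockManager",
    "spark.sql.catalog.glue_catalog.lock.table": "DbtGlueLockTable",
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
}

# Dicts preserve insertion order (Python 3.7+), so the pairs come out in
# the same order they were declared above.
conf_string = " ".join(f"--conf {key}={value}" for key, value in iceberg_conf.items())
print(conf_string)
```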

# Example dbt_project.yml (assuming project name 'dbt_glue_project')
dbt_project_yml_content = """
name: 'dbt_glue_project'
version: '1.0.0'
config-version: 2

profile: 'dbt_glue_project'

model-paths: ["models"]
analysis-paths: ["analyses"]
test-paths: ["tests"]
seed-paths: ["seeds"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target"  # directory which will store compiled SQL files
clean-targets:
  - "target"
  - "dbt_packages"

models:
  dbt_glue_project:
    +materialized: table
"""

# Example SQL model (models/my_first_model.sql)
sql_model_content = """
{{ config(materialized='table', file_format='parquet') }}

SELECT
  1 as id,
  'dbt-glue' as name
"""

# Example Python model (models/my_python_model.py) - requires AWS Glue 4.0+ and Iceberg
python_model_content = """
from pyspark.sql.functions import lit

def model(dbt, spark):
    dbt.config(
        materialized='incremental',
        file_format='iceberg',
        unique_key=['id'],
        incremental_strategy='merge'
    )

    if dbt.is_incremental():
        max_id_query = f"SELECT coalesce(max(id), 0) FROM {dbt.this}"
        max_id = spark.sql(max_id_query).collect()[0][0]
        return spark.createDataFrame([(max_id + 1, 'new_incremental_record')]) \
               .toDF("id", "name")
    else:
        return spark.createDataFrame([(1, 'initial_record'), (2, 'another_initial')]) \
               .toDF("id", "name")
"""
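For intuition, the `merge` strategy with `unique_key=['id']` upserts: a new row whose `id` matches an existing target row replaces it, and non-matching rows are appended. A plain-Python sketch of that behavior (not the adapter's actual implementation; the helper name is illustrative):

```python
def merge_on_unique_key(existing_rows, new_rows, key="id"):
    """Upsert new_rows into existing_rows on `key`, mimicking what an
    incremental MERGE with unique_key=['id'] does to the target table."""
    merged = {row[key]: row for row in existing_rows}
    for row in new_rows:
        merged[row[key]] = row  # key match -> update, no match -> insert
    return sorted(merged.values(), key=lambda r: r[key])

target = [{"id": 1, "name": "initial_record"}, {"id": 2, "name": "another_initial"}]
increment = [{"id": 2, "name": "updated"}, {"id": 3, "name": "new_incremental_record"}]
print(merge_on_unique_key(target, increment))
# id 2 is overwritten, id 3 is appended, id 1 is untouched
```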

# To run: 
# 1. Ensure AWS credentials and DBT_ROLE_ARN, DBT_S3_LOCATION environment variables are set.
# 2. Create the project structure: ~/.dbt/profiles.yml, dbt_project.yml, models/my_first_model.sql, models/my_python_model.py
# 3. dbt debug
# 4. dbt run
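The project layout in step 2 can be scaffolded with a few lines of stdlib Python. The `scaffold` helper below is illustrative, and the file contents are placeholders standing in for the string variables defined earlier (note that `profiles.yml` conventionally lives in `~/.dbt/`, not in the project directory):

```python
from pathlib import Path

def scaffold(root, files):
    """Write each relative-path -> content pair under `root`,
    creating intermediate directories as needed."""
    for rel_path, content in files.items():
        path = Path(root) / rel_path
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(content)

scaffold("dbt_glue_project", {
    "dbt_project.yml": "name: 'dbt_glue_project'\n",   # use dbt_project_yml_content above
    "models/my_first_model.sql": "SELECT 1 AS id\n",   # use sql_model_content above
    "models/my_python_model.py": "# python model\n",   # use python_model_content above
})
```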
