{"id":7146,"library":"dbt-glue","title":"dbt-glue adapter for AWS Glue","description":"dbt-glue is a dbt adapter that enables data analysts and engineers to transform data using AWS Glue's Spark engine and interactive sessions. It supports various file formats, including Apache Iceberg, Delta Lake, and Apache Hudi, allowing users to build and manage data pipelines in an AWS data lake environment. The library is actively maintained with frequent updates, aligning with `dbt-core` releases. It is currently at version 1.10.19 and requires Python >=3.9.","status":"active","version":"1.10.19","language":"en","source_language":"en","source_url":"https://github.com/aws-samples/dbt-glue","tags":["dbt","AWS Glue","data transformation","ETL","data lake","Spark","Iceberg","Delta Lake","Hudi"],"install":[{"cmd":"pip install dbt-core dbt-glue","lang":"bash","label":"Install dbt-glue with dbt-core"}],"dependencies":[{"reason":"dbt-glue is an adapter for dbt Core, which provides the main CLI and framework.","package":"dbt-core","optional":false},{"reason":"dbt-glue leverages dbt-spark for its underlying Spark compatibility, especially in earlier versions.","package":"dbt-spark","optional":true}],"imports":[{"note":"Used within Python models for model configuration.","symbol":"dbt.config","correct":"def model(dbt, spark):\n    dbt.config(materialized='table', file_format='iceberg')"},{"note":"Used in dbt SQL and Python models to reference other dbt models.","symbol":"dbt.ref","correct":"source_df = dbt.ref(\"my_sql_model\")"},{"note":"Used in dbt SQL and Python models to reference declared data sources.","symbol":"dbt.source","correct":"raw_data = dbt.source('my_source', 'my_table')"}],"quickstart":{"code":"# profiles.yml example for dbt-glue\nprofiles_yml_content = \"\"\"\ndbt_glue_project:\n  target: dev\n  outputs:\n    dev:\n      type: glue\n      query-comment: dbt-glue-example\n      role_arn: \"{{ env_var('DBT_ROLE_ARN', 
'arn:aws:iam::123456789012:role/GlueInteractiveSessionRole') }}\"\n      region: \"{{ env_var('AWS_REGION', 'us-east-1') }}\"\n      workers: 5\n      worker_type: G.1X\n      schema: dbt_glue_demo_schema\n      database: dbt_glue_demo_db\n      session_provisioning_timeout_in_seconds: 120\n      location: \"{{ env_var('DBT_S3_LOCATION', 's3://your-s3-bucket/dbt-glue/') }}\"\n      glue_version: \"4.0\"\n      conf: \"--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog --conf spark.sql.catalog.glue_catalog.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog --conf spark.sql.catalog.glue_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO --conf spark.sql.catalog.glue_catalog.lock-impl=org.apache.iceberg.aws.dynamodb.DynamoDbLockManager --conf spark.sql.catalog.glue_catalog.lock.table=DbtGlueLockTable --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions\"\n\"\"\"\n\n# Example dbt_project.yml (assuming project name 'dbt_glue_project')\ndbt_project_yml_content = \"\"\"\nname: 'dbt_glue_project'\nversion: '1.0.0'\nconfig-version: 2\n\nprofile: 'dbt_glue_project'\n\nmodel-paths: [\"models\"]\nanalysis-paths: [\"analyses\"]\ntest-paths: [\"tests\"]\nseed-paths: [\"seeds\"]\nmacro-paths: [\"macros\"]\nsnapshot-paths: [\"snapshots\"]\n\ntarget-path: \"target\"  # directory which will store compiled SQL files\nclean-targets:\n  - \"target\"\n  - \"dbt_packages\"\n\nmodels:\n  dbt_glue_project:\n    +materialized: table\n\"\"\"\n\n# Example SQL model (models/my_first_model.sql)\nsql_model_content = \"\"\"\n{{ config(materialized='table', file_format='parquet') }}\n\nSELECT\n  1 as id,\n  'dbt-glue' as name\n\"\"\"\n\n# Example Python model (models/my_python_model.py) - requires AWS Glue 4.0+ and Iceberg\npython_model_content = \"\"\"\ndef model(dbt, spark):\n    dbt.config(\n        materialized='incremental',\n        file_format='iceberg',\n        unique_key=['id'],\n        incremental_strategy='merge'\n    )\n\n    if dbt.is_incremental:\n        max_id_query = f\"SELECT coalesce(max(id), 0) FROM {dbt.this}\"\n        max_id = spark.sql(max_id_query).collect()[0][0]\n        return spark.createDataFrame([(max_id + 1, 'new_incremental_record')]) \\\n               .toDF(\"id\", \"name\")\n    else:\n        return spark.createDataFrame([(1, 'initial_record'), (2, 'another_initial')]) \\\n               .toDF(\"id\", \"name\")\n\"\"\"\n\n# To run:\n# 1. Ensure AWS credentials and DBT_ROLE_ARN, DBT_S3_LOCATION environment variables are set.\n# 2. Create the project structure: ~/.dbt/profiles.yml, dbt_project.yml, models/my_first_model.sql, models/my_python_model.py\n# 3. dbt debug\n# 4. dbt run","lang":"python","description":"To get started with `dbt-glue`, you'll need to configure your `profiles.yml` to specify connection details for AWS Glue interactive sessions, including the IAM role, region, worker types, and S3 location. SQL models define transformations, and experimental Python models allow for more complex logic using PySpark DataFrames. Ensure `DBT_ROLE_ARN` and `DBT_S3_LOCATION` environment variables are set for authentication and storage paths respectively."},"warnings":[{"fix":"Always install `dbt-core` and `dbt-glue` together: `pip install dbt-core dbt-glue`.","message":"Beginning with dbt Core v1.8, installing a dbt adapter no longer automatically installs `dbt-core`. You must explicitly install both `dbt-core` and `dbt-glue` to avoid missing dependencies or version conflicts.","severity":"breaking","affected_versions":"dbt-core >=1.8.0, dbt-glue >=1.8.0"},{"fix":"Review the official dbt-glue documentation and GitHub README for the latest status and specific requirements before relying on these experimental features in production. Ensure your AWS Glue environment is version 4.0 or higher.","message":"Python model support and Amazon S3 Tables support are currently experimental. 
They may have limitations or breaking changes in future versions, require AWS Glue 4.0+ for optimal support, and require the Iceberg file format for Python models.","severity":"gotcha","affected_versions":"All versions with Python/S3 Tables support (from 1.10.9 onwards)"},{"fix":"For issues with table discovery, particularly in tests, ensure Iceberg-specific Spark configurations are correctly set in `profiles.yml` and consider explicitly using the `glue_catalog.` prefix where direct table references are made.","message":"When working with Iceberg tables on AWS Glue, especially in dbt tests or certain queries, you might need to explicitly prefix table names with `glue_catalog.` (e.g., `glue_catalog.your_database.your_table`) in custom SQL or specific configurations if not handled automatically by the adapter's macros.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Update your YAML files to remove duplicate keys and ensure proper Jinja syntax. Use `--select` or `-s` instead of `--models` for CLI commands. If using `--warn-error`, configure `warn-error-options` to handle deprecations appropriately.","message":"dbt Core v1.10 introduces deprecation warnings for several patterns, including duplicate keys in the same YAML file, unexpected Jinja blocks, and the `--models` / `--model` / `-m` CLI flags (which were renamed to `--select` / `-s` in v0.21).","severity":"deprecated","affected_versions":"dbt-core >=1.10.0, dbt-glue >=1.10.0"},{"fix":"Ensure the IAM role associated with your Glue jobs has the necessary permissions for S3 access, Glue Data Catalog operations, and Lake Formation if applicable. 
A least-privileged policy example is often available in the dbt-glue documentation.","message":"Incorrect IAM permissions for the Glue interactive session role can lead to `AccessDeniedException` errors, preventing dbt-glue from accessing S3 buckets or the Glue Data Catalog.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Ensure your `profiles.yml` includes the correct `conf` for Iceberg Spark extensions. For custom queries or tests, you may need to explicitly reference tables as `glue_catalog.your_database.your_table`. You can also set `--conf spark.sql.defaultCatalog=glue_catalog` in your profile.","cause":"This error often occurs when dbt tries to access an Iceberg table on Glue without the necessary `glue_catalog.` prefix, or if Iceberg Spark extensions are not properly configured in the dbt profile.","error":"AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table my_table. StorageDescriptor#InputFormat cannot be null for table: my_table"},{"fix":"Simplify seed data, especially for testing. Avoid complex structures like nested JSON within CSV seeds if possible. Ensure that the generated Spark code correctly handles `null` values or adjust the seed data format to be simpler for initial ingestion.","cause":"When building seeds, especially with CSVs containing nested JSON strings or complex data types, Spark (used by Glue) may misinterpret `null` values or the data format.","error":"NameError: name 'null' is not defined"},{"fix":"Increase `session_provisioning_timeout_in_seconds` in your `profiles.yml`. Review your `workers` and `worker_type` settings to ensure they are appropriate for your workload and available in your AWS region. 
Check AWS Glue service quotas and status.","cause":"The AWS Glue interactive session took too long to provision, often due to high demand, incorrect worker configurations, or network issues.","error":"Error in dbt run: Connection timeout for Glue interactive session provisioning"}]}