DLT-META Framework
DLT-META is a metadata-driven framework for Databricks Lakeflow Declarative Pipelines, designed to automate the creation and management of bronze and silver data pipelines. It leverages metadata defined in JSON or YAML files to dynamically generate pipeline code, streamlining data engineering workflows. The library is currently at version 0.0.10 and is actively maintained, though its release cycle is irregular.
Common errors
- ImportError: Could not import DataflowPipeline from dlt_meta. Ensure the 'dlt-meta' library is installed and available.
  cause: The `dlt-meta` package is not installed or not accessible in the Python environment where the code runs. This often happens in Databricks notebooks when `%pip install dlt-meta` was not executed, or when running locally without installing the package.
  fix: Install with `pip install dlt-meta`. In Databricks notebooks, run `%pip install dlt-meta` at the start of the notebook. For a Databricks job packaged as a Python wheel, declare `dlt_meta` as a dependent library.
- com.databricks.pipelines.common.errors.DLTAnalysisException: Materializing tables in custom schemas is not supported. Please remove the database qualifier from table 'table_name'.
  cause: This error occurs in dlt-meta v0.0.10 and later due to the 'Multi-Level Namespace Changes' breaking change. Custom schema qualification (e.g., `database.schema.table`) in table names within your metadata is no longer supported.
  fix: Update your metadata files (JSON/YAML) to remove the database qualifier from table names. Define tables without the database prefix, e.g., `schema.table` or just `table` if the schema is handled implicitly.
- CDF metadata columns (_change_type, _commit_version, _commit_timestamp) are lost after importing dlt (or dlt-meta, which uses dlt internally).
  cause: Importing the `dlt` module (which `dlt-meta` leverages internally) can alter the default behavior of reading CDF-enabled tables, sometimes preventing these reserved metadata columns from being exposed due to conflicts or internal handling.
  fix: If you need these columns, read the CDF-enabled table *before* importing any `dlt` or `dlt_meta` modules. Alternatively, within a DLT pipeline, use the `except_column_list` parameter to explicitly exclude these columns, or ensure reserved column names do not conflict with your source table's schema.
- Pipeline fails due to schema evolution surprises or unexpected column changes in upstream sources.
  cause: The pipeline's schema expectations are not aligned with changes in the source data (e.g., columns added, removed, or renamed without corresponding metadata updates).
  fix: Enable Delta's `mergeSchema` on writes where appropriate. Implement schema-drift detection jobs to monitor source schema changes and update the `dlt-meta` metadata (DataflowSpec) promptly. Validate onboarding JSON/YAML against a predefined schema.
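The schema-drift detection suggested above can be sketched as a plain comparison between the expected column mapping (taken from your metadata) and the schema observed in the source. The function and field names here are illustrative, not part of the dlt-meta API:

```python
def detect_schema_drift(expected, actual):
    """Compare an expected column->type mapping against the observed schema.

    Returns a dict of added, removed, and type-changed columns so a monitoring
    job can alert and the metadata can be updated before the pipeline fails.
    """
    added = sorted(set(actual) - set(expected))
    removed = sorted(set(expected) - set(actual))
    changed = sorted(
        col for col in set(expected) & set(actual) if expected[col] != actual[col]
    )
    return {"added": added, "removed": removed, "type_changed": changed}


# Example: upstream renamed 'amount' to 'amount_usd' and added 'discount'.
expected = {"id": "bigint", "amount": "double", "ts": "timestamp"}
actual = {"id": "bigint", "amount_usd": "double", "discount": "double", "ts": "timestamp"}

drift = detect_schema_drift(expected, actual)
print(drift)
# → {'added': ['amount_usd', 'discount'], 'removed': ['amount'], 'type_changed': []}
```

Any non-empty result is a signal to update the DataflowSpec metadata before the next pipeline run rather than letting the run fail mid-flight.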
Warnings
- breaking: The DPM (Direct Publishing Mode) flag was removed in v0.0.10. Pipelines using DPM mode in v0.0.9 must be migrated to the default publishing mode before upgrading. This change is metadata-only but irreversible.
- breaking: Multi-Level Namespace Changes in v0.0.10. Custom schema qualification in table names is no longer supported; tables must be created without database qualifiers.
- breaking: Argument changes for `invoke_dlt_pipeline` in v0.0.10. Method arguments now require `bronze_` or `silver_` prefixes to support `apply_changes_from_snapshot` in both layers.
- gotcha: DLT-META is a Databricks Labs project provided for exploration only. Databricks does not formally support it or provide SLAs. Do not submit Databricks support tickets for issues; file a GitHub issue instead.
- gotcha: Malformed JSON/YAML metadata can cause job failures; the framework relies heavily on correct metadata structure and content.
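Given the multi-level namespace change above, metadata written for earlier versions may still contain three-part table names. A small, hypothetical migration helper (not part of dlt-meta) that normalizes such names might look like:

```python
def strip_database_qualifier(table_name: str) -> str:
    """Drop a leading database/catalog qualifier from a dotted table identifier.

    'db.schema.table' -> 'schema.table'; two-part and bare names pass through.
    Assumes simple dotted names; backtick-quoted identifiers containing
    literal dots would need more careful parsing.
    """
    parts = table_name.split(".")
    return ".".join(parts[-2:]) if len(parts) > 2 else table_name


print(strip_database_qualifier("analytics.sales.orders"))  # → sales.orders
print(strip_database_qualifier("sales.orders"))            # → sales.orders
print(strip_database_qualifier("orders"))                  # → orders
```

Running a pass like this over the table-name fields of your onboarding JSON/YAML before upgrading to v0.0.10 avoids the `DLTAnalysisException` described under Common errors.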
Install
- pip install dlt-meta
Imports
- DataflowPipeline
from dlt_meta import DataflowPipeline
# alternative, fully qualified module path:
from dlt_meta.src.dataflow_pipeline import DataflowPipeline
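Since both import paths above exist, a defensive import can try the public name first and fall back to the module path. This is an illustrative pattern, not the library's documented API:

```python
# Try the top-level name first, then the internal module path.
# Both imports fail (leaving None) if dlt-meta is not installed at all.
try:
    from dlt_meta import DataflowPipeline
except ImportError:
    try:
        from dlt_meta.src.dataflow_pipeline import DataflowPipeline
    except ImportError:
        DataflowPipeline = None  # install dlt-meta before proceeding

print("DataflowPipeline available:", DataflowPipeline is not None)
```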
Quickstart
# This code typically runs within a Databricks Notebook or job after metadata onboarding.
# Ensure 'dlt-meta' is installed via %pip install dlt-meta in the notebook or as a cluster library.
import os

try:
    import dlt  # provided by the DLT runtime
    from dlt_meta import DataflowPipeline
except ImportError:
    print("ERROR: Could not import DataflowPipeline from dlt_meta. "
          "Ensure the 'dlt-meta' library is installed and available.")
    raise

# These parameters are typically passed as job parameters in Databricks.
# For local testing, set environment variables or hardcode them.
layer = os.environ.get('DLT_META_LAYER', 'bronze').lower()  # 'bronze' or 'silver'
env = os.environ.get('DLT_META_ENV', 'dev').lower()         # 'dev', 'qa', 'prod'

# In a Databricks environment, the 'spark' session is implicitly available.
# For local testing outside Databricks, initialize a SparkSession yourself
# (not typical for DLT-META's primary use case):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("dlt-meta-local").getOrCreate()

try:
    print(f"Invoking DLT-META for layer: {layer} (env: {env}).")
    # 'spark' is expected to be the Databricks SparkSession.
    DataflowPipeline.invoke_dlt_pipeline(spark=spark, layer=layer, env=env)
    print(f"DLT-META successfully invoked for layer: {layer} (env: {env}).")
except Exception as e:
    print(f"ERROR: DLT-META pipeline invocation failed for layer '{layer}' in env '{env}': {e}")
    raise
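Because malformed onboarding metadata is a common failure mode (see Warnings), a lightweight pre-flight check before invoking the pipeline can fail fast with a readable message instead of a mid-run job failure. The required keys below are illustrative placeholders; consult your onboarding file format for the real set:

```python
import json

# Illustrative required keys, NOT dlt-meta's actual onboarding schema.
REQUIRED_KEYS = {"data_flow_id", "source_details", "bronze_table"}

def validate_onboarding_json(text: str) -> list:
    """Return a list of human-readable problems; an empty list means it passed."""
    try:
        records = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"malformed JSON: {exc}"]
    if not isinstance(records, list):
        return ["top-level value must be a list of dataflow specs"]
    problems = []
    for i, record in enumerate(records):
        if not isinstance(record, dict):
            problems.append(f"record {i}: expected an object")
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append(f"record {i}: missing keys {sorted(missing)}")
    return problems


good = '[{"data_flow_id": "100", "source_details": {}, "bronze_table": "orders"}]'
bad = '[{"data_flow_id": "101"}]'
print(validate_onboarding_json(good))  # → []
print(validate_onboarding_json(bad))
# → ["record 0: missing keys ['bronze_table', 'source_details']"]
```

For stricter guarantees, the same idea extends naturally to a full JSON Schema check of the onboarding file.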