dlt (data load tool)
dlt is an open-source, Python-first, scalable data loading library that runs without any backend. It simplifies data ingestion from various sources into analytical destinations and handles schema evolution, retries, and state management. New versions are released frequently (approximately monthly) and often include breaking changes.
Warnings
- breaking: Pydantic v1 support was removed in dlt 1.22.0. The library now requires Pydantic v2.
- breaking: The legacy Streamlit-based pipeline dashboard (`dlt pipeline <pipeline_name> show`) was removed in dlt 1.23.0.
- breaking: As of dlt 1.24.0, custom resource metrics in the trace object are stored in a table format, which changes both their structure and their location.
- breaking: The semantics of the `data_type` contract changed in dlt 1.22.0. It now applies to the full data type (including precision and nullability), not just to variant columns.
- breaking: dlt 1.23.0 introduced a new compact source configuration lookup path (`sources.<name>.<key>`), which changes how configuration values are resolved.
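Because several of these changes land in consecutive releases, it can help to check which warnings apply before upgrading. The following is a minimal sketch in plain Python that encodes only the versions and changes listed above; the names `BREAKING_CHANGES`, `parse_version`, and `breaking_changes_between` are hypothetical helpers and not part of dlt.

```python
# Breaking changes from the warnings above, keyed by the dlt version that introduced them.
# This table is hand-copied from the list; it is not read from dlt itself.
BREAKING_CHANGES = {
    (1, 22, 0): [
        "Pydantic v1 support removed; Pydantic v2 is required",
        "`data_type` contract now applies to the full data type",
    ],
    (1, 23, 0): [
        "legacy Streamlit pipeline dashboard removed",
        "compact config lookup path `sources.<name>.<key>` introduced",
    ],
    (1, 24, 0): [
        "custom resource metrics in the trace stored in a table format",
    ],
}


def parse_version(text):
    """Turn '1.22.0' into (1, 22, 0). Pre-release suffixes are not handled."""
    return tuple(int(part) for part in text.split(".")[:3])


def breaking_changes_between(installed, target):
    """List the breaking changes introduced after `installed`, up to and including `target`."""
    low, high = parse_version(installed), parse_version(target)
    notes = []
    for version in sorted(BREAKING_CHANGES):
        if low < version <= high:
            label = ".".join(str(part) for part in version)
            notes.extend(f"{label}: {note}" for note in BREAKING_CHANGES[version])
    return notes


# Example: which warnings matter when moving from 1.21.0 to 1.23.0?
upgrade_notes = breaking_changes_between("1.21.0", "1.23.0")
for note in upgrade_notes:
    print(note)
```

In this example the upgrade crosses 1.22.0 and 1.23.0, so four notes are printed and the 1.24.0 change is not included.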
Install
- dlt
pip install dlt
- dlt with the DuckDB destination (used in the Quickstart)
pip install "dlt[duckdb]"
Imports
- dlt
import dlt
- pipeline
from dlt import pipeline
- source
from dlt import source
- resource
from dlt import resource
Quickstart
import os

import dlt
from dlt.sources.helpers import requests


# Define a dlt source using a decorator.
# owner and repo are bound on the source so the inner resource needs no arguments.
@dlt.source
def github_issues_source(owner, repo, token):
    # Define a dlt resource within the source
    @dlt.resource(write_disposition="append")
    def issues():
        # Send the Authorization header only when a token is set
        headers = {"Authorization": f"token {token}"} if token else {}
        url = f"https://api.github.com/repos/{owner}/{repo}/issues"
        # Fetching a single page for demonstration
        response = requests.get(url, headers=headers, params={"per_page": 10})
        response.raise_for_status()
        yield response.json()

    return issues


# Instantiate and run the pipeline
pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",  # Or "bigquery", "snowflake", etc.
    dataset_name="github_data",
)
load_info = pipeline.run(
    github_issues_source("dlt-hub", "dlt", token=os.environ.get("GITHUB_TOKEN", ""))
)

# Print the outcome
print(load_info)