dlt (data load tool)

1.24.0 · active · verified Thu Apr 09

dlt is an open-source, Python-first, scalable data loading library that requires no backend to run. It simplifies data ingestion from a variety of sources into analytical destinations, handling schema evolution, retries, and state management. dlt releases new versions frequently (roughly monthly), often including breaking changes.

Warnings

Install

pip install "dlt[duckdb]"

Imports

import dlt

Quickstart

This quickstart creates a simple dlt pipeline that extracts GitHub issues and loads them into a DuckDB destination. Ensure the DuckDB extra is installed (`pip install "dlt[duckdb]"`) and set a `GITHUB_TOKEN` environment variable if you hit rate limits.

import dlt
import os
from dlt.sources.helpers import requests

# Define a dlt source using a decorator
@dlt.source
def github_issues_source(owner, repo, token=None):
    # Define a dlt resource within the source
    @dlt.resource(write_disposition="append")
    def issues():
        # Only send the Authorization header when a token is provided;
        # an empty "token " value is rejected by the GitHub API
        headers = {"Authorization": f"token {token}"} if token else {}
        url = f"https://api.github.com/repos/{owner}/{repo}/issues"
        # Fetch a single page for demonstration
        response = requests.get(url, headers=headers, params={"per_page": 10})
        response.raise_for_status()
        yield response.json()

    return issues

# Instantiate and run the pipeline
pipeline = dlt.pipeline(
    pipeline_name="github_pipeline",
    destination="duckdb",  # or "bigquery", "snowflake", etc.
    dataset_name="github_data",
)

load_info = pipeline.run(
    github_issues_source("dlt-hub", "dlt", token=os.environ.get("GITHUB_TOKEN"))
)

# Print the outcome
print(load_info)
