Collate SQL Lineage

2.1.0 · active · verified Sun Apr 12

Collate SQL Lineage is a Python tool designed to analyze SQL statements and extract their data lineage, mapping relationships between tables and columns. It builds upon `sqllineage` for core lineage parsing and `sqlfluff` for robust SQL parsing and linting. The library provides a programmatic interface for integrating lineage analysis into data governance, compliance, and impact analysis workflows. It is currently at version 2.1.0 and maintains an active development and release cadence.

Warnings

Install

Imports

Quickstart

This quickstart demonstrates how to initialize `CollateSQLLineageRunner` and use its `run()` method to extract data lineage from a sample SQL `INSERT` statement. The output includes source and target tables, with the full result containing more granular details like column-level lineage.

from collate_sqllineage.runner import CollateSQLLineageRunner

# Example SQL statement to analyze
sql_statement = """
INSERT INTO target_schema.target_table (col_a, col_b)
SELECT
    source_schema.source_table_1.col_x,
    source_schema.source_table_2.col_y
FROM
    source_schema.source_table_1
JOIN
    source_schema.source_table_2 ON source_schema.source_table_1.id = source_schema.source_table_2.id
WHERE
    source_schema.source_table_1.status = 'active';
"""

# Initialize the runner. Configuration can be passed here (e.g., for verbose logging).
# runner = CollateSQLLineageRunner(config={'verbose': True})
runner = CollateSQLLineageRunner()

# Run the lineage analysis
lineage_result = runner.run(sql=sql_statement)

# Print the extracted lineage information
print("--- Extracted SQL Lineage ---")
print(f"Source Tables: {lineage_result.get('tables', {}).get('source', [])}")
print(f"Target Tables: {lineage_result.get('tables', {}).get('target', [])}")

# The 'lineage_result' dictionary contains more detailed information including columns, statements, etc.
# print(lineage_result)

view raw JSON →