Collate SQL Lineage
Collate SQL Lineage is a Python tool designed to analyze SQL statements and extract their data lineage, mapping relationships between tables and columns. It builds upon `sqllineage` for core lineage parsing and `sqlfluff` for robust SQL parsing and linting. The library provides a programmatic interface for integrating lineage analysis into data governance, compliance, and impact analysis workflows. It is currently at version 2.1.0 and maintains an active development and release cadence.
Warnings
- breaking Version 2.0.0 introduced a significant API refactor, deprecating direct function calls for lineage analysis in favor of the `CollateSQLLineageRunner` class. Code written for versions prior to 2.0.0 will no longer work.
- gotcha Lineage extraction relies on `sqllineage` and `sqlfluff`, which may have limitations in fully parsing highly complex, non-standard, or niche SQL dialects. Edge cases or unsupported syntax might lead to incomplete or incorrect lineage results.
- gotcha While `collate-sqllineage` specifies dependency versions (`sqllineage`, `sqlfluff`), future updates to these underlying libraries might introduce subtle breaking changes in parsing behavior or output format that could impact lineage results, even if `collate-sqllineage`'s direct API remains stable.
Install
-
pip install collate-sqllineage
Imports
- CollateSQLLineageRunner
from collate_sqllineage.runner import CollateSQLLineageRunner
Quickstart
from collate_sqllineage.runner import CollateSQLLineageRunner
# Example SQL statement to analyze
sql_statement = """
INSERT INTO target_schema.target_table (col_a, col_b)
SELECT
source_schema.source_table_1.col_x,
source_schema.source_table_2.col_y
FROM
source_schema.source_table_1
JOIN
source_schema.source_table_2 ON source_schema.source_table_1.id = source_schema.source_table_2.id
WHERE
source_schema.source_table_1.status = 'active';
"""
# Initialize the runner. Configuration can be passed here (e.g., for verbose logging).
# runner = CollateSQLLineageRunner(config={'verbose': True})
runner = CollateSQLLineageRunner()
# Run the lineage analysis
lineage_result = runner.run(sql=sql_statement)
# Print the extracted lineage information
print("--- Extracted SQL Lineage ---")
print(f"Source Tables: {lineage_result.get('tables', {}).get('source', [])}")
print(f"Target Tables: {lineage_result.get('tables', {}).get('target', [])}")
# The 'lineage_result' dictionary contains more detailed information including columns, statements, etc.
# print(lineage_result)