CSV Diff
csv-diff is a Python CLI tool and library for comparing the contents of two CSV, TSV, or JSON files. It reports added, removed, and changed rows based on a specified key, and ignores cosmetic differences such as row and column ordering. The library is actively maintained; the current version is 1.2.
Warnings
- gotcha Always specify a unique key: `--key` on the CLI, or the `key` parameter of `load_csv`. Without one, `csv-diff` cannot match a row to its earlier version, so results can be misleading (for example, an edited row may be reported as a removal plus an addition instead of a change). Version 1.0 included a fix for running without a key.
- breaking Before version 1.0, column names containing a `.` character triggered a bug. The fix in 1.0 may change diff results for files that hit this bug in earlier versions.
- gotcha `csv-diff` attempts to detect the input format automatically, but detection can guess wrong on ambiguous files. It is safer to state the format explicitly: `--format=csv`, `--format=tsv`, or `--format=json` on the CLI (JSON input in particular should be declared this way), or the matching loader (`load_csv` or `load_json`) when loading data programmatically.
- deprecated The human-readable CLI output changed significantly in versions 0.2 and 0.3.1 (for example, the order of output and the amount of detail shown). Scripts that parse the plain-text output of older versions may break or produce incorrect results with newer ones; the `--json` option gives machine-readable output instead.
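The key warning above can be illustrated with a short standard-library sketch. It does not call csv-diff; `index_rows` and `classify` are hypothetical helpers written for this illustration. It assumes that, when no key is given, the only identity available for a row is a hash of its full content. Under that assumption, an edit to one cell makes the row look like a different row.

```python
import csv
import hashlib
import io
import json


def index_rows(text, key=None):
    """Index CSV rows by a key column, or by a hash of the whole row if no key is given."""
    rows = list(csv.DictReader(io.StringIO(text)))
    if key:
        return {row[key]: row for row in rows}
    # No key (assumption for this sketch): identify each row by its full content.
    return {
        hashlib.sha1(json.dumps(row, sort_keys=True).encode("utf8")).hexdigest(): row
        for row in rows
    }


def classify(previous, current):
    """Count added, removed, and changed rows between two indexed tables."""
    added = [k for k in current if k not in previous]
    removed = [k for k in previous if k not in current]
    changed = [k for k in current if k in previous and current[k] != previous[k]]
    return {"added": len(added), "removed": len(removed), "changed": len(changed)}


old = "id,name,age\n1,Alice,30\n2,Bob,24\n"
new = "id,name,age\n1,Alice,31\n2,Bob,24\n"

with_key = classify(index_rows(old, key="id"), index_rows(new, key="id"))
without_key = classify(index_rows(old), index_rows(new))

print(with_key)     # {'added': 0, 'removed': 0, 'changed': 1}
print(without_key)  # {'added': 1, 'removed': 1, 'changed': 0}
```

With the `id` key, Alice's age edit is a single change. Without it, the same edit appears as one removed row and one added row.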
Install
pip install csv-diff
Imports
- load_csv
from csv_diff import load_csv, compare
- compare
from csv_diff import load_csv, compare
Quickstart
import io
from csv_diff import load_csv, compare
# Simulate two CSV files as in-memory strings
csv1_data = """id,name,age
1,Alice,30
2,Bob,24
3,Charlie,35"""
csv2_data = """id,name,age
1,Alice,31
3,Charlie,35
4,David,28"""
# Load the CSV data, specifying the key column
csv1 = load_csv(io.StringIO(csv1_data), key="id")
csv2 = load_csv(io.StringIO(csv2_data), key="id")
# Compare the two CSVs
diff = compare(csv1, csv2)
# Print the detected differences
print(f"Added rows: {diff.get('added')}")
print(f"Removed rows: {diff.get('removed')}")
print(f"Changed rows: {diff.get('changed')}")
print(f"Columns added: {diff.get('columns_added')}")
print(f"Columns removed: {diff.get('columns_removed')}")
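Once the result is available, it is often easier to walk it entry by entry than to print each list whole. The sketch below assumes the result has the shape shown in the project's README: `added` and `removed` are lists of row dictionaries, and each `changed` entry has a `key` plus a `changes` mapping from column name to `[old, new]`. `summarize_diff` is a hypothetical helper, not part of csv-diff, and the sample dictionary is written by hand so the block runs without the library installed.

```python
def summarize_diff(diff):
    """Return one readable line per difference in a csv-diff style result dict."""
    lines = []
    for row in diff.get("added", []):
        lines.append(f"added: {row}")
    for row in diff.get("removed", []):
        lines.append(f"removed: {row}")
    for entry in diff.get("changed", []):
        # Each change maps a column name to its [old, new] pair (assumed shape).
        for column, (before, after) in entry["changes"].items():
            lines.append(f"changed: key={entry['key']} {column}: {before!r} -> {after!r}")
    return lines


# Hand-written sample in the assumed shape; values are strings, as read from CSV.
sample = {
    "added": [{"id": "4", "name": "David", "age": "28"}],
    "removed": [{"id": "2", "name": "Bob", "age": "24"}],
    "changed": [{"key": "1", "changes": {"age": ["30", "31"]}}],
    "columns_added": [],
    "columns_removed": [],
}

for line in summarize_diff(sample):
    print(line)
```

Passing the `diff` value from the quickstart in place of `sample` should work the same way, provided the installed version returns this shape.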