TDDA: Test-Driven Data Analysis
TDDA (Test-Driven Data Analysis) is a Python library and set of command-line tools designed to improve the correctness and robustness of data analysis. It provides features for reference testing of data pipelines, automatic discovery and verification of data constraints, anomaly detection, and inference of regular expressions from text data (Rexpy). Additionally, from version 2.0, it includes features for automatic test generation (Gentest) for command-line programs. It currently supports Python >=3.8 and is actively maintained, with version 2.2.17 being the latest stable release.
Warnings
- breaking Python 2.7 support has been dropped. The library previously supported Python 2.7, but current versions (>=2.0) explicitly require Python >=3.8. Older codebases targeting Python 2.7 will break if upgrading `tdda` without migrating their Python environment.
- deprecated The `WritableTestCase` class for reference testing has been superseded by `ReferenceTest` and its `unittest` wrapper `ReferenceTestCase`. While `WritableTestCase` may still exist for backward compatibility in some older versions, new development should use `ReferenceTestCase` for improved features and maintainability.
- gotcha Many features, particularly for constraint generation and verification against various data sources (databases, Feather files), rely on optional external dependencies (e.g., `pandas`, `feather-format`, database drivers). These packages are not installed by default with `pip install tdda` and must be installed separately if their corresponding functionality is required.
- gotcha When installing `feather-format` on Windows, you may encounter issues requiring `cython` and the Microsoft Visual C++ compiler for Python. This is a common prerequisite for many Python packages with C extensions on Windows.
- gotcha The acronym "TDDA" is used by several unrelated projects (e.g., Java Thread Dump Analyzer, The Drug Detection Agency, Topological Data Analysis). This can lead to confusion when searching for documentation, examples, or discussing the Python `tdda` library. Ensure you are referencing the correct project.
Install
pip install tdda
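Several features depend on optional extras that the base install does not pull in (see Warnings above). A sketch of a fuller installation; the package names assume the PyPI distributions mentioned in the warnings:

```shell
pip install tdda
# Optional, only needed for the corresponding features:
pip install pandas           # DataFrame constraint discovery/verification
pip install feather-format   # Feather file support
```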
Imports
- discover_df
from tdda.constraints import discover_df
- verify_df
from tdda.constraints import verify_df
- ReferenceTestCase
from tdda.referencetest import ReferenceTestCase
Quickstart
import pandas as pd
from tdda.constraints import discover_df, verify_df
import os
# Create a sample DataFrame
data = {
'col1': [1, 2, 3, 4, 5, None],
'col2': ['A', 'B', 'A', 'C', 'B', 'D'],
'col3': [10.1, 11.2, 10.1, 13.4, 15.5, 12.3]
}
df = pd.DataFrame(data)
# 1. Discover constraints from the DataFrame
constraints = discover_df(df)
# Constraints object has a to_json() method to save them
constraints_filename = 'my_dataframe_constraints.tdda'
with open(constraints_filename, 'w') as f:
    f.write(constraints.to_json())
print(f"Constraints discovered and saved to {constraints_filename}")
# 2. Verify a (potentially new or modified) DataFrame against the constraints
# Let's create a slightly different DataFrame for verification
df_to_verify = pd.DataFrame({
'col1': [1, 2, 3, 6, 5, 7],
'col2': ['A', 'B', 'A', 'C', 'B', 'E'],
'col3': [10.1, 11.2, 10.1, 13.0, 15.5, 12.0]
})
verification_result = verify_df(df_to_verify, constraints_filename)
print("\nVerification Results:")
print(f"Passed constraints: {verification_result.passes}")
print(f"Failed constraints: {verification_result.failures}")
if verification_result.failures > 0:
    print("Details of failed constraints:")
    print(verification_result.to_frame())
# Clean up the generated constraints file
os.remove(constraints_filename)