SDMetrics
SDMetrics is an open-source Python library developed by DataCebo (part of the Synthetic Data Vault project) for evaluating the quality and efficacy of synthetic datasets. It provides a variety of metrics to compare synthetic data against real data across aspects like quality, privacy, and utility, and includes tools for generating comprehensive visual reports. The library is model-agnostic, allowing evaluation of synthetic data generated by any model. The current version is 0.28.0, with active and frequent releases.
Common errors
- TypeError: Expected a dictionary but received a <class 'sdv.metadata.SingleTableMetadata'> instead.
  - cause: An `sdv.metadata.SingleTableMetadata` (or similar SDV metadata object) was passed directly to an SDMetrics report method that expects a standard Python dictionary for metadata.
  - fix: Convert the SDV metadata object to a dictionary using its `.to_dict()` method: `report.generate(real_data, synthetic_data, sdv_metadata_object.to_dict())`.
- ValueError: Inputs contain NaN, infinity or a value too large for dtype('float64').
  - cause: This error often occurs when numerical data contains missing values (NaNs) or extreme values that a metric or underlying scikit-learn model cannot handle without prior processing.
  - fix: Pre-process your real and synthetic dataframes to handle missing values (e.g., imputation or removal) and outliers before passing them to SDMetrics. Check the metadata to ensure the correct `sdtype` for each column.
- KeyError: 'column_name not found'
  - cause: The specified 'column_name' in a metric computation (e.g., `CategoryCoverage.compute`) or a report configuration does not exist in the provided real or synthetic dataframes.
  - fix: Verify that the column names in your dataframes exactly match those referenced in your SDMetrics calls and the `metadata` dictionary.
- IncomputableMetricError: The metric cannot be computed with the given data.
  - cause: This generic error occurs when the data does not meet a metric's specific requirements (e.g., applying a numerical correlation metric to categorical data, or having too few data points).
  - fix: Review the documentation for the specific metric to understand its data requirements. Ensure each column's `sdtype` in the metadata accurately reflects its data type and that there is enough data for computation.
Warnings
- breaking SDMetrics dropped support for Python 3.8 starting from version 0.24.0. Ensure your environment uses Python 3.9 or newer.
- breaking SDMetrics pinned Pandas below version 3.0 in v0.26.0 to ensure compatibility. Direct usage with Pandas 3.x might lead to unexpected behavior or errors.
- gotcha `CorrelationSimilarity` can return a high score on noisy data with no real trends, because the near-zero correlations in the real and synthetic data are trivially 'similar'. A high score here does not mean meaningful correlations were preserved.
- gotcha When generating reports, if some metric computations fail, SDMetrics might report them as `NaN` (Not a Number) scores rather than explicit errors, potentially hiding underlying issues with data or metric configuration.
- gotcha Passing an `SDV` metadata object directly to `sdmetrics.reports` (e.g., `QualityReport.generate`) will raise a `TypeError`. SDMetrics expects a plain dictionary for metadata.
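Because failed metrics surface as `NaN` scores rather than exceptions, it is worth scanning a report's details for them explicitly. A pandas-only sketch, using a hypothetical details table shaped like the DataFrame `report.get_details()` returns (the exact columns vary by property):

```python
import numpy as np
import pandas as pd

# Hypothetical details table; report.get_details('Column Shapes')
# returns a similarly shaped DataFrame in SDMetrics.
details = pd.DataFrame({
    "Column": ["age", "city", "income"],
    "Metric": ["KSComplement", "TVComplement", "KSComplement"],
    "Score": [0.91, 0.87, np.nan],
})

# NaN scores mean a metric silently failed; surface those columns.
failed = details[details["Score"].isna()]
print(failed["Column"].tolist())
```

If any columns show up here, check their `sdtype` in the metadata and their raw values before trusting the overall score.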
Install
-
pip install sdmetrics
Imports
- load_demo
from sdmetrics import load_demo
- QualityReport
from sdmetrics.reports.single_table import QualityReport
- CategoryCoverage
from sdmetrics.single_column import CategoryCoverage
Quickstart
import pandas as pd
from sdmetrics import load_demo
from sdmetrics.reports.single_table import QualityReport
# Load demo data (real, synthetic, and metadata)
real_data, synthetic_data, metadata = load_demo(modality='single_table')
# Or create your own dataframes and metadata
# real_data = pd.DataFrame({'column1': [1, 2, 3], 'column2': ['A', 'B', 'C']})
# synthetic_data = pd.DataFrame({'column1': [1, 2, 2], 'column2': ['A', 'C', 'B']})
# metadata = {'columns': {'column1': {'sdtype': 'numerical'}, 'column2': {'sdtype': 'categorical'}}, 'primary_key': None}
# Create a QualityReport
report = QualityReport()
# Generate the report
report.generate(real_data, synthetic_data, metadata)
# Print the overall quality score
print(f"Overall Quality Score: {report.get_score() * 100:.2f}%")  # get_score() returns a value in [0, 1]
# Get a visualization for a specific property (e.g., 'Column Shapes')
# fig = report.get_visualization(property_name='Column Shapes')
# fig.show()
# Save the report
# report.save(filepath='demo_data_quality_report.pkl')
# To load later: loaded_report = QualityReport.load(filepath='demo_data_quality_report.pkl')