SDV: Synthetic Data Vault
SDV (Synthetic Data Vault) is a Python library that allows users to generate synthetic data for various data types, including single tables, multi-table relational datasets, and sequential data. It provides a range of models and tools to create high-quality synthetic data that preserves the statistical properties and privacy of the original data. As of version 1.36.0, it continues to be actively developed, with a regular release cadence to add new features and improve existing models.
Common errors
- ModuleNotFoundError: No module named 'sdv.models'
  - cause: Importing a synthesizer from an old module path that was removed in the SDV 1.0 API restructuring.
  - fix: Change your import from `from sdv.models import SynthesizerName` to `from sdv.single_table import SynthesizerName` (or `multi_table`/`sequential` as appropriate).
- ValueError: The column '...' contains unsupported data types. Supported data types are numeric, boolean, datetime, and categorical.
  - cause: SDV synthesizers cannot directly process some column contents (e.g., complex objects, nested lists, mixed types), or metadata inference assigned an incorrect type.
  - fix: Preprocess unsupported columns into one of the supported types, and explicitly define column types in your `sdv.metadata` object to guide the synthesizer.
- NotEnoughDataError: Not enough data for synthesizer to learn from. Expected at least X rows but got Y rows.
  - cause: The dataset passed to `synthesizer.fit()` has too few rows for the selected synthesizer to learn the underlying patterns and statistical distributions.
  - fix: Provide enough rows (typically at least several dozen to a few hundred, depending on complexity). SDV is not designed for extremely small datasets.
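The preprocessing fix for unsupported column types can be sketched with plain pandas, before any SDV call (the column names here are illustrative, not from a real SDV dataset): flatten nested lists into delimited strings and parse date strings into real datetimes.

```python
import pandas as pd

df = pd.DataFrame({
    'guest_id': [1, 2, 3],
    'amenities': [['wifi', 'pool'], ['wifi'], []],          # nested lists: unsupported
    'checkin': ['2024-01-05', '2024-02-11', '2024-03-20'],  # strings, not datetimes
})

# Nested lists -> a single delimited string, which SDV can treat as categorical
df['amenities'] = df['amenities'].apply(','.join)

# Date strings -> proper datetime64 values
df['checkin'] = pd.to_datetime(df['checkin'])
```

After this, every column is a numeric, datetime, or string/categorical dtype that a synthesizer can ingest.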
Warnings
- breaking Synthesizer import paths changed in the SDV 1.0 API restructuring; the old `sdv.models` and `sdv.tabular` modules were removed. Use `sdv.single_table`, `sdv.multi_table`, or `sdv.sequential` instead.
- gotcha While SDV can infer metadata, explicit metadata definition is often crucial for higher quality synthetic data, especially with complex schemas or specific data types (e.g., primary keys, relationships, sensitive columns).
- gotcha Generating synthetic data for very large datasets (millions of rows) or complex multi-table schemas can be memory-intensive and time-consuming.
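To make the explicit-metadata gotcha concrete, here is a sketch of SDV's single-table metadata in its dictionary form. The column names are made up for illustration; the dictionary can then be loaded with `SingleTableMetadata.load_from_dict(metadata_dict)` instead of relying on automatic detection.

```python
# SDV single-table metadata in dictionary form (illustrative column names).
# Load with: SingleTableMetadata.load_from_dict(metadata_dict)
metadata_dict = {
    'primary_key': 'guest_id',
    'columns': {
        'guest_id': {'sdtype': 'id'},
        'checkin': {'sdtype': 'datetime', 'datetime_format': '%Y-%m-%d'},
        'room_rate': {'sdtype': 'numerical'},
        'room_type': {'sdtype': 'categorical'},
        'email': {'sdtype': 'email', 'pii': True},
    },
}
```

Declaring the primary key, datetime formats, and PII columns up front tends to matter most; inference can easily mistake an ID column for a numerical one.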
Install
-
pip install sdv
Imports
- GaussianCopulaSynthesizer
from sdv.single_table import GaussianCopulaSynthesizer
- SingleTablePreset (deprecated in recent SDV releases in favor of the full synthesizers)
from sdv.lite import SingleTablePreset
- download_demo
from sdv.datasets.demo import download_demo
Quickstart
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.datasets.demo import download_demo
# 1. Load a demo dataset (returns the data as a DataFrame plus its metadata)
real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests',
)
# 2. Initialize a synthesizer, passing the metadata
synthesizer = GaussianCopulaSynthesizer(metadata)
# 3. Fit the synthesizer to the real data
synthesizer.fit(real_data)
# 4. Sample synthetic data
synthetic_data = synthesizer.sample(num_rows=len(real_data))
print("Original data head:")
print(real_data.head())
print("\nSynthetic data head:")
print(synthetic_data.head())
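SDV provides its own quality reports, but a quick pandas-only sanity check after sampling can catch gross mismatches. This is a sketch; the `sanity_check` helper below is ours, not part of the SDV API.

```python
import pandas as pd

def sanity_check(real: pd.DataFrame, synthetic: pd.DataFrame) -> list:
    """Return human-readable issues; an empty list means the basics line up."""
    issues = []
    # Same columns, same order
    if list(real.columns) != list(synthetic.columns):
        issues.append('column sets or order differ')
    # Numeric columns should stay roughly within the real data's range
    for col in real.select_dtypes('number').columns:
        lo, hi = real[col].min(), real[col].max()
        span = hi - lo
        if synthetic[col].min() < lo - span or synthetic[col].max() > hi + span:
            issues.append(f'{col}: synthetic values far outside the real range')
    return issues
```

Run it as `sanity_check(real_data, synthetic_data)` after sampling; a non-empty result usually points at missing metadata constraints.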