SDV: Synthetic Data Vault

1.36.0 · active · verified Fri Apr 17

SDV (Synthetic Data Vault) is a Python library for generating synthetic data from single tables, multi-table relational datasets, and sequential data. It provides models and tools for creating synthetic data that preserves the statistical properties of the original data while protecting its privacy. As of version 1.36.0, it is actively developed, with regular releases that add features and improve existing models.

Quickstart

This quickstart demonstrates how to load a demo dataset, initialize a `GaussianCopulaSynthesizer` with the dataset's metadata, fit the synthesizer to the real data, and then sample synthetic data. This is a common workflow for single-table synthetic data generation.

from sdv.datasets.demo import download_demo
from sdv.single_table import GaussianCopulaSynthesizer

# 1. Download a demo dataset. download_demo returns a (data, metadata) pair:
#    a pandas DataFrame and the metadata object that describes its columns.
#    'fake_hotel_guests' is a single-table demo dataset used in the SDV docs.
real_data, metadata = download_demo(
    modality='single_table',
    dataset_name='fake_hotel_guests'
)

# 2. Initialize a synthesizer, passing the metadata
synthesizer = GaussianCopulaSynthesizer(metadata)

# 3. Fit the synthesizer to the real data
synthesizer.fit(real_data)

# 4. Sample as many synthetic rows as there are real rows
synthetic_data = synthesizer.sample(num_rows=len(real_data))

print("Original data head:")
print(real_data.head())
print("\nSynthetic data head:")
print(synthetic_data.head())
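
For intuition about what the fit and sample steps do, here is a toy two-column Gaussian copula written with only the Python standard library. It is a conceptual sketch, not SDV's implementation: the helper names (`to_normal_scores`, `fit`, `sample`), the use of empirical quantiles for the marginals, and the made-up age/income data are all assumptions chosen for illustration. The real `GaussianCopulaSynthesizer` handles many column types and differs in many details.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def to_normal_scores(column):
    """Map each value to a standard-normal score via its empirical rank."""
    n = len(column)
    order = sorted(range(n), key=lambda i: column[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order):
        # (rank + 0.5) / n keeps the probability strictly inside (0, 1)
        scores[idx] = STD_NORMAL.inv_cdf((rank + 0.5) / n)
    return scores


def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def fit(col_a, col_b):
    """'Fit' step: store the sorted marginals and the normal-score correlation."""
    rho = pearson(to_normal_scores(col_a), to_normal_scores(col_b))
    rho = max(-1.0, min(1.0, rho))  # guard against floating-point overshoot
    return {"a": sorted(col_a), "b": sorted(col_b), "rho": rho}


def empirical_quantile(sorted_values, p):
    """Return the value at probability p in a sorted sample."""
    idx = min(int(p * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]


def sample(model, num_rows, rng):
    """'Sample' step: draw correlated normals, then map back to the marginals."""
    rho = model["rho"]
    rows = []
    for _ in range(num_rows):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        rows.append((empirical_quantile(model["a"], u1),
                     empirical_quantile(model["b"], u2)))
    return rows


# Made-up "real" data: income loosely increases with age
rng = random.Random(0)
ages = [rng.randint(18, 80) for _ in range(500)]
incomes = [1000 * age + rng.gauss(0, 5000) for age in ages]

model = fit(ages, incomes)
synthetic = sample(model, num_rows=500, rng=rng)

synthetic_corr = pearson([a for a, _ in synthetic], [b for _, b in synthetic])
print("fitted correlation:", round(model["rho"], 3))
print("synthetic correlation:", round(synthetic_corr, 3))
```

Each synthetic row pairs values taken from the original marginals, but the pairing is redrawn from the fitted correlation, so no row is copied as a unit from the made-up input.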
