CTGAN
CTGAN is a Python library implementing a Conditional Generative Adversarial Network (GAN) specifically designed for synthesizing tabular data. It learns from real datasets to generate high-fidelity synthetic data, addressing challenges like mixed data types and imbalanced categorical columns. The library is actively maintained, with version 0.12.1 released in February 2026, and is part of the broader SDV (Synthetic Data Vault) ecosystem.
Common errors
-
ValueError: Input data contains NaN values. CTGAN cannot handle missing values.
cause The input DataFrame passed to `ctgan.fit()` contains null (NaN) values.
fix Preprocess your data to fill or remove NaN values before fitting. Common approaches include `df.dropna()` or `df.fillna(value)` with an appropriate strategy (e.g., mean, median, mode, or a constant).
-
KeyError: 'column_name' not found in discrete_columns list.
cause A name in the `discrete_columns` list passed to the CTGAN model does not match any column in your input DataFrame.
fix Check that every name listed in `discrete_columns` exactly matches a column name in your input pandas DataFrame. Pay attention to case sensitivity, stray whitespace, and typos.
-
Generator loss is becoming negative during training.
cause This is an often misunderstood aspect of GAN training. CTGAN trains with a Wasserstein-style loss, which is not bounded below by zero, so a negative generator loss is expected. It usually indicates that the generator is getting better at fooling the discriminator, which is a desirable outcome.
fix No fix is needed. Continue monitoring the training process. A stable negative generator loss alongside a discriminator loss oscillating around zero generally signifies successful training. Diverging or exploding losses, by contrast, are a cause for concern.
-
Model does not seem to converge / Loss values are unstable or not improving.
cause Training a GAN can be challenging. Instability can come from too few epochs, unsuitable hyperparameters, or inherent complexity or quality issues in the dataset.
fix Increase the number of `epochs`. Experiment with `CTGAN` hyperparameters such as `batch_size`, `generator_dim`, `discriminator_dim`, `generator_lr`, and `discriminator_lr`. Check your data quality, and keep in mind CTGAN's limitations with high-cardinality or heavily skewed data.
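The first two errors above can be caught before calling `fit`. The following is a minimal sketch using only pandas. The helper names `check_ctgan_input` and `fill_missing` are hypothetical and not part of the ctgan API, and the median/mode fill is just one possible strategy that assumes every non-discrete column is numeric.

```python
import pandas as pd


def check_ctgan_input(df: pd.DataFrame, discrete_columns: list) -> list:
    """Hypothetical helper: return a list of problems; empty means none found."""
    problems = []

    # Names in discrete_columns must match DataFrame columns exactly (case-sensitive).
    missing = [col for col in discrete_columns if col not in df.columns]
    if missing:
        problems.append(f"discrete_columns not found in DataFrame: {missing}")

    # List the columns that still contain NaN so they can be filled or dropped.
    nan_columns = [col for col in df.columns if df[col].isna().any()]
    if nan_columns:
        problems.append(f"columns containing NaN: {nan_columns}")

    return problems


def fill_missing(df: pd.DataFrame, discrete_columns: list) -> pd.DataFrame:
    """Hypothetical helper: mode for discrete columns, median for the rest."""
    cleaned = df.copy()
    for col in cleaned.columns:
        if not cleaned[col].isna().any():
            continue
        if col in discrete_columns:
            # Assumes the column has at least one non-null value.
            cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])
        else:
            # Assumes a numeric column.
            cleaned[col] = cleaned[col].fillna(cleaned[col].median())
    return cleaned


demo = pd.DataFrame({
    "age": [25.0, None, 40.0],
    "sex": ["F", "M", None],
})
print(check_ctgan_input(demo, ["sex", "Sex"]))  # reports the typo and the NaNs
cleaned = fill_missing(demo, ["sex"])
print(check_ctgan_input(cleaned, ["sex"]))      # no problems left
```

Run the check on your own DataFrame right before `ctgan.fit(...)`. Whether to drop or fill missing values depends on the dataset, so treat `fill_missing` as a starting point rather than a recommendation.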
Warnings
- gotcha When using CTGAN directly (not through SDV), manual data preprocessing is often required. Continuous columns must be floats, discrete columns must be integers or strings, and the data must not contain any missing values (NaNs).
- gotcha CTGAN generates float values for all numerical columns. If your original data contains integer columns that require integer output, you must manually round the generated synthetic values.
- gotcha CTGAN can struggle with high-cardinality features, highly skewed distributions, or very small datasets. Performance may be less accurate in these scenarios.
- gotcha CTGAN does not inherently handle primary key/foreign key constraints or other complex relational data integrity rules. The generated data may violate such constraints if not enforced externally.
- deprecated The `loss_values` attribute of a trained CTGAN model changed from returning `torch.Tensors` to standard Python floats.
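For the integer-column gotcha above, a small post-processing step can restore integer dtypes after sampling. This is a sketch using only pandas. `restore_integer_columns` is a hypothetical helper, not part of ctgan, and it assumes the synthetic DataFrame has the same column names as the real one and contains no NaN values.

```python
import pandas as pd
from pandas.api.types import is_integer_dtype


def restore_integer_columns(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: round and cast columns that are integers in the real data."""
    result = synthetic.copy()
    for col in real.columns:
        if col in result.columns and is_integer_dtype(real[col]):
            # Round first so the cast does not simply truncate toward zero.
            result[col] = result[col].round().astype(real[col].dtype)
    return result


# Toy data standing in for a real table and a sampled synthetic table.
real = pd.DataFrame({"age": [23, 35, 61], "income": [1200.5, 3100.0, 870.25]})
synthetic = pd.DataFrame({"age": [30.7, 44.2, 19.4], "income": [990.1, 2500.9, 1410.0]})

fixed = restore_integer_columns(real, synthetic)
print(fixed.dtypes)
```

The helper leaves float columns untouched. It does not clip values to the range seen in the real data, so add that separately if out-of-range integers matter for your use case.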
Install
-
pip install ctgan
Imports
- CTGAN
from ctgan import CTGAN
- CTGANSynthesizer
from ctgan.synthesizers import CTGANSynthesizer
from ctgan import CTGANSynthesizer
- load_demo
from ctgan import load_demo
Quickstart
import pandas as pd
from ctgan import CTGAN, load_demo
# Load demo data (Adult Census Dataset) or replace with your own DataFrame
real_data = load_demo()
# Identify discrete columns
discrete_columns = [
    'workclass', 'education', 'marital-status', 'occupation',
    'relationship', 'race', 'sex', 'native-country', 'income'
]
# Initialize and train the CTGAN model
# Set verbose=True to see training progress
ctgan = CTGAN(epochs=10, verbose=True)
ctgan.fit(real_data, discrete_columns)
# Generate synthetic data
synthetic_data = ctgan.sample(1000)
print("Original data head:")
print(real_data.head())
print("\nSynthetic data head:")
print(synthetic_data.head())