Copulas
Copulas is a Python library for modeling multivariate distributions and sampling from them using copula functions. Given tabular numerical data, it learns the dependence structure and generates new synthetic data with similar statistical properties. It offers a range of univariate distributions along with Archimedean, Gaussian, and Vine copulas. It is part of The Synthetic Data Vault Project by DataCebo and is actively maintained with regular updates.
Common errors
- AttributeError: module 'copulas' has no attribute 'GaussianCopula'
  cause: Attempting to import a specific copula model (e.g., `GaussianCopula` or `GaussianMultivariate`) directly from the top-level `copulas` package.
  fix: Import multivariate and bivariate copula models from their specific submodules. For example, use `from copulas.multivariate import GaussianMultivariate` instead.
- ValueError: Input data must be numerical
  cause: The `copulas` library expects numerical input data. This error occurs when a DataFrame or array containing non-numerical (e.g., string, object, or boolean) columns is passed to a copula model.
  fix: Preprocess your data to ensure all columns intended for modeling are numerical. This may involve one-hot encoding categorical features, label encoding, or converting mixed-type columns. Remove or handle missing values appropriately.
- RuntimeError: The number of features in the data is X, but the copula expects Y.
  cause: Mismatch in dimensionality between the input data and the copula model, often when a previously fitted model (or one with a predefined structure) is used with new data of a different number of columns.
  fix: Ensure the input data (e.g., `pandas.DataFrame` or `numpy.ndarray`) has the same number of columns (features) as the copula model was originally fitted with, or explicitly define the model for the new dimensionality.
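The preprocessing fix for non-numerical input can be sketched with pandas alone. The helper name `to_numeric_frame` and the sample columns are hypothetical, and one-hot encoding is only one of the options listed above, so treat this as a minimal illustration rather than the library's own preprocessing.

```python
import pandas as pd


def to_numeric_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Return a float-only copy of df with no missing values (hypothetical helper)."""
    # Anything that is neither numeric nor boolean gets one-hot encoded.
    non_numeric = df.select_dtypes(exclude=["number", "bool"]).columns
    encoded = pd.get_dummies(df, columns=list(non_numeric), dtype=float)
    # Drop rows with missing values, then cast booleans and integers to float.
    return encoded.dropna().astype(float).reset_index(drop=True)


# Made-up example data: one missing value, one boolean and one string column.
raw = pd.DataFrame({
    "age": [34, 51, None, 29],
    "income": [52000.0, 61000.0, 48000.0, 75000.0],
    "member": [True, False, True, True],
    "city": ["Oslo", "Lima", "Oslo", "Pune"],
})

clean = to_numeric_frame(raw)
print(clean.dtypes)
print(clean.shape)
```

Keeping a copy of `clean.columns` from the data used for fitting and comparing it against any later frame is one simple way to catch a feature-count mismatch before it reaches the model.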
Warnings
- gotcha The Gaussian copula, a common choice, assumes an elliptical dependence structure and exhibits zero tail dependence. Applying it to data with strong non-linear or asymmetric tail dependencies (e.g., financial returns during market crashes) can significantly underestimate joint extreme events.
- gotcha The `copulas` library primarily expects numerical and stationary data. Direct application to raw categorical data or non-stationary time series (e.g., raw stock prices instead of returns) can lead to unreliable models and synthetic data quality issues.
- breaking The library is part of the SDV (Synthetic Data Vault) ecosystem and has undergone API changes. Older versions (e.g., prior to `v0.2.0`) had a different API for statistics methods, different input/output formats, and less robust implementations, so code written against newer versions may fail when run on them.
- gotcha Choosing the appropriate copula (e.g., Archimedean, Gaussian, Vine) and univariate distributions for high-dimensional or complex datasets is crucial and non-trivial. An inappropriate model choice may fail to capture the underlying data structure accurately, leading to synthetic data that does not truly resemble the real data.
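The tail-dependence warning about the Gaussian copula can be illustrated numerically. The sketch below uses NumPy only and does not call the `copulas` library. It draws one sample with Gaussian dependence and one with Clayton dependence (sampled by conditional inversion), both tuned to a Kendall's tau of about 0.5, and estimates how often both variables fall in their lower tail together. The function name `lower_tail_rate`, the sample size, and the parameter values are choices made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Gaussian dependence: correlated standard normals.
# rho = sin(pi * tau / 2) gives Kendall's tau = 0.5.
rho = np.sin(np.pi * 0.5 / 2)
z1 = rng.standard_normal(n)
z2 = rho * z1 + np.sqrt(1 - rho**2) * rng.standard_normal(n)

# Clayton dependence with theta = 2 (tau = theta / (theta + 2) = 0.5),
# sampled by inverting the conditional distribution of u2 given u1.
theta = 2.0
u1 = 1.0 - rng.random(n)  # values in (0, 1], avoids division by zero
w = 1.0 - rng.random(n)
u2 = (u1 ** -theta * (w ** (-theta / (1 + theta)) - 1) + 1) ** (-1 / theta)


def lower_tail_rate(x, y, q):
    """Estimate P(y in its lowest q fraction | x in its lowest q fraction)."""
    tx, ty = np.quantile(x, q), np.quantile(y, q)
    in_tail = x <= tx
    return float(np.mean(y[in_tail] <= ty))


for q in (0.1, 0.01, 0.001):
    print(q, lower_tail_rate(z1, z2, q), lower_tail_rate(u1, u2, q))

gaussian_rate = lower_tail_rate(z1, z2, 0.01)
clayton_rate = lower_tail_rate(u1, u2, 0.01)
```

As `q` shrinks, the Gaussian estimate keeps falling while the Clayton estimate stays near its theoretical limit of 2^(-1/theta). A Gaussian model fitted to data that behaves like the second sample would therefore understate joint extremes.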
Install
- pip install copulas
- conda install -c conda-forge copulas
Imports
- sample_trivariate_xyz
from copulas.datasets import sample_trivariate_xyz
- GaussianMultivariate
wrong: from copulas import GaussianMultivariate
correct: from copulas.multivariate import GaussianMultivariate
- compare_3d
from copulas.visualization import compare_3d
Quickstart
import pandas as pd
from copulas.datasets import sample_trivariate_xyz
from copulas.multivariate import GaussianMultivariate
import warnings
# Suppress FutureWarnings from certain dependencies for cleaner output
warnings.filterwarnings('ignore', category=FutureWarning)
# 1. Load a demo dataset (or your own pandas DataFrame)
real_data = sample_trivariate_xyz()
print("Original Data Head:\n", real_data.head())
# 2. Initialize and fit a multivariate copula model
copula = GaussianMultivariate()
copula.fit(real_data)
print("\nCopula model fitted successfully.")
# 3. Generate new synthetic data points
synthetic_data = copula.sample(len(real_data))
print("\nSynthetic Data Head:\n", synthetic_data.head())
# Optional: To visualize, uncomment the following lines and ensure a graphical environment
# from copulas.visualization import compare_3d
# compare_3d(real_data, synthetic_data, figsize=(10, 5))
# print("\nComparison plot generated (if running in a graphical environment).")
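One quick numerical check on sampled output is to compare the pairwise correlations of the real and synthetic tables. The sketch below uses pandas and NumPy only. The helper `max_correlation_gap` is hypothetical and not part of `copulas`, and the demo frames are random stand-ins so the block runs on its own. In practice you would pass `real_data` and `synthetic_data` from the quickstart.

```python
import numpy as np
import pandas as pd


def max_correlation_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> float:
    """Largest absolute difference between the two Pearson correlation matrices."""
    shared = [c for c in real.columns if c in synthetic.columns]
    gap = real[shared].corr() - synthetic[shared].corr()
    return float(gap.abs().to_numpy().max())


# Stand-in data: two frames with the same dependence, one with none.
rng = np.random.default_rng(42)
cov = [[1.0, 0.8], [0.8, 1.0]]
real_demo = pd.DataFrame(rng.multivariate_normal([0, 0], cov, size=5000), columns=["x", "y"])
good_demo = pd.DataFrame(rng.multivariate_normal([0, 0], cov, size=5000), columns=["x", "y"])
bad_demo = pd.DataFrame(rng.standard_normal((5000, 2)), columns=["x", "y"])

good_gap = max_correlation_gap(real_demo, good_demo)
bad_gap = max_correlation_gap(real_demo, bad_demo)
print("matching dependence:", good_gap)
print("independent columns:", bad_gap)
```

A small gap only says that linear correlations agree. It says nothing about marginal shapes or tail behavior, so it complements a visual comparison rather than replacing it.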