Empirical Calibration
Empirical Calibration (EC) is a Python library (version 0.12) for correcting bias in data samples using generic weighting methods. It formulates calibration as a convex optimization problem, solved efficiently in its dual form, and is intended to reduce bias in fields such as survey sampling and causal studies with observational data. The library is actively maintained, with the latest release in May 2024 and ongoing development on GitHub.
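To make the "dual form" idea concrete, here is a minimal standard-library sketch of entropy calibration for a single covariate. It is an illustration of the general technique, not the library's implementation: the function name `entropy_calibrate_1d` and its parameters are hypothetical, and it assumes nonnegative weights that sum to one. In the dual, the weights are proportional to `exp(lam * (x_i - target_mean))`, so the search is over one number `lam` instead of one weight per row.

```python
import math


def entropy_calibrate_1d(x, target_mean, max_iter=100, tol=1e-10):
    """Illustrative only: entropy-calibration weights for one covariate.

    Solves the dual problem with Newton's method. Weights are proportional
    to exp(lam * (x_i - target_mean)) and are normalised to sum to one.
    """
    centered = [xi - target_mean for xi in x]
    lam = 0.0
    weights = [1.0 / len(x)] * len(x)
    for _ in range(max_iter):
        # Shift exponents by their maximum for numerical stability.
        exps = [lam * c for c in centered]
        shift = max(exps)
        raw = [math.exp(e - shift) for e in exps]
        total = sum(raw)
        weights = [r / total for r in raw]
        # Gradient of the dual: weighted mean of x minus the target mean.
        grad = sum(w * c for w, c in zip(weights, centered))
        if abs(grad) < tol:
            break
        # Second derivative of the dual: weighted variance of x.
        hess = sum(w * c * c for w, c in zip(weights, centered)) - grad ** 2
        if hess <= 0.0:
            break  # degenerate case: the covariate has no variation
        lam -= grad / hess
    return weights


x = [1.0, 2.0, 3.0, 4.0, 5.0]
w = entropy_calibrate_1d(x, target_mean=3.5)
calibrated_mean = sum(wi * xi for wi, xi in zip(w, x))
print(f"calibrated mean: {calibrated_mean:.6f}")
```

The plain Newton step has no line search, so this sketch is only suitable for well-behaved toy inputs where the target mean lies inside the range of `x`.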
Common errors
- ValueError: operands could not be broadcast together with shapes (X,) (Y,)
  cause This error often occurs when `covariates` and `target_covariates` have incompatible shapes or internal data structures that prevent proper element-wise operations during the calibration process. This could be due to different numbers of features or incorrect reshaping.
  fix Verify that `covariates` and `target_covariates` have the same number of columns (features) and that their internal representations (e.g., after `patsy.dmatrix` or `sklearn.preprocessing`) are compatible. Inspect the `.shape` attribute of the arrays passed to the calibration function.
- empirical_calibration.core.ConvergenceError: Maximum number of iterations reached.
  cause The iterative optimization algorithm failed to find a solution that satisfies the convergence criteria within the allowed number of iterations. This means the target covariate distribution could not be matched exactly or within tolerance.
  fix Increase `max_iter` (e.g., `max_iter=1000`) or relax `epsilon` (e.g., `epsilon=1e-3`) in the `maybe_exact_calibrate` function. Consider simplifying your covariates by grouping categories or binning continuous features, or re-evaluating if the target distribution is realistically achievable from the sample.
- KeyError: 'some_column_name'
  cause The library attempts to access a column in your `covariates` or `target_covariates` that does not exist. This typically happens if column names in pandas DataFrames are inconsistent or misspelled.
  fix Double-check that all column names in both `covariates_sample` and `target_covariates` are identical and spelled correctly. Print `df.columns` for both DataFrames to ensure alignment before passing them to the calibration function.
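The shape and column-name checks described above can be sketched as a small pre-flight helper. This is a hypothetical utility, not part of `empirical_calibration`; it works on plain sequences of column names, so with pandas you would pass `df.columns` from each DataFrame.

```python
def diff_columns(sample_columns, target_columns):
    """Hypothetical helper: report column mismatches between two inputs."""
    sample_columns = list(sample_columns)
    target_columns = list(target_columns)
    return {
        # Columns the sample has that the target lacks.
        "missing_in_target": [c for c in sample_columns if c not in target_columns],
        # Columns the target has that the sample lacks.
        "missing_in_sample": [c for c in target_columns if c not in sample_columns],
        # True only when names and order match exactly.
        "same_order": sample_columns == target_columns,
    }


report = diff_columns(["sex", "age"], ["age", "sex", "income"])
print(report)
```

An empty pair of "missing" lists with `same_order` set to `False` indicates that reordering the columns of one input is enough to align them.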
Warnings
- gotcha The `covariates` and `target_covariates` inputs should typically be pandas DataFrames or numpy arrays with consistent columns and order. Mismatched column names or different data types can lead to unexpected behavior or errors during internal preprocessing and optimization.
- gotcha The calibration optimization problem may not always converge, especially with highly disparate covariate distributions, sparse data, or certain objective choices. This results in a `ConvergenceError`.
- gotcha `empirical-calibration` is a distinct Python library. There is also an R package named 'EmpiricalCalibration' (e.g., by OHDSI) which addresses similar statistical concepts but has a different API and implementation. Do not confuse the two when searching for documentation or examples.
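One cheap way to spot a target that cannot be matched, as in the convergence gotcha above, is a range check. Assuming nonnegative weights that sum to one, a weighted mean can never fall outside the minimum and maximum of the sample values, so any covariate whose target mean lies outside that range cannot be calibrated. The helper name and the dict-of-lists layout below are hypothetical, and passing the check is necessary but not sufficient for convergence.

```python
def infeasible_covariates(sample, target_means):
    """Hypothetical check: covariates whose target mean is outside the sample range.

    sample: dict mapping covariate name to a list of sample values.
    target_means: dict mapping covariate name to the target population mean.
    """
    flagged = []
    for name, target in target_means.items():
        values = sample[name]
        if not (min(values) <= target <= max(values)):
            flagged.append(name)
    return flagged


sample = {"age": [18, 25, 31, 40], "sex": [0, 0, 1, 0]}
target_means = {"age": 52.0, "sex": 0.5}
flagged = infeasible_covariates(sample, target_means)
print(flagged)  # ['age']
```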
Install
- pip install empirical-calibration
- pip install -q git+https://github.com/google/empirical_calibration
Imports
- empirical_calibration
import empirical_calibration as ec
Quickstart
import numpy as np
import pandas as pd
import empirical_calibration as ec
# Create dummy covariate dataframes for demonstration
# In a real scenario, these would come from your biased sample and target population
covariates_sample = pd.DataFrame({
    'sex': np.random.choice([0, 1], size=100),
    'age': np.random.randint(18, 65, size=100)
})
target_covariates = pd.DataFrame({
    'sex': np.random.choice([0, 1], size=1000),
    'age': np.random.randint(18, 65, size=1000)
})
# Apply empirical calibration to compute weights
# Using ENTROPY objective as a common choice
try:
    weights, _ = ec.maybe_exact_calibrate(
        covariates=covariates_sample,
        target_covariates=target_covariates,
        objective=ec.Objective.ENTROPY
    )
    print(f"Successfully computed weights. First 5 weights: {weights[:5]}")
    print(f"Sum of weights: {np.sum(weights):.2f}")
except ec.ConvergenceError as e:
    print(f"Calibration did not converge: {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
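After computing weights, it is worth confirming that they actually balance the sample against the target. The sketch below compares a weighted sample mean with a target mean using only the standard library. The `weighted_mean` helper is hypothetical and the numbers are made up for illustration; they are not output from the library.

```python
def weighted_mean(values, weights):
    """Weighted mean that normalises by the sum of the weights."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


# Illustrative values only, not produced by empirical_calibration.
sample_age = [20, 30, 40, 50]
example_weights = [0.1, 0.2, 0.3, 0.4]
target_age_mean = 40.0

gap = weighted_mean(sample_age, example_weights) - target_age_mean
print(f"weighted mean minus target mean: {gap:.6f}")
```

With the quickstart variables, the same comparison would be `weighted_mean(covariates_sample['age'], weights)` against `target_covariates['age'].mean()`; a gap close to zero suggests the calibration matched that covariate.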