Phi_K correlation analyzer library
Phi_K is a practical correlation coefficient that works consistently across categorical, ordinal, and interval variables. It is based on refinements to Pearson's hypothesis test of independence, captures non-linear dependencies, and reverts to the Pearson correlation coefficient for a bivariate normal input distribution. The current version, 0.12.5, was released in July 2025. Releases arrive every few months to a year and mainly add support for new Python versions and fix bugs.
Warnings
- breaking Python 3.7 and 3.8 support has been dropped. Version 0.12.4 dropped 3.7, and 0.12.5 dropped 3.8. Ensure your Python environment is 3.9 or newer.
- breaking In version 0.12.5, `phik` switched its internal use of `scipy.stats.mvn` to `scipy.stats.qmvn`, because `mvn` is deprecated in newer SciPy versions. Pairing an older `phik` with a newer `scipy`, or the reverse, can therefore cause compatibility errors.
- gotcha The optional C++ extension for computing the significance matrix (hypergeometric/Patefield method) might not build during a manual `pip install` on some systems. If it fails, `phik` will install without it, and attempting to use this method will raise a `NotImplementedError`.
- gotcha The calculated Phi_K correlation value for interval (continuous) variables is dependent on the chosen binning. The default is 10 uniform bins, but custom binning can significantly alter results.
- gotcha Phi_K correlation is computationally expensive, especially for large datasets, due to the underlying integral calculations. This can lead to longer processing times.
- gotcha Phi_K values range from 0 to 1 and do not indicate the direction of a relationship (e.g., positive or negative correlation), only its strength.
Install
pip install phik
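Because recent releases require Python 3.9 or newer, it can help to check the environment before or after installing. This is a small sketch using only the standard library; the helper names `python_is_supported` and `installed_phik_version` are hypothetical and not part of `phik`.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 9)  # phik 0.12.5 no longer supports Python 3.7 or 3.8


def python_is_supported(version_info=sys.version_info, minimum=MIN_PYTHON):
    """Return True if the (major, minor) version meets the minimum."""
    return tuple(version_info[:2]) >= tuple(minimum)


def installed_phik_version():
    """Return the installed phik version string, or None if phik is absent."""
    try:
        return metadata.version("phik")
    except metadata.PackageNotFoundError:
        return None


print("Python supported:", python_is_supported())
print("phik version:", installed_phik_version())
```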
Imports
- phik_matrix
import pandas as pd
import phik
df.phik_matrix()
- report.correlation_report
from phik import report
report.correlation_report(df)
- phik_from_hist2d
from phik.phik import phik_from_hist2d
Quickstart
import pandas as pd
import phik
from phik import resources, report
# Load example data
df = pd.read_csv(resources.fixture('fake_insurance_data.csv.gz'))
# Calculate the phi_k correlation matrix
phik_corr = df.phik_matrix()
print(phik_corr.head())
# Calculate the significance matrix
significance_matrix = df.significance_matrix()
print(significance_matrix.head())
# Generate and save a correlation report (requires matplotlib)
# report.correlation_report(df, pdf_file_name='phik_report.pdf')