{"id":4377,"library":"phik","title":"Phi_K correlation analyzer library","description":"Phi_K is a practical correlation coefficient that works consistently across categorical, ordinal, and interval variables. It builds on Pearson's hypothesis test of independence, captures non-linear dependencies, and reverts to the Pearson correlation coefficient for bivariate normal input distributions. The current version, 0.12.5, was released in July 2025. Releases arrive every few months to a year and mainly add Python version support and bug fixes.","status":"active","version":"0.12.5","language":"en","source_language":"en","source_url":"https://github.com/KaveIO/PhiK","tags":["correlation","statistics","data analysis","categorical data","numerical data","mixed data types","pandas"],"install":[{"cmd":"pip install phik","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Essential for DataFrame operations and integrating phik methods.","package":"pandas"},{"reason":"Core numerical computing library, foundational for statistical calculations.","package":"numpy"},{"reason":"Used for statistical functions, including the underlying correlation calculations.","package":"scipy"},{"reason":"Enables parallel processing for certain phik calculations (e.g., using `njobs`).","package":"joblib"},{"reason":"Used for plotting correlation matrices and reports.","package":"matplotlib","optional":true}],"imports":[{"symbol":"phik_matrix","correct":"import pandas as pd\nimport phik\n\ndf.phik_matrix()"},{"symbol":"report.correlation_report","correct":"from phik import report\n\nreport.correlation_report(df)"},{"symbol":"phik_from_hist2d","correct":"from phik.phik import phik_from_hist2d"}],"quickstart":{"code":"import pandas as pd\nimport phik\nfrom phik import resources, report\n\n# Load example data\ndf = pd.read_csv(resources.fixture('fake_insurance_data.csv.gz'))\n\n# Calculate the phi_k correlation matrix\nphik_corr = 
df.phik_matrix()\nprint(phik_corr.head())\n\n# Calculate the significance matrix\nsignificance_matrix = df.significance_matrix()\nprint(significance_matrix.head())\n\n# Generate and save a correlation report (requires matplotlib)\n# report.correlation_report(df, pdf_file_name='phik_report.pdf')\n","lang":"python","description":"This quickstart loads a sample dataset, then calculates the Phi_K correlation matrix and the corresponding significance matrix. It also shows how to generate a comprehensive correlation report (commented out because it saves a PDF locally and requires matplotlib)."},"warnings":[{"fix":"Upgrade Python to version 3.9 or later.","message":"Python 3.7 and 3.8 support has been dropped. Version 0.12.4 dropped 3.7, and 0.12.5 dropped 3.8. Ensure your Python environment is 3.9 or newer.","severity":"breaking","affected_versions":">=0.12.4"},{"fix":"Ensure `phik` and `scipy` versions are compatible. Upgrade `phik` to 0.12.5 or newer if using a recent `scipy`.","message":"In version 0.12.5, `phik` migrated from `scipy.stats.mvn` to `scipy.stats.qmvn` because `mvn` is deprecated in newer SciPy versions. Using an older `phik` version with a newer `scipy`, or vice versa, might lead to compatibility issues.","severity":"breaking","affected_versions":"0.12.5"},{"fix":"Ensure your system has the necessary C++ compilers (e.g., GCC, Clang, MSVC) and `pybind11` development headers if you intend to use the hypergeometric method. Pre-built wheels for common operating systems are usually available.","message":"The optional C++ extension for computing the significance matrix (hypergeometric/Patefield method) might not build during a manual `pip install` on some systems. If the build fails, `phik` will install without it, and attempting to use this method will raise a `NotImplementedError`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Carefully consider and test different binning strategies for interval variables relevant to your analysis. 
Use the `bins` parameter in methods like `phik_matrix()`.","message":"The calculated Phi_K correlation value for interval (continuous) variables depends on the chosen binning. The default is 10 uniform bins, and custom binning can significantly alter results.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Use the `njobs` parameter in `phik_matrix()` and `significance_matrix()` to enable parallel processing and speed up computations, or process data in chunks if memory is also a concern.","message":"Phi_K correlation is computationally expensive, especially for large datasets, due to the underlying integral calculations. This can lead to longer processing times.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Supplement Phi_K analysis with other methods or visualizations (e.g., scatter plots, contingency tables) to understand the nature and direction of dependencies between variables.","message":"Phi_K values range from 0 to 1 and indicate only the strength of a relationship, not its direction (e.g., positive or negative correlation).","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}