{"id":6801,"library":"pyhdfe","title":"pyhdfe","description":"pyhdfe is a Python library for absorbing high-dimensional fixed effects, implementing algorithms that include the iterative approach developed by Gaure (2013). It is primarily used in econometrics and statistics to residualize variables before estimating models with several high-dimensional fixed effects. The current version is 0.2.0; releases are infrequent.","status":"active","version":"0.2.0","language":"en","source_language":"en","source_url":"https://github.com/jeffgortmaker/pyhdfe","tags":["econometrics","fixed-effects","panel-data","statistics","high-dimensional","regression"],"install":[{"cmd":"pip install pyhdfe","lang":"bash","label":"Install stable release"}],"dependencies":[{"reason":"Numerical operations and array handling.","package":"numpy"},{"reason":"Scientific computing routines such as sparse matrix operations.","package":"scipy"},{"reason":"Convenient for preparing input data; IDs and variables are passed to pyhdfe as arrays.","package":"pandas"}],"imports":[{"note":"The public API is exposed at the package top level; `pyhdfe.create` builds the algorithm object used for absorption.","symbol":"pyhdfe","correct":"import pyhdfe"}],"quickstart":{"code":"import numpy as np\nimport pyhdfe\n\n# Create some synthetic data\nrng = np.random.default_rng(42)\nn_obs = 1000\nn_fixed_effects = 3\n\nX = rng.random((n_obs, 5))\ny = rng.random((n_obs, 1))\n\n# Build a 2-D array of integer IDs: one row per observation, one column per\n# fixed effect, each with a different number of levels\nids = np.column_stack([\n    rng.integers(0, rng.integers(50, 200), n_obs)\n    for _ in range(n_fixed_effects)\n])\n\n# Set up the absorption algorithm for these fixed effects.\n# drop_singletons=False keeps every observation, so output shapes match the inputs.\nalgorithm = pyhdfe.create(ids, drop_singletons=False)\n\n# Residualize y and X together; the columns keep their order\nresidualized = algorithm.residualize(np.column_stack([y, X]))\ny_transformed = residualized[:, [0]]\nX_transformed = residualized[:, 1:]\n\nprint(f\"Original X shape: {X.shape}\")\nprint(f\"Transformed X shape: {X_transformed.shape}\")\nprint(f\"Original y shape: {y.shape}\")\nprint(f\"Transformed y shape: {y_transformed.shape}\")","lang":"python","description":"This quickstart uses `pyhdfe.create` to set up absorption of several high-dimensional fixed effects and then calls `residualize` on the resulting algorithm object. It generates synthetic data with three fixed effect ID columns, stacks the target `y` and the feature matrix `X` into one matrix, and splits the residualized result back into transformed `y` and `X`."},"warnings":[{"fix":"If memory becomes a bottleneck, consider using more memory-efficient data types, residualizing the variable columns in batches, or downsampling. Passing inputs as plain NumPy arrays rather than copies of larger DataFrames also avoids unnecessary duplication.","message":"For extremely large datasets or fixed effects with a very large number of levels, memory consumption can still be significant, despite the use of sparse matrix operations. Monitor memory usage carefully.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If convergence issues arise, consider raising the iteration limit or loosening the tolerance where your version exposes such settings (for example through the `options` argument of `pyhdfe.create`; check the documentation for the exact keys), or try a different `residualize_method`. Review the structure of your fixed effects for potential issues, or simplify the model if necessary.","message":"The iterative algorithms used for absorption can converge slowly or fail to converge for certain data structures or with highly collinear fixed effects. This is a common challenge for iterative solvers.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure that `ids` has shape `(n_obs, n_fixed_effects)`. If the fixed effects are held as separate 1-D arrays or pandas Series, combine them first, for example with `np.column_stack([fe1, fe2, ...])`. Each column should contain the categorical identifiers for that specific fixed effect.","message":"The `ids` argument of `pyhdfe.create` expects a two-dimensional array-like with one row per observation and one column per fixed effect, and `residualize` expects a two-dimensional matrix. Incorrect formatting (e.g., passing a list of separate 1-D arrays, or a 1-D vector to `residualize`) can lead to errors.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-15T00:00:00.000Z","next_check":"2026-07-14T00:00:00.000Z","problems":[]}