sparkxgb
0.4 | verified Fri May 01 | auth: no | python | maintenance
sparkxgb is a Python wrapper for XGBoost on Apache Spark, providing integration utilities for distributed training and prediction on Spark DataFrames. Version 0.4 is stable but infrequently updated; for newer APIs, use the official XGBoost Spark integration (xgboost.spark).
pip install sparkxgb
Common errors
error ImportError: No module named sparkxgb
cause Package not installed or virtual environment not activated.
fix
Run: pip install sparkxgb
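If the error persists after installing, the interpreter running your code may not be the one pip installed into. The sketch below is a minimal check using only the standard library; the helper name is_installed is illustrative and not part of sparkxgb.

```python
import importlib.util
import sys


def is_installed(module_name: str) -> bool:
    """Return True if the current interpreter can locate a top-level module."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Compare this path with the environment you ran `pip install` in.
    print("interpreter:", sys.executable)
    print("sparkxgb importable:", is_installed("sparkxgb"))
```

If the interpreter path is not the one you expected, activate the intended virtual environment and reinstall there.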
error Py4JJavaError: An error occurred while calling o1234.fit. : java.lang.NoSuchMethodError
cause Incompatible PySpark version (sparkxgb 0.4 requires PySpark <3.0).
fix
Install PySpark 2.4.x: pip install pyspark==2.4.8
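A version guard at startup can surface this mismatch before any job is submitted. The following is a sketch only: it takes the "<3.0" limit from this page as given, and the helper name is_supported_pyspark is hypothetical.

```python
def is_supported_pyspark(version: str) -> bool:
    """Return True if `version` is below 3.0, the limit stated for sparkxgb 0.4."""
    major = int(version.split(".")[0])
    return major < 3


if __name__ == "__main__":
    try:
        import pyspark
    except ImportError:
        print("pyspark is not installed")
    else:
        ok = is_supported_pyspark(pyspark.__version__)
        status = "ok" if ok else "too new for sparkxgb 0.4"
        print(f"pyspark {pyspark.__version__}: {status}")
```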
error IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT but was actually org.apache.spark.mllib.linalg.VectorUDT.
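cause The features column was built with the legacy pyspark.mllib.linalg vector type instead of the pyspark.ml.linalg type.
fix
Build vectors with pyspark.ml.linalg.Vectors, or assemble the column with VectorAssembler from pyspark.ml.feature.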
Warnings
breaking sparkxgb 0.4 requires PySpark <3.0 due to API changes in Spark 3.0+. Check your Spark version.
fix Downgrade PySpark to 2.4.x or use XGBoost's built-in Spark integration (xgboost >=1.7).
deprecated sparkxgb is no longer actively maintained; consider migrating to xgboost.spark (available from XGBoost 1.7+).
fix Replace imports: from xgboost.spark import SparkXGBClassifier
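Migrating involves more than the import line, because xgboost.spark uses snake_case parameter names such as features_col and label_col. The sketch below assumes xgboost >=1.7 with PySpark installed; the wrapper make_classifier and the column names 'features' and 'label' are illustrative choices, not something this page prescribes.

```python
def make_classifier(num_workers: int = 1):
    """Build a SparkXGBClassifier that reads the 'features' and 'label' columns."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    # Imported lazily so this module still loads where xgboost is absent.
    from xgboost.spark import SparkXGBClassifier

    return SparkXGBClassifier(
        features_col="features",
        label_col="label",
        num_workers=num_workers,
    )


# Typical use, assuming train_df and test_df are Spark DataFrames with a
# vector 'features' column and a numeric 'label' column:
#   model = make_classifier(num_workers=2).fit(train_df)
#   predictions = model.transform(test_df)
```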
gotcha The feature column must be of type VectorUDT (e.g., produced by VectorAssembler). Passing raw arrays fails silently and produces wrong predictions.
fix Always use VectorAssembler to create the 'features' column: from pyspark.ml.feature import VectorAssembler; assembler = VectorAssembler(inputCols=[...], outputCol='features')
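The fix above can be wrapped in a small helper so every pipeline builds the column the same way. This is a sketch assuming PySpark is installed; the function name assemble_features and the example column names in the comments are hypothetical.

```python
def assemble_features(df, input_cols, output_col="features"):
    """Return `df` with an added vector column built from `input_cols`."""
    if not input_cols:
        raise ValueError("input_cols must name at least one column")
    # Imported lazily so this module still loads where pyspark is absent.
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=list(input_cols), outputCol=output_col)
    return assembler.transform(df)


# Typical use, assuming raw_df is a Spark DataFrame with numeric
# columns 'age' and 'income':
#   train_df = assemble_features(raw_df, ["age", "income"])
#   train_df.printSchema()  # the 'features' column should be of type vector
```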
Imports
- XGBoostEstimator
  from sparkxgb import XGBoostEstimator
- XGBoostClassificationModel
  from sparkxgb import XGBoostClassificationModel
- XGBoostRegressor
  correct (newer API): from xgboost.spark import SparkXGBRegressor