sparkxgb

0.4 verified Fri May 01 auth: no python maintenance

sparkxgb is a Python wrapper for XGBoost on Apache Spark that provides utilities for distributed training and prediction on Spark DataFrames. Version 0.4 is stable but infrequently updated; for newer APIs, use the official XGBoost Spark integration (xgboost.spark, available from XGBoost 1.7).

pip install sparkxgb
error ImportError: No module named sparkxgb
cause Package not installed or virtual environment not activated.
fix
Run: pip install sparkxgb
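If the import still fails after installing, the job may be running under a different interpreter than the one pip installed into. A minimal sketch for checking this with the standard library only; the helper name `describe_module` is hypothetical and not part of sparkxgb:

```python
import importlib.util
import sys


def describe_module(name: str) -> str:
    """Report whether a top-level module is importable from this interpreter."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        # Show the interpreter path so a mismatched environment is easy to spot.
        return f"{name} not found for interpreter {sys.executable}"
    return f"{name} found at {spec.origin}"


print(describe_module("sparkxgb"))
```

If the reported interpreter is not the expected one, installing with `python -m pip install sparkxgb` from that interpreter ties the install to it.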
error Py4JJavaError: An error occurred while calling o1234.fit. : java.lang.NoSuchMethodError
cause Incompatible PySpark version (sparkxgb 0.4 requires PySpark <3.0).
fix
Install PySpark 2.4.x: pip install pyspark==2.4.8
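Before downgrading, it can help to confirm which PySpark version the environment actually has. A small sketch, assuming Python 3.8+ for `importlib.metadata`; the helper `is_supported_pyspark` is a hypothetical name that simply encodes the PySpark <3.0 constraint stated above:

```python
from importlib.metadata import PackageNotFoundError, version


def is_supported_pyspark(pyspark_version: str) -> bool:
    """Return True when the major version is below 3 (the sparkxgb 0.4 constraint)."""
    major = int(pyspark_version.split(".")[0])
    return major < 3


try:
    installed = version("pyspark")
except PackageNotFoundError:
    print("pyspark is not installed in this environment")
else:
    status = "supported" if is_supported_pyspark(installed) else "too new for sparkxgb 0.4"
    print(f"pyspark {installed}: {status}")
```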
error IllegalArgumentException: requirement failed: Column features must be of type org.apache.spark.ml.linalg.VectorUDT but was actually org.apache.spark.mllib.linalg.VectorUDT.
cause The features column holds vectors from the legacy pyspark.mllib.linalg package instead of pyspark.ml.linalg.
fix
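Build vectors with pyspark.ml.linalg.Vectors (not pyspark.mllib.linalg) and assemble the features column with VectorAssembler from pyspark.ml.feature. Existing mllib vector columns can be converted with pyspark.mllib.util.MLUtils.convertVectorColumnsToML.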
breaking sparkxgb 0.4 requires PySpark <3.0 due to API changes in Spark 3.0+. Check your Spark version.
fix Downgrade PySpark to 2.4.x or use XGBoost's built-in Spark integration (xgboost >=1.7).
deprecated sparkxgb is no longer actively maintained; consider migrating to xgboost.spark (available from XGBoost 1.7+).
fix Replace imports: from xgboost.spark import SparkXGBClassifier
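A rough sketch of what the migrated estimator setup might look like, assuming xgboost >= 1.7 and pyspark are installed. The column names `features` and `label`, the helper name `build_classifier`, and the worker count are illustrative assumptions, not values taken from sparkxgb:

```python
def build_classifier(num_workers: int = 2):
    """Create an xgboost.spark classifier (needs xgboost >= 1.7 and pyspark)."""
    # Imported lazily so this module still loads where xgboost is not installed.
    from xgboost.spark import SparkXGBClassifier

    return SparkXGBClassifier(
        features_col="features",  # assumed name of the vector column
        label_col="label",        # assumed name of the label column
        num_workers=num_workers,  # number of Spark tasks used for training
    )


# Usage, assuming `train_df` and `test_df` are Spark DataFrames (hypothetical names):
#   model = build_classifier().fit(train_df)
#   predictions = model.transform(test_df)
```

The estimator follows the usual Spark ML fit/transform pattern, so existing pipeline code built around sparkxgb may need only the import and parameter names adjusted.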
gotcha Feature column must be of type VectorUDT (e.g., from VectorAssembler). Passing raw arrays fails silently with wrong predictions.
fix Always use VectorAssembler to create the 'features' column: from pyspark.ml.feature import VectorAssembler; assembler = VectorAssembler(inputCols=[...], outputCol='features')
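The assembler step can be wrapped in a small helper so every training and scoring path builds the column the same way. This is a sketch, assuming pyspark is installed; `add_features_column` and the example column names are hypothetical:

```python
def add_features_column(df, input_cols, output_col="features"):
    """Return `df` with `input_cols` packed into one pyspark.ml vector column."""
    # Imported lazily so this module still loads where pyspark is not installed.
    from pyspark.ml.feature import VectorAssembler

    assembler = VectorAssembler(inputCols=list(input_cols), outputCol=output_col)
    return assembler.transform(df)


# Usage, assuming `raw_df` has numeric columns "f1" and "f2" (hypothetical names):
#   train_df = add_features_column(raw_df, ["f1", "f2"])
#   train_df.printSchema()  # the "features" column should be listed as type vector
```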