Sparkling Water (H2O PySparkling 3.1)
3.46.0.6.post1 | verified Fri May 01 | auth: no | python
Sparkling Water integrates H2O's Fast Scalable Machine Learning with Apache Spark, enabling scalable ML workflows. Current version: 3.46.0.6.post1. Release cadence follows H2O-3 major/minor releases.
pip install h2o-pysparkling-3.1
Common errors
error ModuleNotFoundError: No module named 'pysparkling'
cause Package not installed or wrong environment (Python vs PySpark venv).
fix
Install h2o-pysparkling-3.1 in the same environment where PySpark runs: pip install h2o-pysparkling-3.1
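To confirm which interpreter is running and whether it can see the package, you can ask the import system directly. This is a minimal sketch using only the standard library; the helper name `describe_module` is illustrative and not part of PySparkling.

```python
import importlib.util
import sys


def describe_module(name):
    """Return (interpreter path, file location of `name`, or None if not importable)."""
    spec = importlib.util.find_spec(name)
    location = spec.origin if spec is not None else None
    return sys.executable, location


if __name__ == "__main__":
    interpreter, location = describe_module("pysparkling")
    print(f"interpreter: {interpreter}")
    print(f"pysparkling: {location or 'not found in this environment'}")
```

Run it with the same interpreter PySpark uses (for example, the one that the PYSPARK_PYTHON environment variable points to). If the location prints as not found, install the package into that interpreter's environment.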
error IllegalArgumentException: requirement failed: Wrong FS: hdfs://... expected file:///
cause H2O attempted to load data from HDFS, but Spark's default filesystem is not configured for HDFS.
fix
Set Spark config: spark.hadoop.fs.defaultFS=hdfs://namenode:8020 or use local files with file://
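The fix above can be sketched in Python as follows. The helper names (`hdfs_conf`, `with_scheme`, `build_session`) are hypothetical, and `hdfs://namenode:8020` is the placeholder address from the fix, not a real endpoint. PySpark is imported lazily, so the two pure helpers work without Spark installed.

```python
from urllib.parse import urlparse


def hdfs_conf(namenode_uri):
    """Spark options that make `namenode_uri` the default Hadoop filesystem."""
    parsed = urlparse(namenode_uri)
    if parsed.scheme != "hdfs" or not parsed.netloc:
        raise ValueError(f"expected hdfs://host:port, got {namenode_uri!r}")
    # Spark copies spark.hadoop.* keys into the Hadoop configuration
    return {"spark.hadoop.fs.defaultFS": f"hdfs://{parsed.netloc}"}


def with_scheme(path, scheme="file"):
    """Prefix a bare absolute POSIX path with a scheme; leave full URIs untouched."""
    if urlparse(path).scheme:
        return path
    return f"{scheme}://{path}"


def build_session(app_name, options):
    """Create a SparkSession with the given options (requires pyspark)."""
    from pyspark.sql import SparkSession  # lazy import: only needed here

    builder = SparkSession.builder.appName(app_name)
    for key, value in options.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

For example, `build_session('app', hdfs_conf('hdfs://namenode:8020'))` applies the HDFS setting, while `with_scheme('/data/train.csv')` yields an explicit `file:///data/train.csv` path for local files.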
error py4j.protocol.Py4JJavaError: An error occurred while calling o135.start. : java.lang.UnsupportedClassVersionError: h2o/water/... Unsupported major.minor version 52.0
cause Java version mismatch. Class file version 52.0 corresponds to Java 8, so the running JVM is older than the Java 8 required for this version.
fix
Set JAVA_HOME to Java 8: export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
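The number in the error message is a class file version, not a Java version. A small sketch of the conversion, assuming the standard offset of 44 between the two (52 maps to Java 8, 55 to Java 11, 61 to Java 17); the function names are illustrative.

```python
import re


def java_release_for_class_version(major):
    """Map a class-file major version to the Java release that produces it."""
    if major < 45:
        raise ValueError("class-file major versions start at 45")
    return major - 44


def required_java_from_error(message):
    """Extract the class-file version from an UnsupportedClassVersionError message."""
    match = re.search(r"version (\d+)\.\d+", message)
    if match is None:
        raise ValueError("no class-file version found in message")
    return java_release_for_class_version(int(match.group(1)))
```

Passing the error text above to `required_java_from_error` tells you the minimum Java release the failing classes were compiled for.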
Warnings
breaking Each h2o-pysparkling-X.Y package targets Spark X.Y.x: h2o-pysparkling-3.2 requires Spark 3.2.x, and h2o-pysparkling-3.1 requires Spark 3.1.x. Running a package against a different Spark minor version causes runtime errors.
fix Match the major.minor suffix of the h2o-pysparkling package to your Spark version. For Spark 3.1.x, use h2o-pysparkling-3.1.
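The matching rule can be written as a small helper. This sketch only applies the naming convention described above; it does not check that a package exists on PyPI for a given Spark version, and `pysparkling_package` is a hypothetical name.

```python
def pysparkling_package(spark_version):
    """Derive the h2o-pysparkling package name from a Spark version string."""
    parts = spark_version.split(".")
    if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
        raise ValueError(f"expected a version like '3.1.2', got {spark_version!r}")
    # Only major.minor is used; the patch level does not affect the package name
    return f"h2o-pysparkling-{parts[0]}.{parts[1]}"
```

With PySpark installed, `pysparkling_package(pyspark.__version__)` gives the name to pass to pip.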
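deprecated The H2OContext API has changed across releases. Older code that constructs H2OContext(sc) directly, or passes a SparkContext to getOrCreate, may fail on newer releases.
fix Create the context through the H2OContext.getOrCreate factory. Recent releases take an optional H2OConf rather than a SparkContext, while older ones accepted spark.sparkContext. Check the Sparkling Water changelog for the exact change in your version.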
gotcha PySparkling requires Java 8 or 11. Java 17+ is not supported and will cause cryptic errors.
fix Set JAVA_HOME to Java 8 or 11 before starting Spark.
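A pre-flight check can catch an unsupported JVM before Spark starts. The sketch below parses the output of `java -version`; the supported set comes from the note above, and the function names are illustrative.

```python
import re
import subprocess

SUPPORTED_JAVA = {8, 11}  # taken from the note above


def parse_java_major(version_output):
    """Extract the major Java release from `java -version` output."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        raise ValueError("could not find a version string")
    first, second = int(match.group(1)), match.group(2)
    # Java 8 and earlier report themselves as 1.x (for example 1.8.0_292)
    if first == 1 and second is not None:
        return int(second)
    return first


def current_java_major(java="java"):
    """Run `java -version` and return the major release of that JVM."""
    # `java -version` writes its report to stderr
    result = subprocess.run([java, "-version"], capture_output=True, text=True, check=True)
    return parse_java_major(result.stderr or result.stdout)
```

A launcher script could call `current_java_major()` and stop early if the result is not in `SUPPORTED_JAVA`. Note that this checks the `java` found on PATH, which may differ from the JVM that JAVA_HOME selects.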
gotcha H2OContext must be initialized inside a Spark context (e.g., in a PySpark shell or Spark job). Running outside Spark (plain Python) fails with 'No SparkContext found'.
fix Run code via spark-submit or pyspark shell.
Imports
- H2OContext
wrong from h2o import H2OContext
correct from pysparkling import H2OContext
- HC
wrong from h2o import HC
correct from pysparkling import HC
Quickstart
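from pyspark.sql import SparkSession
from pysparkling import H2OContext

# Reuse the active SparkSession or create a new one
spark = SparkSession.builder.appName('app').getOrCreate()

# getOrCreate() starts the H2O services on top of the active Spark session.
# Recent releases take an optional H2OConf here; older releases accepted
# spark.sparkContext instead, so check the changelog for your version.
h2o_context = H2OContext.getOrCreate()

# Printing the context is expected to show a summary of the H2O cluster
print(h2o_context)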