H2O-3: Fast Scalable Machine Learning
H2O-3 is an open-source, in-memory, distributed machine learning platform implemented primarily in Java, with a Python client. It offers common machine learning algorithms including GLM, Gradient Boosting, Deep Learning, XGBoost, and Isolation Forest. The current version is 3.46.0.10. Releases are frequent, typically on a monthly or bi-monthly cadence.
Common errors
-
h2o.exceptions.H2OConnectionError: H2O connection broken!
cause The H2O cluster failed to start or disconnected unexpectedly, often because of a missing or incompatible JRE, insufficient memory, or a port conflict.
fix Ensure Java 8+ is installed and on PATH. Increase `max_mem_size` in `h2o.init()`. If starting multiple clusters, give each one a unique `port` rather than relying on the default 54321, e.g., `h2o.init(port=54323)`.
-
java.lang.OutOfMemoryError: Java heap space
cause The H2O JVM process ran out of heap memory while storing data or building a model. When `max_mem_size` is not set, the heap limit falls back to the JVM default (typically about a quarter of physical RAM), which is often insufficient.
fix Increase the maximum memory size for the H2O cluster during initialization: `h2o.init(max_mem_size='8G')` (adjust '8G' to a suitable value based on your system's available RAM).
-
AttributeError: 'pandas.core.frame.DataFrame' object has no attribute 'asfactor'
cause You are attempting to use an H2OFrame-specific method (like `.asfactor()`, `.split_frame()`, etc.) directly on a Pandas DataFrame.
fix Convert your Pandas DataFrame to an H2OFrame first: `h2o_frame = h2o.H2OFrame(pandas_df)`.
-
ModuleNotFoundError: No module named 'h2o'
cause The `h2o` Python package is not installed in your current Python environment.
fix Install the package using pip: `pip install h2o`.
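Several of the errors above can be anticipated before calling `h2o.init()`. The following is a minimal sketch using only the Python standard library; `port_is_free` and `preflight` are hypothetical helper names, not part of the `h2o` package, and the checks are heuristics rather than guarantees that the cluster will start.

```python
import importlib.util
import shutil
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing is currently bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def preflight(port=54321):
    """Collect likely reasons for h2o.init() to fail; an empty list means none were found."""
    problems = []
    # Corresponds to ModuleNotFoundError: No module named 'h2o'
    if importlib.util.find_spec("h2o") is None:
        problems.append("h2o package is not installed (pip install h2o)")
    # Corresponds to a missing JRE; this does not verify the Java version
    if shutil.which("java") is None:
        problems.append("no java executable found on PATH")
    # Corresponds to a port conflict
    if not port_is_free(port):
        problems.append(f"port {port} is already in use; pass a different port to h2o.init()")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("WARNING:", problem)
```

A busy port is not always a problem: it may belong to an H2O cluster that is already running, in which case the message is only a hint to check which process owns the port.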
Warnings
- gotcha H2O requires a Java Runtime Environment (Java 8 or higher is recommended) to run its backend cluster. Ensure Java is installed and its executable is on your system's PATH. Without a compatible JRE, `h2o.init()` cannot start the cluster.
- gotcha The H2O JVM process started by `h2o.init()` does not size its heap to your data. When `max_mem_size` is not set, the heap limit falls back to the JVM default (typically about a quarter of physical RAM). For larger datasets or complex models this is often insufficient and leads to `java.lang.OutOfMemoryError`, so allocate memory explicitly with `max_mem_size`.
- gotcha H2OFrames (`h2o.H2OFrame`) are distinct from Pandas DataFrames. Calling Pandas methods on an H2OFrame, or H2OFrame methods on a Pandas DataFrame, raises errors. Convert explicitly: `h2o.H2OFrame(pandas_df)` in one direction, `h2o_frame.as_data_frame()` in the other.
- gotcha When `h2o.init()` starts a local H2O cluster, it consumes system resources. Failing to shut the cluster down at the end of your H2O session with `h2o.cluster().shutdown()` (especially in scripts or notebooks) can leave lingering Java processes, leading to resource leaks or port conflicts. The older `h2o.shutdown()` spelling is deprecated in favor of `h2o.cluster().shutdown()`.
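To make the memory warning concrete, here is a minimal sketch that derives a `max_mem_size` string from the machine's physical RAM. The helper names `suggest_max_mem_size` and `total_ram_bytes` are hypothetical, the 50% default fraction is an arbitrary assumption rather than an H2O recommendation, and the `os.sysconf` lookup is only expected to work on Unix-like systems.

```python
import os


def suggest_max_mem_size(total_bytes, fraction=0.5):
    """Return a heap-size string such as '8G' suitable for h2o.init(max_mem_size=...)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    gib = int(total_bytes * fraction) // (1024 ** 3)
    # Never suggest less than 1G, even on very small machines
    return f"{max(gib, 1)}G"


def total_ram_bytes():
    """Best-effort physical RAM lookup; returns None where os.sysconf is unavailable."""
    try:
        return os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None


ram = total_ram_bytes()
# Fall back to a fixed value when RAM cannot be detected
heap = suggest_max_mem_size(ram) if ram else "4G"
print(heap)  # pass as h2o.init(max_mem_size=heap)
```

Leaving a share of RAM to the operating system and to the Python process matters because the Pandas copy of the data and the H2OFrame copy both occupy memory at the same time during conversion.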
Install
-
pip install h2o
Imports
- h2o
import h2o
- H2OFrame
from h2o import H2OFrame
Quickstart
import h2o
from h2o.estimators.gbm import H2OGradientBoostingEstimator
import pandas as pd
# Initialize H2O cluster (adjust max_mem_size based on your system's RAM and data size)
h2o.init(max_mem_size="4G", nthreads=-1)
# Create a sample Pandas DataFrame
data = {
    'feature1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'feature2': [10, 9, 8, 7, 6, 5, 4, 3, 2, 1],
    'target': [0, 1, 0, 1, 0, 1, 0, 1, 0, 1]
}
df_pandas = pd.DataFrame(data)
# Convert Pandas DataFrame to H2OFrame
df_h2o = h2o.H2OFrame(df_pandas)
# Convert target to factor for classification problems
df_h2o['target'] = df_h2o['target'].asfactor()
# Define predictors and response variables
predictors = ['feature1', 'feature2']
response = 'target'
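# Split data into training and testing sets (the split is randomized per row,
# so the 70/30 ratio is approximate, especially on a frame this small)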
train, test = df_h2o.split_frame(ratios=[0.7], seed=42)
# Build a Gradient Boosting Machine (GBM) model
gbm_model = H2OGradientBoostingEstimator(
    ntrees=50,
    max_depth=5,
    min_rows=1,  # the default of 10 is rejected on a training frame this small
    seed=42
)
gbm_model.train(x=predictors, y=response, training_frame=train)
# Make predictions on the test set
predictions = gbm_model.predict(test)
print("\nPredictions on test data (first 5 rows):\n")
print(predictions.head())
# Shut down the H2O cluster (crucial for resource management)
h2o.cluster().shutdown(prompt=False)
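If an exception is raised midway through a script like the one above, the final shutdown call is never reached and the Java process keeps running. One way to guard against that is a context manager, sketched below. `h2o_session` is a hypothetical helper, not part of the `h2o` package; its `start` and `stop` parameters are injection points added here for illustration and testing, and when they are omitted the sketch assumes `h2o.init` and `h2o.cluster().shutdown(prompt=False)`.

```python
from contextlib import contextmanager


@contextmanager
def h2o_session(start=None, stop=None, **init_kwargs):
    """Start a cluster, yield control, and always shut the cluster down afterwards."""
    if start is None or stop is None:
        # Imported lazily so this helper can be defined without H2O installed
        import h2o
        start = start or h2o.init
        stop = stop or (lambda: h2o.cluster().shutdown(prompt=False))
    start(**init_kwargs)
    try:
        yield
    finally:
        # Runs on normal exit and when the body raises
        stop()


# Demonstration with stand-in callables instead of a real cluster
events = []
try:
    with h2o_session(start=lambda **kw: events.append(("start", kw)),
                     stop=lambda: events.append("stop"),
                     max_mem_size="4G"):
        raise RuntimeError("training failed")
except RuntimeError:
    pass
print(events)
```

With H2O installed, the same helper would be used as `with h2o_session(max_mem_size="4G"): ...`. Note that if `h2o.init()` attaches to a cluster that was already running, this pattern shuts that cluster down as well, which may not be what other users of it expect.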