MLeap Python API
MLeap is a serialization format and runtime for machine learning pipelines. It lets you train models with Apache Spark, Scikit-learn, or XGBoost, then serialize them into a portable format that can be served in real time without Spark dependencies. The Python API, currently at version 0.24.0, provides tools for training, exporting, and running these pipelines. It supports Python 3.9+, Scala 2.13, Spark 4.0.1, and Java 17. Releases are semi-regular and are often driven by upstream library updates.
Warnings
- breaking: MLeap releases often track major upgrades of the underlying platforms (Java, Spark, XGBoost, TensorFlow). For example, v0.24.0 requires Java 17, Spark 4.0.1, and XGBoost 2.0.3, a significant change from prior versions (e.g., v0.22.0 supported Spark 3.3.0). Expect compatibility issues if your runtime environment does not match the versions MLeap was built against.
- gotcha: The MLeap runtime requires a Java Virtual Machine (JVM), and the Python API talks to it via Py4J. Common problems are a missing or incompatible JVM (v0.24.0 needs Java 17) and an incorrect `JAVA_HOME` or classpath, which surface as `NoClassDefFoundError`, `JVM not found`, or `Py4JError`.
- gotcha: Models exported with one version of the MLeap Python API might not load in an MLeap Scala/JVM runtime of a different version, and vice versa. The serialization format can change between releases, so mismatched versions can fail at load or prediction time with `UnsupportedBundleFileVersionException` or a similar error.
- gotcha: When exporting, MLeap needs a sample DataFrame (or similar structure) to infer the model's input schema. Passing incorrect or incomplete sample data at the `Bundle().writer` step can produce a bundle whose schema does not match the real inputs, so the model fails to load or predicts incorrectly at runtime.
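The JVM warning above can be turned into a quick pre-flight check. The following is a minimal sketch using only the Python standard library; it is not part of MLeap. The helper names `parse_java_major` and `detect_java_major` are hypothetical, and the required version of 17 is taken from the v0.24.0 requirement stated above.

```python
import os
import re
import shutil
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    major = int(match.group(1))
    # Legacy numbering: "1.8.0_292" means Java 8.
    if major == 1 and match.group(2) is not None:
        return int(match.group(2))
    return major


def detect_java_major() -> Optional[int]:
    """Return the major version of the java binary under JAVA_HOME, else on PATH."""
    java_home = os.environ.get("JAVA_HOME")
    candidate = os.path.join(java_home, "bin", "java") if java_home else None
    java = candidate if candidate and os.path.exists(candidate) else shutil.which("java")
    if java is None:
        return None
    # `java -version` usually writes to stderr, so both streams are inspected.
    result = subprocess.run([java, "-version"], capture_output=True, text=True)
    return parse_java_major(result.stderr + result.stdout)


if __name__ == "__main__":
    required = 17  # assumption: the Java version listed above for v0.24.0
    found = detect_java_major()
    if found is None:
        print("No java binary found; check JAVA_HOME and PATH.")
    elif found != required:
        print(f"Found Java {found}, but Java {required} is expected.")
    else:
        print(f"Java {found} found.")
```

Running a check like this before importing MLeap gives a clearer message than a `Py4JError` raised later from inside the pipeline code.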
Install
- pip install mleap
- pip install mleap[spark]
- pip install mleap[onnx]
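Because compatibility depends on the exact release, it can help to confirm which version pip actually resolved. A minimal sketch using the standard-library `importlib.metadata` module; the helper name `installed_version` is hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("mleap:", installed_version("mleap") or "not installed")
```

The same helper can be pointed at `pyspark` or `xgboost` to compare the environment against the versions listed in the warnings.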
Imports
- MLeapPipeline: `from mleap.sklearn.pipeline import MLeapPipeline`
- Bundle: `from mleap.bundle import Bundle`
- MLeapContext: `from mleap.runtime import MLeapContext`
Quickstart
import pandas as pd
import numpy as np
import os
import shutil
from sklearn.linear_model import LinearRegression
from mleap.sklearn.pipeline import MLeapPipeline
from mleap.bundle import Bundle
# 1. Prepare sample data
data = {
    'feature1': np.random.rand(10),
    'feature2': np.random.rand(10)
}
df = pd.DataFrame(data)
target = np.random.rand(10)
# 2. Train a scikit-learn model
model_sklearn = LinearRegression()
model_sklearn.fit(df[['feature1', 'feature2']], target)
# 3. Wrap the scikit-learn model in an MLeapPipeline
mleap_pipeline = MLeapPipeline([
    ('lr', model_sklearn)
])
# 4. Define export path and clean up previous exports
bundle_path = "/tmp/my_linear_regression_mleap.zip"
model_name = "linear_regression_mleap_model"
if os.path.exists(bundle_path):
    os.remove(bundle_path)
if os.path.exists(f"/tmp/{model_name}"):
    shutil.rmtree(f"/tmp/{model_name}")
# 5. Export the MLeap pipeline to a bundle file
with Bundle().writer(mleap_pipeline, df[['feature1', 'feature2']], name=model_name) as writer:
    writer.serialize_to_zip(bundle_path)
print(f"MLeap model exported to: {bundle_path}")
# 6. Load the MLeap bundle back into memory
loaded_bundle = Bundle.load_model(bundle_path)
# 7. Make predictions with the loaded model
test_data = pd.DataFrame([[0.1, 0.9]], columns=['feature1', 'feature2'])
predictions = loaded_bundle.predict(test_data)
print(f"Predictions: {predictions}")
# 8. Clean up created files (optional)
if os.path.exists(bundle_path):
    os.remove(bundle_path)
if os.path.exists(f"/tmp/{model_name}"):
    shutil.rmtree(f"/tmp/{model_name}")
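
To guard against the version-mismatch problem described in the warnings, an exported bundle can be inspected before it is handed to a runtime. The sketch below assumes the bundle is a zip archive with a `bundle.json` file at its root that carries a `version` field; verify that layout against a bundle produced by your own MLeap version. The helper name `read_bundle_version` is hypothetical, and only the standard library is used.

```python
import json
import zipfile
from typing import Optional


def read_bundle_version(bundle_zip: str) -> Optional[str]:
    """Return the `version` field of a root-level bundle.json, or None if absent."""
    with zipfile.ZipFile(bundle_zip) as archive:
        # Assumption: bundle metadata lives in bundle.json at the archive root.
        if "bundle.json" not in archive.namelist():
            return None
        with archive.open("bundle.json") as fh:
            meta = json.load(fh)
    return meta.get("version")
```

A deployment script could compare the returned string with the version of the serving runtime and refuse to deploy on a mismatch, rather than waiting for the runtime to fail when it loads the bundle.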