Synapse Machine Learning (SynapseML)
SynapseML (formerly MMLSpark) is an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines on Apache Spark. It provides simple, composable, and distributed APIs for ML tasks such as text analytics, computer vision, anomaly detection, and deep learning, and it integrates with Azure AI services and OpenAI for building large-scale intelligent systems. Currently at version 1.1.3, the project maintains an active release cadence with frequent updates.
Warnings
- breaking SynapseML was formerly known as MMLSpark. Major package and namespace changes occurred during this renaming, requiring updates to import statements (e.g., `mmlspark.foo` became `synapse.ml.foo`).
- gotcha SynapseML has specific Apache Spark and Python version requirements. For example, SynapseML v1.1.3 typically requires Spark 3.4+ and Python 3.8+. Using incompatible versions can lead to installation failures or runtime errors.
- gotcha LightGBM training via SynapseML can be unstable on Spark clusters with dynamic resource allocation (e.g., autoscaling). Changes in executors during data processing can cause the training to hang or fail, as LightGBM's native distributed mode does not gracefully handle such networking changes.
- gotcha When training LightGBM with large datasets, especially in 'bulk execution mode' (default), users may encounter Java `OutOfMemoryError` (OOM) exceptions. This can occur even with seemingly sufficient executor memory, indicating potential issues with Spark's `spark.driver.maxResultSize` or the way data is transferred.
- deprecated On Microsoft Fabric, SynapseML comes preinstalled; installing or changing SynapseML versions with the `%%configure -f` magic command in notebooks is not officially supported, carries no service-level agreement, and is not guaranteed to remain compatible with future official releases.
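For the LightGBM warnings above (hangs under autoscaling, OOM in bulk execution mode), a common mitigation is to pin the executor set and raise the driver's result-size cap before training. A minimal `spark-defaults.conf` sketch; the memory values are illustrative assumptions to tune for your cluster, not recommendations from the SynapseML docs:

```
# Disable autoscaling so the executor set stays fixed during
# LightGBM's native distributed training
spark.dynamicAllocation.enabled    false
# Raise the driver result-size cap that bulk-mode data transfer can hit
spark.driver.maxResultSize         4g
# Give executors headroom for LightGBM's native (off-heap) memory
spark.executor.memory              8g
spark.executor.memoryOverhead      2g
```

SynapseML's LightGBM learners also expose a barrier execution mode (e.g., `setUseBarrierExecutionMode(True)`), which can make executor coordination more robust on some clusters; check the LightGBM docs for your SynapseML version before relying on it.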
Install
- `pip install synapseml pyspark`
- `pyspark --packages com.microsoft.azure:synapseml_2.12:1.1.3 --repositories https://mmlspark.azureedge.net/maven`
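Given the version constraints noted in the warnings (roughly Spark 3.4+ and Python 3.8+ for v1.1.3), a quick preflight check can surface a mismatch before it turns into a confusing install or runtime failure. A minimal sketch using only the standard library; the function name and message wording are my own, and the `pyspark` probe only runs where `pyspark` is installed:

```python
import sys

def check_synapseml_prereqs(min_python=(3, 8), min_spark=(3, 4)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    if sys.version_info[:2] < min_python:
        problems.append(
            f"Python {sys.version_info.major}.{sys.version_info.minor} < "
            f"required {min_python[0]}.{min_python[1]}"
        )
    try:
        import pyspark  # optional probe: only checked when installed
        spark_ver = tuple(int(p) for p in pyspark.__version__.split(".")[:2])
        if spark_ver < min_spark:
            problems.append(
                f"pyspark {pyspark.__version__} < "
                f"required {min_spark[0]}.{min_spark[1]}"
            )
    except ImportError:
        problems.append("pyspark is not installed")
    return problems

if __name__ == "__main__":
    for p in check_synapseml_prereqs():
        print("WARNING:", p)
```

Run it once in the target environment before installing the SynapseML package itself.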
Imports
- SparkSession
from pyspark.sql import SparkSession
- OpenAIPrompt
from synapse.ml.services.openai import OpenAIPrompt
- LightGBMClassifier
from synapse.ml.lightgbm import LightGBMClassifier
- TextFeaturizer
from synapse.ml.featurize.text import TextFeaturizer
- AnalyzeText
from synapse.ml.services.language import AnalyzeText
- find_secret
from synapse.ml.core.platform import find_secret
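Inside Synapse/Fabric environments, `find_secret` resolves credentials from a key vault; for local runs, the Quickstart below falls back to environment variables. A small stdlib helper sketching that fallback pattern; the function name, placeholder, and warning text are my own, not part of SynapseML:

```python
import os

def resolve_secret(env_var, placeholder="MISSING"):
    """Fetch a credential from the environment, flagging an unset value.

    Mirrors the Quickstart's os.environ.get(...) fallback: returns the
    placeholder when the variable is unset, so callers can skip API calls
    instead of sending a bogus key.
    """
    value = os.environ.get(env_var, placeholder)
    if value == placeholder:
        print(f"WARNING: {env_var} is not set; skipping calls that need it.")
    return value
```

In a hosted notebook you would instead call `find_secret` with your key vault's secret name; locally, export `OPENAI_API_KEY` before running the Quickstart.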
Quickstart
import os
from pyspark.sql import SparkSession
from synapse.ml.services.openai import OpenAIPrompt
# Initialize Spark Session with SynapseML package
spark = (
    SparkSession.builder
    .appName("SynapseML_OpenAI_Quickstart")
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.1.3")
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
    .getOrCreate()
)

# Prepare sample data
df = spark.createDataFrame([
    ("Explain quantum computing in simple terms.",),
    ("What are the benefits of exercise?",),
    ("Describe the water cycle.",)
]).toDF("prompt")

# Configure Azure OpenAI service details.
# Replace with your actual deployment name and API key.
# For local testing, set the OPENAI_API_KEY environment variable.
# In a Synapse/Databricks environment, use secret management (e.g., find_secret).
openai_api_key = os.environ.get('OPENAI_API_KEY', 'YOUR_OPENAI_API_KEY')
openai_deployment_name = os.environ.get('OPENAI_DEPLOYMENT_NAME', 'gpt-4.1')

if openai_api_key == 'YOUR_OPENAI_API_KEY':
    print("WARNING: Please set the OPENAI_API_KEY environment variable or replace 'YOUR_OPENAI_API_KEY' with your actual key.")
    print("Skipping OpenAI interaction due to missing API key.")
else:
    # Configure OpenAIPrompt for chat completions
    prompt_completion = (
        OpenAIPrompt()
        .setSubscriptionKey(openai_api_key)  # subscriptionKey carries the Azure OpenAI API key
        .setDeploymentName(openai_deployment_name)
        .setApiType("chat_completions")
        .setPromptCol("prompt")
        .setUsageCol("usage")
        .setOutputCol("completions")
    )
    # Transform and display results
    result_df = prompt_completion.transform(df.repartition(1)).select(
        "prompt", "completions.choices.message.content", "usage"
    )
    result_df.show(truncate=False)

# Stop Spark session
spark.stop()