Synapse Machine Learning (SynapseML)

1.1.3 · active · verified Sat Apr 11

SynapseML (formerly MMLSpark) is an open-source library for building scalable machine learning (ML) pipelines on Apache Spark. It provides simple, composable, distributed APIs for ML tasks such as text analytics, computer vision, anomaly detection, and deep learning. SynapseML also integrates with Azure AI services and OpenAI, so these services can be applied to large datasets from within Spark. The current version is 1.1.3, and the project is actively maintained with frequent releases.

Warnings

Install

Imports

Quickstart

This quickstart shows how to start a Spark session with SynapseML, create a DataFrame of sample prompts, and use the `OpenAIPrompt` transformer to request chat completions from an Azure OpenAI service. The API key and deployment name are placeholders; in production, supply them through environment variables or a secret management system rather than hard-coding them.

import os
from pyspark.sql import SparkSession
from synapse.ml.services.openai import OpenAIPrompt

# Initialize Spark Session with SynapseML package
spark = SparkSession.builder \
    .appName("SynapseML_OpenAI_Quickstart") \
    .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.1.3") \
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
    .getOrCreate()

# Prepare sample data
df = spark.createDataFrame([
    ("Explain quantum computing in simple terms.",),
    ("What are the benefits of exercise?",),
    ("Describe the water cycle.",)
]).toDF("prompt")

# Configure Azure OpenAI service details.
# Replace the placeholders with your own resource name, deployment name, and API key.
# For local testing, set the OPENAI_API_KEY environment variable.
# In a Synapse/Databricks environment, use secret management (e.g., find_secret).
openai_api_key = os.environ.get('OPENAI_API_KEY', 'YOUR_OPENAI_API_KEY')
openai_deployment_name = os.environ.get('OPENAI_DEPLOYMENT_NAME', 'gpt-4.1')
# Assumption: OPENAI_SERVICE_NAME is an illustrative variable name for the Azure
# OpenAI resource name (the <name> part of https://<name>.openai.azure.com/).
openai_service_name = os.environ.get('OPENAI_SERVICE_NAME', 'YOUR_OPENAI_SERVICE_NAME')

if openai_api_key == 'YOUR_OPENAI_API_KEY' or openai_service_name == 'YOUR_OPENAI_SERVICE_NAME':
    print("WARNING: Please set the OPENAI_API_KEY and OPENAI_SERVICE_NAME environment variables or replace the placeholders with your actual values.")
    print("Skipping OpenAI interaction due to missing configuration.")
else:
    # Configure OpenAIPrompt for chat completions
    prompt_completion = (
        OpenAIPrompt()
        .setSubscriptionKey(openai_api_key)  # subscriptionKey carries the Azure OpenAI API key
        .setCustomServiceName(openai_service_name)  # Azure resource that hosts the deployment
        .setDeploymentName(openai_deployment_name)
        .setApiType("chat_completions")
        .setPromptCol("prompt")
        .setUsageCol("usage")
        .setOutputCol("completions")
    )

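    # Transform and display results.
    # The output column typically holds OpenAIPrompt's post-processed response,
    # so select the column itself rather than a nested choices.message.content path.
    result_df = prompt_completion.transform(df.repartition(1)).select("prompt", "completions", "usage")
    result_df.show(truncate=False)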

# Stop Spark session
spark.stop()
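
The quickstart falls back to a placeholder string when the API key is missing and then compares against that placeholder. A stricter alternative is to fail early with a clear error. The following is a minimal sketch of that idea, not part of SynapseML: `MissingSettingError`, `require_env`, and `load_openai_settings` are hypothetical helper names introduced here for illustration, and the environment variable names and the `gpt-4.1` default are taken from the quickstart.

```python
import os


class MissingSettingError(RuntimeError):
    """Raised when a required setting is absent from the environment."""


def require_env(name, env=None):
    """Return the value of `name` from the environment, or raise if unset or blank."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise MissingSettingError(f"Environment variable {name} is not set")
    return value


def load_openai_settings(env=None):
    """Collect the settings the quickstart reads.

    The API key is required; the deployment name falls back to the
    quickstart's default of 'gpt-4.1'.
    """
    env = os.environ if env is None else env
    return {
        "api_key": require_env("OPENAI_API_KEY", env),
        "deployment_name": env.get("OPENAI_DEPLOYMENT_NAME", "gpt-4.1"),
    }
```

Calling `load_openai_settings()` before building the transformer would let a job stop immediately when the key is absent, and passing a plain dictionary as `env` makes the helper easy to exercise without touching the real environment. In a managed Spark environment, a secret store lookup could replace `require_env` while keeping the same return shape.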
