{"id":3289,"library":"synapseml","title":"Synapse Machine Learning (SynapseML)","description":"SynapseML (formerly MMLSpark) is an open-source library that simplifies the creation of massively scalable machine learning (ML) pipelines on Apache Spark. It provides simple, composable, and distributed APIs for various ML tasks such as text analytics, computer vision, anomaly detection, and deep learning. SynapseML seamlessly integrates with Azure AI services and OpenAI, allowing for large-scale intelligent systems. Currently at version 1.1.3, it maintains an active release cadence with frequent updates.","status":"active","version":"1.1.3","language":"en","source_language":"en","source_url":"https://github.com/Microsoft/SynapseML","tags":["machine learning","spark","distributed computing","ai","azure","microsoft","deep learning","big data"],"install":[{"cmd":"pip install synapseml pyspark","lang":"bash","label":"For local Python environments"},{"cmd":"pyspark --packages com.microsoft.azure:synapseml_2.12:1.1.3 --repositories https://mmlspark.azureedge.net/maven","lang":"bash","label":"For Spark environments (e.g., spark-submit, pyspark shell)"}],"dependencies":[{"reason":"SynapseML is built on Apache Spark and requires pyspark for Python environments.","package":"pyspark","optional":false}],"imports":[{"symbol":"SparkSession","correct":"from pyspark.sql import SparkSession"},{"symbol":"OpenAIPrompt","correct":"from synapse.ml.services.openai import OpenAIPrompt"},{"symbol":"LightGBMClassifier","correct":"from synapse.ml.lightgbm import LightGBMClassifier"},{"symbol":"TextFeaturizer","correct":"from synapse.ml.featurize.text import TextFeaturizer"},{"symbol":"AnalyzeText","correct":"from synapse.ml.services.language import AnalyzeText"},{"note":"SynapseML was formerly MMLSpark; namespaces have changed.","wrong":"import mmlspark.core.platform.find_secret","symbol":"find_secret","correct":"from synapse.ml.core.platform import find_secret"}],"quickstart":{"code":"import 
os\nfrom pyspark.sql import SparkSession\nfrom synapse.ml.services.openai import OpenAIPrompt\n\n# Initialize Spark Session with SynapseML package\nspark = SparkSession.builder \\\n    .appName(\"SynapseML_OpenAI_Quickstart\") \\\n    .config(\"spark.jars.packages\", \"com.microsoft.azure:synapseml_2.12:1.1.3\") \\\n    .config(\"spark.jars.repositories\", \"https://mmlspark.azureedge.net/maven\") \\\n    .getOrCreate()\n\n# Prepare sample data\ndf = spark.createDataFrame([\n    (\"Explain quantum computing in simple terms.\",),\n    (\"What are the benefits of exercise?\",),\n    (\"Describe the water cycle.\",)\n]).toDF(\"prompt\")\n\n# Configure Azure OpenAI service details\n# Replace with your actual resource name, deployment name, and API key.\n# For local testing, set the OPENAI_API_KEY environment variable.\n# In a Synapse/Databricks environment, use secret management (e.g., find_secret)\nopenai_api_key = os.environ.get('OPENAI_API_KEY', 'YOUR_OPENAI_API_KEY')\nopenai_deployment_name = os.environ.get('OPENAI_DEPLOYMENT_NAME', 'gpt-4.1')\n# Assumption: OPENAI_SERVICE_NAME is an illustrative variable name for the Azure OpenAI resource name\nopenai_service_name = os.environ.get('OPENAI_SERVICE_NAME', 'YOUR_OPENAI_SERVICE_NAME')\n\nif openai_api_key == 'YOUR_OPENAI_API_KEY' or openai_service_name == 'YOUR_OPENAI_SERVICE_NAME':\n    print(\"WARNING: Please set the OPENAI_API_KEY and OPENAI_SERVICE_NAME environment variables or replace the placeholders with your actual values.\")\n    print(\"Skipping OpenAI interaction due to missing configuration.\")\nelse:\n    # Configure OpenAIPrompt for chat completions\n    prompt_completion = (\n        OpenAIPrompt()\n        .setSubscriptionKey(openai_api_key)  # Use subscriptionKey for Azure OpenAI API Key\n        .setCustomServiceName(openai_service_name)  # Azure OpenAI resource name; determines the endpoint\n        .setDeploymentName(openai_deployment_name)\n        .setApiType(\"chat_completions\")\n        .setPromptCol(\"prompt\")\n        .setUsageCol(\"usage\")\n        .setOutputCol(\"completions\")\n    )\n\n    # Transform and display results. The output column is selected as a whole\n    # because its exact shape can vary with the SynapseML version.\n    result_df = prompt_completion.transform(df.repartition(1)).select(\"prompt\", \"completions\", \"usage\")\n    result_df.show(truncate=False)\n\n# Stop Spark session\nspark.stop()\n","lang":"python","description":"This quickstart 
demonstrates how to set up a Spark session with SynapseML, create a DataFrame with sample prompts, and use the `OpenAIPrompt` transformer to interact with an Azure OpenAI service for chat completions. It includes placeholders for API key and deployment name, which should be configured via environment variables or a secure secret management system in production."},"warnings":[{"fix":"Update all import statements from `mmlspark.xyz` to `synapse.ml.xyz`.","message":"SynapseML was formerly known as MMLSpark. Major package and namespace changes occurred during this renaming, requiring updates to import statements (e.g., `mmlspark.foo` became `synapse.ml.foo`).","severity":"breaking","affected_versions":"<=0.18 (MMLSpark) to >=1.0.0 (SynapseML)"},{"fix":"Always consult the official SynapseML documentation or GitHub README for the exact Spark and Python version compatibility matrix for your chosen SynapseML version.","message":"SynapseML has specific Apache Spark and Python version requirements. For example, SynapseML v1.1.3 typically requires Spark 3.4+ and Python 3.8+. Using incompatible versions can lead to installation failures or runtime errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"For critical LightGBM training jobs, consider disabling cluster autoscaling, setting a fixed (smaller) number of executors, or increasing Spark's executor heartbeat interval and network timeouts. Splitting data into smaller batches with `numBatches` can also improve reliability at the cost of increased total processing time.","message":"LightGBM training via SynapseML can be unstable on Spark clusters with dynamic resource allocation (e.g., autoscaling). 
Changes in executors during data processing can cause the training to hang or fail, as LightGBM's native distributed mode does not gracefully handle such networking changes.","severity":"gotcha","affected_versions":"All versions using LightGBM"},{"fix":"Increase `spark.driver.maxResultSize` and executor memory, and consider adjusting the `numBatches` parameter in LightGBM, although this might impact performance. Monitor Spark UI metrics to understand actual memory utilization.","message":"When training LightGBM with large datasets, especially in 'bulk execution mode' (default), users may encounter Java `OutOfMemoryError` (OOM) exceptions. This can occur even with seemingly sufficient executor memory, which points to Spark's `spark.driver.maxResultSize` or the way data is transferred rather than executor memory alone.","severity":"gotcha","affected_versions":"All versions using LightGBM with large datasets"},{"fix":"For production workloads on Fabric, rely on the preinstalled SynapseML version or consult official Microsoft Fabric documentation for supported methods of library management. Use `%%configure` at your own risk for experimental purposes.","message":"SynapseML is preinstalled on Microsoft Fabric. Installing or changing the SynapseML version with the `%%configure -f` magic command in notebooks is not officially supported: it is not covered by a service-level agreement and is not guaranteed to remain compatible with future official releases.","severity":"deprecated","affected_versions":"All versions on Microsoft Fabric"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}