Azure Synapse Spark Client Library
The Azure Synapse Spark Client Library for Python provides programmatic access to Azure Synapse Analytics Spark pools: submitting Spark batch jobs, managing Spark sessions, and interacting with the underlying Livy endpoints. As of version 0.7.0 it is part of the larger Azure SDK for Python ecosystem, and updates typically track new Synapse service features or general Azure SDK releases.
Common errors
- ModuleNotFoundError: No module named 'com.microsoft.spark.sqlanalytics'
  cause: This error often occurs when trying to import the `com.microsoft.spark.sqlanalytics` connector in a Synapse Spark notebook. It usually means the JAR for the Synapse SQL Pool connector is not correctly linked or installed, or that the import path is wrong for the Python context. The connector is a Java/Scala library, so a direct Python `import` of a `com.microsoft` package will not work without proper configuration, or a different API is intended. Note that the `azure-synapse-spark` client library covers job submission, not in-notebook Spark connector imports of this kind.
  fix: To interact with Azure Synapse SQL Pools from PySpark, you generally do not import `com.microsoft.spark.sqlanalytics` as a Python module. Instead, use Spark's `read` and `write` APIs with the format and options provided by the connector, and make sure the Synapse SQL Pool connector JAR is attached to your Spark pool. When writing to a SQL Pool, the `df.write.synapsesql` method is typically used without any `import com.microsoft.spark.sqlanalytics` statement in Python; see the sketch below.
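For illustration, here is a minimal PySpark sketch of that write path. It assumes a Synapse Spark 3.x runtime where the Dedicated SQL Pool connector is pre-installed; `mydb.dbo.mytable` is a placeholder target:

```python
# Runs inside a Synapse notebook, where `spark` is the pre-created SparkSession
# and the SQL Pool connector ships with the runtime; no Python import of
# com.microsoft.spark.sqlanalytics is needed.
df = spark.createDataFrame([(1, "alpha"), (2, "beta")], ["id", "label"])

# Hypothetical target table: <database>.<schema>.<table> in the dedicated SQL pool.
df.write.mode("overwrite").synapsesql("mydb.dbo.mytable")
```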
- Livy session has failed. Session state: Error. Error code: AVAILABLE_WORKSPACE_CAPACITY_EXCEEDED
  cause: This error indicates that the Spark pool has run out of available vCores or other resources, preventing a new Livy session from being created or an existing one from running. It commonly happens when multiple users run jobs on the same Spark pool concurrently, or when the resources requested by a job exceed the allocated quota.
  fix: Reduce the number of vCores requested by your Spark job, or increase the vCore quota for your Azure Synapse workspace. If multiple users share a pool, consider configuring session-level resource allocation or creating separate Spark pools for different workloads. Also check whether other active sessions are holding resources unnecessarily and terminate them if possible. A sketch of submitting a deliberately small job with this library follows below.
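One way to shrink a job's footprint is to request smaller driver and executor sizes at submission time. Below is a hedged sketch using this library's `SparkBatchJobOptions`; the endpoint, pool name, and file path are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.synapse.spark import SparkClient
from azure.synapse.spark.models import SparkBatchJobOptions

client = SparkClient(
    credential=DefaultAzureCredential(),
    endpoint="https://<workspace>.dev.azuresynapse.net",  # placeholder
    spark_pool_name="<pool>",  # placeholder
)

# Request a deliberately small footprint so the job fits the pool's free vCores.
options = SparkBatchJobOptions(
    name="small-footprint-job",
    file="abfss://<container>@<account>.dfs.core.windows.net/jobs/job.py",  # placeholder
    driver_memory="4g",
    driver_cores=2,
    executor_memory="4g",
    executor_cores=2,
    executor_count=2,
)
job = client.spark_batch.create_spark_batch_job(options)
print(job.id, job.state)
```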
- Py4JJavaError: An error occurred while calling oXXXX.load. : org.apache.hadoop.security.AccessControlException: GETFILESTATUS failed with error 0x83090aa2 (Forbidden. ACL verification failed. Either the resource does not exist or the user is not authorized to perform the requested operation.)
  cause: This `Py4JJavaError` typically occurs when a Spark job tries to access data in Azure Data Lake Storage Gen2 (ADLS Gen2) without the necessary permissions (e.g., the 'Storage Blob Data Contributor' or 'Storage Blob Data Owner' RBAC role) on the storage account. It is especially common when notebooks run inside a pipeline, because the pipeline's Managed Identity (or Service Principal) needs the permissions, not just the user's interactive login.
  fix: Grant the appropriate Role-Based Access Control (RBAC) permissions (at least 'Storage Blob Data Contributor') to the Managed Identity of your Synapse workspace, or to the user/Service Principal running the Spark job, on the ADLS Gen2 storage account(s) being accessed. Apply the permissions at the correct scope (container or file system root) and make sure all upstream folders have 'Execute' permissions. The sketch below shows the kind of access that triggers this check.
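For context, a minimal PySpark read over an `abfss://` path (account, container, and path below are placeholders). Whichever identity executes it, your user interactively or the workspace Managed Identity in a pipeline, needs the RBAC grant described above:

```python
# Placeholder account/container/path. The executing identity needs at least
# 'Storage Blob Data Reader' for reads ('Storage Blob Data Contributor' for
# writes) on the storage account or container, plus Execute on parent folders.
path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/data/input.parquet"
df = spark.read.parquet(path)
df.show(5)
```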
- ModuleNotFoundError: No module named 'pyspark.errors'
  cause: This error primarily occurs in Azure Synapse Notebooks when a user installs or explicitly imports the `delta-spark` package, which conflicts with the Delta Lake integration already built into Synapse's Spark runtime. Synapse ships a pre-configured Delta Lake environment, and installing `delta-spark` can cause dependency conflicts or incorrect path resolution for `pyspark.errors`.
  fix: Instead of installing `delta-spark`, use the native Delta Lake support in Azure Synapse. The `delta.tables` module and other Delta Lake functionality are available directly without any `pip install`; `from delta.tables import DeltaTable` should work out of the box. If you have already installed `delta-spark`, remove it or restart the Spark pool and avoid reinstalling it. A short example of the native path follows below.
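A small sketch of that native path, runnable in a Synapse notebook without any `pip install` (the table path is a placeholder):

```python
from delta.tables import DeltaTable

# Placeholder ADLS Gen2 path for the Delta table.
table_path = "abfss://mycontainer@mystorageaccount.dfs.core.windows.net/delta/events"

# Write a small DataFrame as Delta using the runtime's built-in support...
df = spark.createDataFrame([(1, "open"), (2, "close")], ["id", "event"])
df.write.format("delta").mode("overwrite").save(table_path)

# ...then load it back through the DeltaTable API.
delta_table = DeltaTable.forPath(spark, table_path)
delta_table.toDF().show()
```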
Warnings
- breaking As a library in an early preview version (0.x.x), `azure-synapse-spark` may introduce breaking changes in minor version updates. Always review release notes when upgrading.
- gotcha `SparkClient` binds to a single Spark pool at construction time: `spark_pool_name` is a required constructor argument alongside `endpoint` and `credential`, and operations such as `get_spark_batch_jobs` then run against that pool. The workspace itself is identified by the `endpoint` URL. To work with multiple pools, create one client per pool.
- gotcha Authentication via `DefaultAzureCredential` relies on one of several sources: environment variables (e.g., `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`, `AZURE_TENANT_ID`), an Azure CLI login (`az login`), or other Azure Identity credential sources. Without one of these set up, authentication will fail.
- gotcha `azure-synapse-spark` depends on `six`, but some environments do not pull it in automatically (for example, when transitive dependencies are not resolved correctly, or in minimal Python environments), which surfaces as a `ModuleNotFoundError` at import time. If you hit this, `pip install six` resolves it.
Install
- pip install azure-synapse-spark azure-identity
Imports
- SparkClient
from azure.synapse.spark import SparkClient
- DefaultAzureCredential
from azure.identity import DefaultAzureCredential
Quickstart
```python
import os

from azure.identity import DefaultAzureCredential
from azure.synapse.spark import SparkClient

# Replace with your Synapse workspace name and a Spark pool name
synapse_workspace_name = os.environ.get("SYNAPSE_WORKSPACE_NAME", "your_synapse_workspace_name")
spark_pool_name = os.environ.get("SYNAPSE_SPARK_POOL_NAME", "your_spark_pool_name")
endpoint = f"https://{synapse_workspace_name}.dev.azuresynapse.net"

if synapse_workspace_name == "your_synapse_workspace_name" or spark_pool_name == "your_spark_pool_name":
    print("Please set the SYNAPSE_WORKSPACE_NAME and SYNAPSE_SPARK_POOL_NAME environment variables "
          "or replace the placeholder values in the code.")
else:
    try:
        # Obtain a credential from Azure Identity. Ensure you're logged in via
        # Azure CLI/VS Code, or that the service principal environment variables are set.
        credential = DefaultAzureCredential()

        # Create a SparkClient. The workspace is identified by the endpoint URL,
        # and the Spark pool is bound at construction time.
        spark_client = SparkClient(
            credential=credential,
            endpoint=endpoint,
            spark_pool_name=spark_pool_name,
        )

        # List Spark batch jobs in the pool (example operation)
        print(f"Listing Spark batch jobs for Spark pool '{spark_pool_name}' "
              f"in workspace '{synapse_workspace_name}'...")
        batch_jobs = spark_client.spark_batch.get_spark_batch_jobs()

        print(f"Found {batch_jobs.total} Spark batch jobs:")
        for job in batch_jobs.sessions or []:
            print(f"  - Job ID: {job.id}, Name: {job.name}, State: {job.state}")
    except Exception as e:
        print(f"Error interacting with Azure Synapse Spark: {e}")
        print("Ensure your Azure credentials are set up and that you have permissions "
              "to the Synapse workspace and Spark pool.")
```