Delta Lake Python APIs for Apache Spark
delta-spark 4.1.0 · verified 2026-05-12 · python install: verified · quickstart: stale
delta-spark provides Python APIs to interact with Delta Lake tables using Apache Spark. It enables operations like reading, writing, and time-traveling Delta tables, leveraging Spark's distributed processing capabilities. The library maintains a rapid release cadence, often releasing multiple patch and minor versions for each major iteration. The current version is 4.1.0.
pip install delta-spark
Common errors
error ModuleNotFoundError: No module named 'delta' ↓
cause This error typically occurs when the `delta-spark` Python package is not installed in the environment where your PySpark application is running, or when the Delta Lake JARs are not correctly linked with your Spark session, preventing the Python wrapper from finding the necessary Delta modules.
fix
Ensure delta-spark is installed via pip install delta-spark. Additionally, your SparkSession must be configured to use Delta Lake. For local PySpark, configure the SparkSession with the appropriate Delta Lake packages and extensions:
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
builder = SparkSession.builder \
.appName("DeltaSparkApp") \
.master("local[*]") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
spark = configure_spark_with_delta_pip(builder).getOrCreate()
# Now you can import from delta.tables
from delta.tables import DeltaTable
For production clusters (e.g., Databricks, EMR, Synapse), ensure the Delta Lake runtime/library is attached to your cluster, and avoid pip install delta-spark if a native version is provided, as it can cause conflicts.
error Py4JJavaError: An error occurred while calling oXX.save. : java.lang.ClassNotFoundException: Failed to find data source: delta ↓
cause This Java exception indicates that the underlying Spark JVM cannot find the Delta Lake connector classes because the required Delta Lake JARs are not included in Spark's classpath or are not correctly configured when the SparkSession is initialized.
fix
Ensure your SparkSession is configured with the correct Delta Lake packages. When submitting Spark jobs, use the --packages option with spark-submit. For example, for delta-spark 4.1.0, which targets Spark 4.x with Scala 2.13 (the Maven artifact was renamed from delta-core to delta-spark in Delta Lake 3.0):
spark-submit \
--packages io.delta:delta-spark_2.13:4.1.0 \
--conf "spark.sql.extensions=io.delta.sql.DeltaSparkSessionExtension" \
--conf "spark.sql.catalog.spark_catalog=org.apache.spark.sql.delta.catalog.DeltaCatalog" \
your_script.py
If creating a SparkSession programmatically, add these configurations:
spark = SparkSession.builder \
.appName("DeltaApp") \
.config("spark.jars.packages", "io.delta:delta-core_2.12:4.1.0") \
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
.getOrCreate()
error pyspark.sql.utils.AnalysisException: `path/table` is not a Delta table. ↓
cause This error occurs when Spark attempts to perform a Delta Lake-specific operation (like reading a Delta table or calling `DeltaTable.forPath()`) on a path or table that does not contain a valid Delta transaction log (`_delta_log` directory) or when the SparkSession is not properly configured to recognize Delta tables.
fix
First, ensure the path you are providing points to an actual Delta table (a directory containing _delta_log). Second, verify that your SparkSession is correctly configured with the Delta Lake extensions and catalog. Refer to the fix for java.lang.ClassNotFoundException to ensure these configurations are in place. If using DeltaTable.createIfNotExists(), ensure absolute paths are used, especially in local environments.
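A quick programmatic check (the path is illustrative; assumes a Delta-configured SparkSession named spark):
from delta.tables import DeltaTable
path = "/tmp/delta/events"  # hypothetical location
if DeltaTable.isDeltaTable(spark, path):
    spark.read.format("delta").load(path).show()
else:
    print(f"{path} has no _delta_log; write it first with df.write.format('delta').save(path)")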
error ModuleNotFoundError: No module named 'pyspark.errors' ↓
cause This specific `ModuleNotFoundError` is commonly encountered in environments like Azure Synapse Notebooks when the `delta-spark` package is installed via `pip`. It indicates a conflict between the pip-installed package and the Delta Lake and PySpark libraries pre-installed on such platforms: the pip-installed `delta-spark` tries to import `pyspark.errors`, which may not be exposed or structured the same way by the platform's native PySpark distribution.
fix
In Azure Synapse Notebooks (and potentially other managed Spark environments), avoid explicitly installing delta-spark via %pip install delta-spark. Instead, the delta.tables module is usually available directly from the pre-configured Spark environment. Remove the pip install delta-spark command and use from delta.tables import DeltaTable directly, as in the sketch below.
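A minimal sketch for a notebook cell in such a managed runtime (the ABFS path is hypothetical):
from delta.tables import DeltaTable
# No %pip install needed; the module ships with the runtime's Spark distribution.
dt = DeltaTable.forPath(spark, "abfss://container@account.dfs.core.windows.net/delta/events")
dt.toDF().show()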
error AttributeError: 'DeltaMergeBuilder' object has no attribute 'withSchemaEvolution' ↓
cause This error typically indicates a version mismatch, where you are attempting to use a feature (like `withSchemaEvolution`) that is available in a newer version of Delta Lake, but your current `delta-spark` library or the underlying Databricks Runtime/Spark environment is an older version that does not support it.
fix
Ensure that your delta-spark and PySpark versions are compatible and that the Delta Lake feature you are trying to use is supported by your runtime. The withSchemaEvolution() method became available in Databricks Runtime 16.0+. Either upgrade your delta-spark package and Spark runtime to a version that supports it, or use an alternative method for schema evolution, such as setting the Spark configuration spark.databricks.delta.schema.autoMerge.enabled to true or using .option("mergeSchema", "true") on DataFrame writes; see the sketch below.
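A minimal sketch of both alternatives, assuming a Delta-configured SparkSession named spark; the table path and upsert data are illustrative:
from delta.tables import DeltaTable
# Option A: enable automatic schema evolution for MERGE via the Spark conf.
spark.conf.set("spark.databricks.delta.schema.autoMerge.enabled", "true")
target = DeltaTable.forPath(spark, "/tmp/delta/events")
updates = spark.createDataFrame(
    [(1, "Alice", "alice@example.com")], ["id", "name", "email"]  # adds a new `email` column
)
(target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
# Option B: for plain appends/overwrites, evolve the schema per write instead.
updates.write.format("delta").mode("append").option("mergeSchema", "true").save("/tmp/delta/events")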
Warnings
breaking The preview feature for catalog-managed tables was renamed from `catalogOwned-preview` to `catalogManaged` in v4.0.1. Legacy `ucTableId` also transitioned to `io.unitycatalog.tableId`. ↓
fix If using catalog-managed tables in v4.0.0, update Spark configurations and any code referencing the feature to use `catalogManaged` and `io.unitycatalog.tableId` when upgrading to v4.0.1 or later.
gotcha Each `delta-spark` release is built against and optimized for specific Apache Spark versions. While some backward compatibility exists (e.g., Delta 4.1.0 supports Spark 4.1.0 and 4.0.1), major Spark version upgrades can introduce incompatibilities or require specific `delta-spark` versions. ↓
fix Always consult the official Delta Lake release notes and documentation to ensure your `delta-spark` and `pyspark` versions are compatible. Upgrade both in tandem if necessary.
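For example, pinning both packages together (versions taken from the compatibility note above; verify the exact pair against the Delta Lake release notes for your target release):
pip install "pyspark==4.0.1" "delta-spark==4.1.0"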
gotcha The 'catalog-managed tables' feature introduced in v4.0.0 (preview) was explicitly stated to be in an RFC stage and 'subject to change'. Early adopters of this feature in v4.0.0 experienced breaking changes in v4.0.1. ↓
fix For production systems, exercise caution with 'preview' features. If using such features, be prepared for API changes and thoroughly test upgrades. Migrate to the stable naming conventions in later versions.
gotcha Starting with version 4.x, `delta-spark` requires Python 3.10 or newer. ↓
fix Ensure your Python environment is 3.10 or higher. You can check your Python version using `python --version`.
breaking PySpark requires a Java Runtime Environment (JRE) to be installed and the `JAVA_HOME` environment variable to be set, pointing to the JRE/JDK installation directory. Without it, the Java gateway cannot start, leading to `PySparkRuntimeError: [JAVA_GATEWAY_EXITED]`. ↓
fix Ensure a Java Development Kit (JDK) or Java Runtime Environment (JRE) is installed and the `JAVA_HOME` environment variable is correctly set to its installation path. For example, `export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64`.
breaking PySpark's Java gateway startup process relies on common shell utilities (like `bash`, `sh`, `env`) for environment setup. In minimal Linux distributions, such as Alpine, these utilities or their symlinks may not be installed by default, leading to errors like 'env: can't execute 'bash': No such file or directory' and subsequent PySparkRuntimeError: [JAVA_GATEWAY_EXITED] failures. ↓
fix For minimal container environments (e.g., Alpine Linux), ensure that essential shell utilities (like `bash`) and Java Runtime Environment (JRE) are explicitly installed. For Alpine, this typically involves adding `apk add bash openjdk17-jre` (or a suitable JRE version) to your Dockerfile.
Install compatibility · verified · last tested: 2026-05-12
python os / libc status wheel install import disk
3.10 alpine (musl) sdist - 0.47s 505.9M
3.10 alpine (musl) - - 0.46s 505.9M
3.10 slim (glibc) sdist 31.6s 0.34s 506M
3.10 slim (glibc) - - 0.30s 506M
3.11 alpine (musl) sdist - 0.68s 512.0M
3.11 alpine (musl) - - 0.72s 511.9M
3.11 slim (glibc) sdist 31.2s 0.61s 513M
3.11 slim (glibc) - - 0.57s 512M
3.12 alpine (musl) sdist - 0.56s 501.0M
3.12 alpine (musl) - - 0.58s 500.9M
3.12 slim (glibc) sdist 32.0s 0.62s 501M
3.12 slim (glibc) - - 0.60s 501M
3.13 alpine (musl) sdist - 0.56s 500.3M
3.13 alpine (musl) - - 0.59s 500.1M
3.13 slim (glibc) sdist 30.6s 0.56s 501M
3.13 slim (glibc) - - 0.58s 501M
3.9 alpine (musl) sdist - 0.44s 484.2M
3.9 alpine (musl) - - 0.41s 484.2M
3.9 slim (glibc) sdist 31.2s 0.39s 485M
3.9 slim (glibc) - - 0.37s 485M
Imports
- DeltaTable
from delta.tables import DeltaTable
- SparkSession config for Delta
SparkSession.builder.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension").config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
Quickstart · stale · last tested: 2026-04-24
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable
import os
# Configure SparkSession for Delta Lake. configure_spark_with_delta_pip attaches the
# Delta Lake JARs matching the installed delta-spark package, which a plain
# pip-installed PySpark needs in order to find the "delta" data source.
builder = (
    SparkSession.builder.appName("DeltaSparkQuickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
# Create a simple DataFrame
data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
# Define a path for the Delta table
delta_table_path = os.path.join(os.getcwd(), "tmp", "delta_table")
# Write data to a Delta table
print(f"Writing data to Delta table at: {delta_table_path}")
data.write.format("delta").mode("overwrite").save(delta_table_path)
# Read data from the Delta table
print(f"Reading data from Delta table at: {delta_table_path}")
df_read = spark.read.format("delta").load(delta_table_path)
df_read.show()
# Use DeltaTable API for operations (e.g., detail)
delta_table = DeltaTable.forPath(spark, delta_table_path)
print("Delta table description:")
delta_table.detail().show()
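# Time travel (illustrative extension to the quickstart): append a second version,
# then read the previous snapshot back by version number.
spark.createDataFrame([(3, "Carol")], ["id", "name"]).write.format("delta").mode(
    "append"
).save(delta_table_path)
print("Table as of version 0:")
spark.read.format("delta").option("versionAsOf", 0).load(delta_table_path).show()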
# Stop SparkSession
spark.stop()