Delta Lake Python APIs for Apache Spark
delta-spark provides Python APIs for interacting with Delta Lake tables through Apache Spark. It enables operations such as reading, writing, and time-traveling Delta tables, leveraging Spark's distributed processing. The library releases frequently, typically shipping several patch and minor versions per major line. The current version is 4.1.0.
Warnings
- breaking The preview feature for catalog-managed tables was renamed from `catalogOwned-preview` to `catalogManaged` in v4.0.1. The legacy `ucTableId` property was likewise renamed to `io.unitycatalog.tableId`.
- gotcha Each `delta-spark` release is built against and optimized for specific Apache Spark versions. While some backward compatibility exists (e.g., Delta 4.1.0 supports Spark 4.1.0 and 4.0.1), major Spark version upgrades can introduce incompatibilities or require a matching `delta-spark` version.
- gotcha The 'catalog-managed tables' feature introduced in v4.0.0 (preview) was explicitly stated to be in an RFC stage and 'subject to change'. Early adopters of this feature in v4.0.0 experienced breaking changes in v4.0.1.
- gotcha Starting with version 4.x, `delta-spark` requires Python 3.10 or newer.
Install
- pip
pip install delta-spark
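Because each `delta-spark` release targets a specific Spark line (see Warnings), pinning both packages together is a reasonable precaution; a sketch, assuming you want the 4.1.0 line:

```shell
# Pin delta-spark alongside a compatible pyspark build
pip install "pyspark==4.1.0" "delta-spark==4.1.0"
```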
Imports
- DeltaTable
from delta.tables import DeltaTable
- SparkSession config for Delta
SparkSession.builder.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension").config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
Quickstart
from pyspark.sql import SparkSession
from delta.tables import DeltaTable
import os
# Configure SparkSession for Delta Lake
spark = (
    SparkSession.builder.appName("DeltaSparkQuickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)
# Create a simple DataFrame
data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])
# Define a path for the Delta table
delta_table_path = os.path.join(os.getcwd(), "tmp", "delta_table")
# Write data to a Delta table
print(f"Writing data to Delta table at: {delta_table_path}")
data.write.format("delta").mode("overwrite").save(delta_table_path)
# Read data from the Delta table
print(f"Reading data from Delta table at: {delta_table_path}")
df_read = spark.read.format("delta").load(delta_table_path)
df_read.show()
# Use DeltaTable API for operations (e.g., detail)
delta_table = DeltaTable.forPath(spark, delta_table_path)
print("Delta table description:")
delta_table.detail().show()
# Stop SparkSession
spark.stop()