Delta Lake Python APIs for Apache Spark

4.1.0 · active · verified Sat Mar 28

delta-spark provides Python APIs for working with Delta Lake tables through Apache Spark. It supports reading, writing, and time travel (querying earlier versions of a table), leveraging Spark's distributed processing capabilities. The project releases frequently, typically shipping multiple patch and minor versions within each major line; the current version is 4.1.0.

Warnings

Install
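
A minimal install from PyPI; pinning to the version documented above is a reasonable default, and `delta-spark` pulls in a compatible `pyspark` as a dependency:

```shell
# Install delta-spark from PyPI (also installs a compatible pyspark)
pip install delta-spark==4.1.0
```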

Imports
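
The quickstart below relies on these imports; `DeltaTable` lives in the `delta.tables` module shipped with delta-spark:

```python
from pyspark.sql import SparkSession  # Spark entry point
from delta.tables import DeltaTable   # Delta-specific table API (detail, history, merge, ...)
```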

Quickstart

This quickstart demonstrates how to initialize a SparkSession with Delta Lake extensions, write a DataFrame to a Delta table, and then read the data back. It also shows how to get details about the Delta table using the `DeltaTable` API.

from pyspark.sql import SparkSession
from delta.tables import DeltaTable
import os

# Configure SparkSession for Delta Lake.
# Note: when launching from a plain `python` process with pip-installed
# delta-spark (rather than spark-submit with the Delta jars on the classpath),
# wrap the builder with delta.configure_spark_with_delta_pip(builder) so the
# Delta jars are resolved automatically.
spark = (
    SparkSession.builder.appName("DeltaSparkQuickstart")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config(
        "spark.sql.catalog.spark_catalog",
        "org.apache.spark.sql.delta.catalog.DeltaCatalog",
    )
    .getOrCreate()
)

# Create a simple DataFrame
data = spark.createDataFrame([(1, "Alice"), (2, "Bob")], ["id", "name"])

# Define a path for the Delta table
delta_table_path = os.path.join(os.getcwd(), "tmp", "delta_table")

# Write data to a Delta table
print(f"Writing data to Delta table at: {delta_table_path}")
data.write.format("delta").mode("overwrite").save(delta_table_path)

# Read data from the Delta table
print(f"Reading data from Delta table at: {delta_table_path}")
df_read = spark.read.format("delta").load(delta_table_path)
df_read.show()

# Use DeltaTable API for operations (e.g., detail)
delta_table = DeltaTable.forPath(spark, delta_table_path)
print("Delta table details:")
delta_table.detail().show()

# Stop SparkSession
spark.stop()
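
The time-travel capability mentioned in the overview can be sketched as follows. This is a hedged example that reuses the `spark` session, `delta_table_path`, and `delta_table` from the quickstart (run it before `spark.stop()`); `versionAsOf` is the standard Delta reader option for reading a table snapshot by version number:

```python
# Read an earlier snapshot of the table by version number.
# Version 0 is the first commit (the initial overwrite in the quickstart).
df_v0 = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load(delta_table_path)
)
df_v0.show()

# DeltaTable.history() returns a DataFrame of commits, which tells you
# which versions (and timestamps) are available for time travel.
delta_table.history().select("version", "timestamp", "operation").show()
```

Timestamp-based travel works the same way via `.option("timestampAsOf", ...)`, provided the timestamp falls within the table's retained history.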
