Delta Lake Python
`deltalake` is an open-source Python library that provides native Delta Lake bindings built on the `delta-rs` Rust library, offering efficient, robust interaction with Delta Lake tables without Apache Spark or JVM dependencies. It integrates seamlessly with data manipulation libraries such as Pandas, Polars, and PyArrow. The library is actively developed and frequently updated; the current version is 1.5.0.
Warnings
- breaking In `deltalake` v1.5.0, the `get_add_actions` method now returns an `ArrowTable` instead of an `ArrowRecordBatch`. Code relying on the specific `ArrowRecordBatch` type or its API will break.
- breaking Checkpoint schema changes between `deltalake` versions, notably around `0.25.5` and `1.0.2`, can lead to `DeltaError: Failed to parse parquet: Arrow: Incompatible type` when reading or creating checkpoints for older tables, especially if the `nullable` property for fields like `path`, `size`, and `modificationTime` changed from `True` to `False`.
- gotcha The `deltalake` Python library is a native implementation distinct from `delta-spark`. While both interact with Delta Lake, `deltalake` does not require Apache Spark or a JVM. Ensure you are using the correct library for your ecosystem, as `delta-spark` imports (e.g., `from delta.tables import DeltaTable`) are not compatible with `deltalake`.
- gotcha Concurrent write operations (e.g., multiple processes appending or updating a table simultaneously) can raise `ConcurrentAppendException`, `ConcurrentDeleteReadException`, or `ConcurrentModificationException` due to optimistic concurrency control. Delta Lake still guarantees ACID properties, but conflicting commits are not merged automatically; the caller must handle the exception, typically by retrying the operation.
- gotcha Operations like `DeltaTable.delete()` or `write_deltalake(mode="overwrite")` only mark files as removed in the Delta transaction log; the physical Parquet files remain in storage. This can inflate storage costs unless old files are periodically cleaned up with `DeltaTable.vacuum()`.
- gotcha Some operations, especially `MERGE`, may require configuring disk spilling for large datasets to avoid out-of-memory errors.
Install
-
pip install deltalake pandas pyarrow
Imports
- DeltaTable
from deltalake import DeltaTable
- write_deltalake
from deltalake import write_deltalake
Quickstart
import os
import shutil

import pandas as pd
from deltalake import DeltaTable, write_deltalake

# Define a Delta Lake table path
table_path = "./tmp_delta_table"
# Remove any leftover table so the example starts from a clean state
if os.path.exists(table_path):
    shutil.rmtree(table_path)
# 1. Create a Pandas DataFrame
df = pd.DataFrame({"id": [1, 2], "value": ["A", "B"]})
# 2. Write the DataFrame to a Delta Lake table
write_deltalake(table_path, df)
print(f"Initial Delta table created at: {table_path}")
# 3. Load the Delta table
dt = DeltaTable(table_path)
print(f"Current table version: {dt.version()}")
print("Current table data:")
print(dt.to_pandas().to_string(index=False))
# 4. Append new data to the table
new_df = pd.DataFrame({"id": [3, 4], "value": ["C", "D"]})
write_deltalake(table_path, new_df, mode="append")
print("\nData appended. New table version:")
dt_updated = DeltaTable(table_path)
print(f"Current table version: {dt_updated.version()}")
print("Updated table data:")
print(dt_updated.to_pandas().to_string(index=False))
# 5. Read an older version of the table (Time Travel)
dt_v0 = DeltaTable(table_path, version=0)
print("\nData from version 0 (time travel):")
print(dt_v0.to_pandas().to_string(index=False))
# Clean up temporary files (optional)
# shutil.rmtree(table_path)