PyIceberg-Core
`pyiceberg-core` is a foundational Python library that provides the Rust-powered core for PyIceberg, enabling efficient access to Apache Iceberg tables without a JVM. It is primarily intended as an internal dependency of the main PyIceberg library, where it supplies performance optimizations for Iceberg data operations. The current version is 0.9.0, and it is actively maintained as part of the broader Apache Iceberg Python project, with releases tracking PyIceberg.
Warnings
- gotcha `pyiceberg-core` is an internal dependency of `pyiceberg`. While it can be installed separately, it is typically managed as an extra by `pyiceberg` (e.g., `pip install "pyiceberg[pyiceberg-core]"`). Directly using `pyiceberg-core` without `pyiceberg` is not the standard pattern and may not expose a full public API.
- gotcha File I/O with object storage (S3, ADLS, GCS) requires installing specific optional dependencies such as `s3fs`, `adlfs`, `gcsfs`, or `pyarrow` (for local filesystem and some cloud storage via PyArrow's filesystem abstractions). Not installing these will lead to runtime errors when attempting to read/write files.
- deprecated The DataFusion integration with PyIceberg (which uses `pyiceberg-core`) is experimental and carries strict version pinning; `pyiceberg-core 0.9.0`, for example, may require a matching DataFusion release (reportedly `datafusion == 51`).
- gotcha Building `pyiceberg-core` can require a Rust toolchain on certain architectures (e.g., non-x86_64 or for specific environments), especially if pre-built wheels are not available.
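Because version alignment matters (see the DataFusion note above), it can help to check which distributions are actually installed before debugging. A minimal stdlib sketch; note these are PyPI distribution names, not import names:

```python
from importlib.metadata import version, PackageNotFoundError

def installed_version(dist: str):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return version(dist)
    except PackageNotFoundError:
        return None

# PyPI distribution names, not import names.
for dist in ("pyiceberg", "pyiceberg-core", "datafusion"):
    print(dist, "->", installed_version(dist) or "not installed")
```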
Install
- `pip install pyiceberg-core`
- `pip install "pyiceberg[pyiceberg-core,pyarrow]"`
Imports
- load_catalog
from pyiceberg.catalog import load_catalog
- Schema
from pyiceberg.schema import Schema
- NestedField, StringType, LongType, IntegerType
from pyiceberg.types import NestedField, StringType, LongType, IntegerType
Quickstart
import os
import shutil
import pyarrow as pa
from pyiceberg.catalog import load_catalog
from pyiceberg.schema import Schema
from pyiceberg.types import NestedField, StringType, LongType, IntegerType

# Define a temporary warehouse directory
WAREHOUSE_PATH = "/tmp/pyiceberg_warehouse"
CATALOG_DB_PATH = os.path.join(WAREHOUSE_PATH, "pyiceberg_catalog.db")

# Clean up previous run if it exists
if os.path.exists(WAREHOUSE_PATH):
    shutil.rmtree(WAREHOUSE_PATH)
os.makedirs(WAREHOUSE_PATH, exist_ok=True)

# Configure and load a local SQL catalog (requires sqlalchemy,
# e.g. pip install "pyiceberg[sql-sqlite]")
catalog = load_catalog(
    "default",
    type="sql",
    uri=f"sqlite:///{CATALOG_DB_PATH}",
    warehouse=f"file://{WAREHOUSE_PATH}",
)

# Create a namespace (database)
NAMESPACE = "my_namespace"
catalog.create_namespace(NAMESPACE, properties={"comment": "My first Iceberg namespace"})
print(f"Created namespace: {NAMESPACE}")

# Define a schema for the Iceberg table
schema = Schema(
    NestedField(1, "id", LongType(), required=True),
    NestedField(2, "name", StringType()),
    NestedField(3, "age", IntegerType()),
)

# Create an Iceberg table
TABLE_NAME = "my_table"
table = catalog.create_table(
    f"{NAMESPACE}.{TABLE_NAME}",
    schema,
    properties={
        "format-version": "2",
        "write.parquet.compression-codec": "zstd",
    },
)
print(f"Created table: {table.name()}")  # Table.name() returns the identifier

# Prepare data with PyArrow; an explicit Arrow schema matches the Iceberg
# schema (non-nullable id, 32-bit age) so the append passes the schema check
arrow_schema = pa.schema([
    pa.field("id", pa.int64(), nullable=False),
    pa.field("name", pa.string()),
    pa.field("age", pa.int32()),
])
data = pa.table(
    {
        "id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
        "age": [30, 24, 35],
    },
    schema=arrow_schema,
)

# Append data to the table
table.append(data)
print("Appended data to the table.")

# Read data from the table
read_df = table.scan().to_arrow()
print("\nData read from Iceberg table:")
print(read_df.to_pandas())

# Clean up
shutil.rmtree(WAREHOUSE_PATH)
print(f"Cleaned up warehouse at {WAREHOUSE_PATH}")