PyIceberg
Version 0.11.1 · verified Tue May 12
PyIceberg is the official Python implementation of Apache Iceberg, an open table format designed for huge analytic datasets. It offers a pure-Python experience, enabling DML operations and queries on Iceberg tables without a JVM, and integrates with popular Python data tools such as Polars, Pandas, and DuckDB. Currently at version 0.11.1, the library maintains a regular release cadence of minor feature releases and necessary patch updates.
pip install pyiceberg

Common errors
error ModuleNotFoundError: No module named 'pyiceberg' ↓
cause The 'pyiceberg' library or a necessary optional dependency is not installed in the current Python environment, or the environment in a cloud service like AWS Glue is not configured to include it.
fix Install the package using pip install pyiceberg, or with the required extras, e.g. pip install 'pyiceberg[s3fs,pyarrow]'. For AWS Glue, use %additional_python_modules pyiceberg within your notebook or job configuration.

error pyiceberg.exceptions.NoSuchTableError: Table does not exist: <catalog_name>.<namespace>.<table_name> ↓
cause The specified table identifier does not correspond to an existing table in the configured Iceberg catalog, or there is a configuration issue preventing the catalog from finding it (e.g., incorrect catalog URI, permissions, or issues with concurrent access to in-memory databases).
fix Double-check the table name and namespace for typos. Ensure the catalog configuration (type, uri, credentials) is correct and has the necessary permissions. If using an in-memory SQLite catalog for testing, ensure it's configured for a shared cache: jdbc:sqlite:file::memory:?cache=shared. For AWS Glue/Lake Formation, verify correct namespace handling.

error AttributeError: 'pydantic_core._pydantic_core.ValidationInfo' object has no attribute 'current_schema_id' ↓
cause This error typically arises from an incompatibility between `pyiceberg` and specific versions of the `pydantic` library (e.g., `pydantic` 2.12.0 or 2.12.1) due to a regression in `pydantic`.
fix Pin your pydantic version to avoid the problematic releases. You can downgrade to a version prior to 2.12.0 (pip install 'pydantic<2.12.0') or upgrade to a version where the regression is resolved and PyIceberg has adapted (pip install 'pydantic>=2.12.3'). Note the quotes: without them, most shells treat < and > as redirections.

error TypeError: Summary.__init__() missing 1 required positional argument: 'operation' ↓
cause The Iceberg metadata file being parsed is non-conformant to the REST API specification, specifically missing the required 'operation' field within snapshot summaries. This can occur if the metadata was generated by other tools that handle this field differently or omit it.
fix Repair the source metadata files so each snapshot summary includes the required 'operation' field. If the files are generated by another system (e.g., Snowflake), consult that system's documentation or support to ensure conformant metadata generation.
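The shared-cache detail in the NoSuchTableError fix above can be seen with the stdlib sqlite3 module alone: a plain ':memory:' database is private to each connection, so a second catalog handle sees no tables, while a shared-cache URI lets both connections see the same data. The table name below is illustrative, not PyIceberg's actual catalog schema.

```python
import sqlite3

# Named in-memory database with a shared cache -- the Python analogue of the
# JDBC-style URI jdbc:sqlite:file::memory:?cache=shared mentioned above.
uri = "file:catalog_demo?mode=memory&cache=shared"

writer = sqlite3.connect(uri, uri=True)  # keeps the shared database alive
writer.execute("CREATE TABLE iceberg_tables (identifier TEXT)")
writer.execute("INSERT INTO iceberg_tables VALUES ('default.my_table')")
writer.commit()

# A second connection to the same URI sees the table; with a plain
# ':memory:' URI it would get an empty, private database instead.
reader = sqlite3.connect(uri, uri=True)
rows = reader.execute("SELECT identifier FROM iceberg_tables").fetchall()
print(rows)
```

If the writer connection is closed before the reader opens, the shared database is dropped, so keep at least one connection open for the lifetime of the test catalog.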
Warnings
breaking The behavior of `Table.name` changed in version 0.8.1 to return the table name *without* the catalog name. Previously, it might have included the catalog name, which was inconsistent with a broader effort to decouple catalog references. ↓
fix If you relied on `Table.name` to include the catalog identifier, you should adjust your code. Use `Table.identifier` if you need the fully qualified identifier, or reconstruct the full name using the catalog information.
breaking Several AWS-related catalog properties (`profile_name`, `region_name`, `aws_access_key_id`, `aws_secret_access_key`, `aws_session_token`) were deprecated and subsequently removed in version 0.8.0. Unified AWS Credentials should be used instead. ↓
fix Migrate to the unified AWS credentials configuration methods provided by PyIceberg, often leveraging standard AWS environment variables or credentials files, or using new configuration keys like `s3.access-key-id`.
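As a sketch, the unified credentials can live in a catalog configuration file. The `s3.*` keys below follow the naming pattern from the fix above; the catalog name, type, and values are illustrative placeholders, not a verified configuration:

```yaml
# Illustrative ~/.pyiceberg.yaml entry; values are placeholders.
catalog:
  default:
    type: glue
    s3.access-key-id: <access-key-id>
    s3.secret-access-key: <secret-access-key>
    s3.session-token: <session-token>   # only for temporary credentials
    s3.region: us-east-1
```

Standard AWS environment variables or shared credentials files remain the simplest option when they are available.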
breaking Version 0.11.0 included the removal of several previously deprecated features and APIs. While specific removals were not detailed, users upgrading from older minor versions should review release notes for a comprehensive list of removed APIs. ↓
fix Consult the official release notes and upgrade guides for PyIceberg 0.11.0 to identify and adapt code that used deprecated features.
gotcha PyIceberg is designed for programmatic interaction with Iceberg table metadata and data, but it is not intended for heavy compute or large-scale ETL operations. For such tasks, it's recommended to integrate with dedicated query engines like Spark or Flink. ↓
fix Utilize PyIceberg for metadata management, lightweight analytics, and integrating Iceberg into Python pipelines, while offloading heavy data processing to appropriate distributed compute frameworks.
gotcha While PyIceberg can read data in various formats (Parquet, ORC, Avro), it generally relies on existing Iceberg writers (often from other engines) to create new data files. Direct writing of new files through PyIceberg is not its primary focus. ↓
fix Understand that PyIceberg primarily interacts with Iceberg table *metadata* and existing data. For generating new data files, consider using external Iceberg-compatible writing tools or engines that integrate with PyIceberg for catalog operations.
gotcha Deletion operations in PyIceberg primarily use a Copy-on-Write (CoW) strategy by default, rewriting data files. Merge-on-Read (MoR) deletions are more nuanced and work is ongoing to enhance their efficiency, especially for frequent, small updates. ↓
fix Be aware of the performance implications for delete operations, particularly with the CoW strategy, which can involve rewriting significant portions of data. Monitor for updates on MoR delete enhancements if frequent row-level deletions are critical for your use case.
gotcha Using PyIceberg's SQL catalog functionality (e.g., with `catalog_type='sql'`) requires installing additional dependencies like SQLAlchemy and a specific database driver. These are optional components and are not included in the base `pyiceberg` package by default. ↓
fix Install PyIceberg with the appropriate SQL extra based on your database backend. For example, use `pip install 'pyiceberg[sql-postgres]'` for PostgreSQL or `pip install 'pyiceberg[sql-sqlite]'` for SQLite.
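A sketch of what the corresponding catalog configuration might look like in ~/.pyiceberg.yaml, assuming a local PostgreSQL instance installed via the sql-postgres extra; the catalog name, URI, and warehouse path are placeholders:

```yaml
catalog:
  default:
    type: sql
    uri: postgresql+psycopg2://user:password@localhost/pyiceberg_catalog
    warehouse: file:///tmp/warehouse
```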
breaking Installation of PyIceberg (specifically its dependency `pyiceberg-core`) on `alpine` Linux distributions can fail due to missing Rust and C toolchain build dependencies. The error 'Error loading shared library libgcc_s.so.1: No such file or directory' indicates a missing C compiler runtime library, essential for building Rust components used by `pyiceberg-core`. ↓
fix Ensure that necessary build tools and libraries are installed on `alpine` Linux environments before attempting to install PyIceberg. This typically includes `gcc` and `libc-dev` (or the `build-base` meta-package). For example, run `apk add gcc libc-dev` or `apk add build-base`.
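A minimal sketch of an Alpine-based image that installs the toolchain before PyIceberg; the base image tag and pip flags are assumptions, not requirements:

```dockerfile
FROM python:3.12-alpine
# build-base pulls in gcc, libc-dev, and make, needed to build pyiceberg-core
RUN apk add --no-cache build-base
RUN pip install --no-cache-dir pyiceberg
```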
Install

pip install "pyiceberg[s3fs,pyarrow]"

Install compatibility (last tested: 2026-05-12)
| python | os / libc | variant | status | install | import | disk |
|--------|-----------|---------|--------|---------|--------|------|
| 3.10 | alpine (musl) | s3fs,pyarrow | build_error | - | - | - |
| 3.10 | alpine (musl) | s3fs,pyarrow | - | - | - | - |
| 3.10 | alpine (musl) | pyiceberg | wheel | - | 1.30s | 81.9M |
| 3.10 | alpine (musl) | pyiceberg | - | - | 1.33s | 81.5M |
| 3.10 | slim (glibc) | s3fs,pyarrow | wheel | 15.7s | 1.01s | 321M |
| 3.10 | slim (glibc) | s3fs,pyarrow | - | - | 0.94s | 316M |
| 3.10 | slim (glibc) | pyiceberg | wheel | 7.7s | 0.91s | 89M |
| 3.10 | slim (glibc) | pyiceberg | - | - | 0.94s | 88M |
| 3.11 | alpine (musl) | s3fs,pyarrow | build_error | - | - | - |
| 3.11 | alpine (musl) | s3fs,pyarrow | - | - | - | - |
| 3.11 | alpine (musl) | pyiceberg | wheel | - | 1.73s | 91.3M |
| 3.11 | alpine (musl) | pyiceberg | - | - | 2.00s | 90.9M |
| 3.11 | slim (glibc) | s3fs,pyarrow | wheel | 13.0s | 1.58s | 334M |
| 3.11 | slim (glibc) | s3fs,pyarrow | - | - | 1.54s | 330M |
| 3.11 | slim (glibc) | pyiceberg | wheel | 6.9s | 1.57s | 98M |
| 3.11 | slim (glibc) | pyiceberg | - | - | 1.53s | 98M |
| 3.12 | alpine (musl) | s3fs,pyarrow | build_error | - | - | - |
| 3.12 | alpine (musl) | s3fs,pyarrow | - | - | - | - |
| 3.12 | alpine (musl) | pyiceberg | wheel | - | 1.75s | 81.8M |
| 3.12 | alpine (musl) | pyiceberg | - | - | 1.86s | 81.4M |
| 3.12 | slim (glibc) | s3fs,pyarrow | wheel | 10.9s | 1.69s | 324M |
| 3.12 | slim (glibc) | s3fs,pyarrow | - | - | 1.84s | 320M |
| 3.12 | slim (glibc) | pyiceberg | wheel | 6.0s | 1.70s | 88M |
| 3.12 | slim (glibc) | pyiceberg | - | - | 1.76s | 88M |
| 3.13 | alpine (musl) | s3fs,pyarrow | build_error | - | - | - |
| 3.13 | alpine (musl) | s3fs,pyarrow | - | - | - | - |
| 3.13 | alpine (musl) | pyiceberg | wheel | - | 1.33s | 81.6M |
| 3.13 | alpine (musl) | pyiceberg | - | - | 1.40s | 81.1M |
| 3.13 | slim (glibc) | s3fs,pyarrow | wheel | 12.1s | 1.35s | 386M |
| 3.13 | slim (glibc) | s3fs,pyarrow | - | - | 1.43s | 382M |
| 3.13 | slim (glibc) | pyiceberg | wheel | 6.2s | 1.30s | 88M |
| 3.13 | slim (glibc) | pyiceberg | - | - | 1.45s | 88M |
| 3.9 | alpine (musl) | s3fs,pyarrow | build_error | - | - | - |
| 3.9 | alpine (musl) | s3fs,pyarrow | - | - | - | - |
| 3.9 | alpine (musl) | pyiceberg | wheel | - | 1.08s | 66.8M |
| 3.9 | alpine (musl) | pyiceberg | - | - | 1.13s | 66.5M |
| 3.9 | slim (glibc) | s3fs,pyarrow | wheel | 17.7s | 1.00s | 345M |
| 3.9 | slim (glibc) | s3fs,pyarrow | - | - | 0.97s | 344M |
| 3.9 | slim (glibc) | pyiceberg | wheel | 8.6s | 0.93s | 65M |
| 3.9 | slim (glibc) | pyiceberg | - | - | 0.96s | 65M |
Imports

- load_catalog
  from pyiceberg.catalog import load_catalog
- Schema
  from pyiceberg.schema import Schema
- Table
  wrong: from pyiceberg.catalog.table import Table
  correct: from pyiceberg.table import Table
Quickstart (last tested: 2026-04-24)
import datetime
import os

import pyarrow as pa

from pyiceberg.catalog import load_catalog
from pyiceberg.exceptions import TableAlreadyExistsError
from pyiceberg.schema import Schema
from pyiceberg.types import LongType, NestedField, StringType, TimestampType

# Configure a local SQL catalog backed by SQLite.
# For simplicity, we'll use a temporary local directory as the warehouse.
warehouse_path = "/tmp/pyiceberg_warehouse"
os.makedirs(warehouse_path, exist_ok=True)
catalog = load_catalog(
    "default",
    type="sql",
    uri=f"sqlite:///{warehouse_path}/pyiceberg_catalog.db",
    warehouse=f"file://{warehouse_path}",
)
print(f"✓ Successfully loaded catalog: {catalog.name}")

# Define a simple schema; TimestampType is a timestamp without a timezone
schema = Schema(
    NestedField(field_id=1, name="id", field_type=LongType(), required=True),
    NestedField(field_id=2, name="name", field_type=StringType(), required=False),
    NestedField(field_id=3, name="event_time", field_type=TimestampType(), required=False),
)

namespace = "default"
table_name = "my_sample_table"

# Create the namespace if it doesn't already exist
catalog.create_namespace_if_not_exists(namespace)
print(f"✓ Ensured namespace '{namespace}' exists.")

# Create the table, or load it if it already exists
try:
    table = catalog.create_table(f"{namespace}.{table_name}", schema)
    print(f"✓ Successfully created table: {table.identifier}")
except TableAlreadyExistsError:
    print(f"Table {namespace}.{table_name} already exists. Loading it instead.")
    table = catalog.load_table(f"{namespace}.{table_name}")
# Prepare some data using PyArrow; "id" is declared non-nullable so the
# Arrow schema matches the required Iceberg field
arrow_schema = pa.schema([
    pa.field("id", pa.int64(), nullable=False),
    pa.field("name", pa.string()),
    pa.field("event_time", pa.timestamp("us")),
])
data = pa.table(
    {
        "id": [1, 2, 3],
        "name": ["Alice", "Bob", "Charlie"],
        "event_time": [datetime.datetime.now() - datetime.timedelta(days=i) for i in range(3)],
    },
    schema=arrow_schema,
)
# Append data to the table
table.append(data)
print(f"✓ Appended {len(data)} rows to the table.")
# Read data from the table
scan_result = table.scan().to_arrow()
print(f"\nTotal rows read: {len(scan_result)}")
print("Sample data:")
print(scan_result.to_pandas())
# Clean up (optional: uncomment to drop the table and namespace)
# catalog.drop_table(f"{namespace}.{table_name}")
# catalog.drop_namespace(namespace)
# print(f"Cleaned up table {table_name} and namespace {namespace}.")