PyORC
PyORC is a Python module designed for efficiently reading and writing data in the Apache ORC (Optimized Row Columnar) file format. It provides high-performance access to ORC files, commonly used in big data ecosystems like Apache Hive, Spark, and Flink. The current version is 0.11.0, and the library maintains an active development schedule with several releases per year.
Common errors
- TypeError: object of type '_io.BufferedReader' has no len()
  cause: Attempting to pass a file-like object directly to the `path` argument of `pyorc.Reader` after version 0.10.0. The `path` argument now expects a string or `pathlib.Path`.
  fix: Use the `file_like_object` keyword argument for file-like objects: `reader = pyorc.Reader(file_like_object=f_obj)`.
- pyorc.errors.PyorcDataError: The given data is not compatible with the current schema
  cause: The Python tuple/list passed to `writer.write()` does not match the structure or data types specified in the `TypeDescription` provided to the `pyorc.Writer`.
  fix: Review your `TypeDescription` and the data being written. Ensure column order, data types, and (for decimals) precision/scale are correctly aligned. For example, if the schema expects `string`, don't pass `int`.
- pyorc.errors.ParseError: Malformed ORC file
  cause: The ORC file being read is corrupted, truncated, or not a valid ORC file. This can also occur if the file was written with incompatible compression or encoding options.
  fix: Verify the integrity of the ORC file. Ensure it's not truncated. If writing, try different `compression` or `stripe_size` options. If reading, ensure the file is indeed an ORC file.
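One cheap way to act on the "ensure the file is indeed an ORC file" advice is to look at the first bytes before handing the file to a reader. The sketch below uses only the standard library and relies on the ORC format's three-byte `ORC` header. The helper name `looks_like_orc` is hypothetical and not part of PyORC. Passing this check does not prove the file is intact (a truncated file still starts with the magic bytes), so treat it as a pre-filter only.

```python
ORC_MAGIC = b"ORC"  # ORC files begin with these three bytes


def looks_like_orc(path):
    """Return True if the file at `path` starts with the ORC magic bytes.

    Hypothetical helper, not a PyORC function. It only inspects the header,
    so it cannot detect truncation or corruption later in the file.
    """
    with open(path, "rb") as f:
        return f.read(len(ORC_MAGIC)) == ORC_MAGIC
```

A file that fails this check (for example a Parquet or CSV file with an `.orc` extension) can be rejected with a clearer message than a generic parse error.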
Warnings
- breaking The `pyorc.Reader` constructor's behavior changed significantly for file-like objects. Previously, you could pass a file-like object directly as the `path` argument. Now, `path` is strictly for string or `pathlib.Path` objects. For file-like objects, you must pass them to the `file_like_object` keyword argument.
- breaking The `TypeKind` enum was moved directly under the `pyorc` module. Additionally, default values for some `Column` attributes (e.g., `tzinfo`) were changed.
- gotcha When writing `datetime` objects to ORC `timestamp` columns, it is highly recommended to use timezone-aware `datetime` objects (e.g., using `datetime.timezone.utc` or `pytz`). Writing naive `datetime` objects can lead to ambiguous or incorrect time interpretations in downstream systems.
- gotcha PyORC enforces strict schema matching during writing. If the Python data types or structure do not align precisely with the defined ORC `TypeDescription`, a `PyorcDataError` will be raised. This includes precision and scale for `decimal` types.
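The two gotchas above can be handled before a row ever reaches the writer. The following is a minimal sketch using only the standard library: `to_utc_aware` and `fit_decimal` are hypothetical helper names, and the choice to treat naive datetimes as UTC is an assumption that has to match how the data was actually produced.

```python
import datetime
import decimal


def to_utc_aware(dt):
    """Return a timezone-aware UTC datetime.

    Assumption: a naive datetime is already expressed in UTC, so UTC is
    attached without shifting the clock time. Aware datetimes are converted.
    """
    if dt.tzinfo is None or dt.tzinfo.utcoffset(dt) is None:
        return dt.replace(tzinfo=datetime.timezone.utc)
    return dt.astimezone(datetime.timezone.utc)


def fit_decimal(value, precision=10, scale=2):
    """Round `value` to `scale` places and check it fits decimal(precision, scale)."""
    quantum = decimal.Decimal(1).scaleb(-scale)  # Decimal("0.01") when scale=2
    rounded = decimal.Decimal(value).quantize(
        quantum, rounding=decimal.ROUND_HALF_EVEN
    )
    # Count all significant digits (integer and fractional part together).
    if len(rounded.as_tuple().digits) > precision:
        raise ValueError(
            f"{value!r} does not fit decimal({precision},{scale})"
        )
    return rounded
```

With a schema column declared as `decimal(10,2)`, calling `fit_decimal("10.5")` yields `Decimal("10.50")`, and a value with too many digits fails early with a `ValueError` instead of surfacing later as a schema mismatch during the write.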
Install
- pip install pyorc
- pip install 'pyorc[dataframe]'
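After installing, it can be useful to confirm which version ended up in the active environment, since the behavior described under Warnings depends on the release. A minimal sketch using the standard library's `importlib.metadata` (Python 3.8+); the function name `installed_version` is hypothetical:

```python
from importlib import metadata


def installed_version(dist_name="pyorc"):
    """Return the installed version string of `dist_name`, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version()
    print("pyorc is not installed" if version is None else f"pyorc {version}")
```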
Imports
- Reader
from pyorc import Reader
- Writer
from pyorc import Writer
- TypeDescription
from pyorc import TypeDescription
- TypeKind
from pyorc.enums import TypeKind  # older import path
from pyorc import TypeKind  # current import path (see Warnings)
Quickstart
import pyorc
import os
import datetime
import decimal
# Define a schema for demonstration
schema_str = "struct<id:int,name:string,value:decimal(10,2),timestamp:timestamp>"
schema = pyorc.TypeDescription.from_string(schema_str)
file_path = "example.orc"
# --- Writing an ORC file ---
print(f"Writing to {file_path}")
with open(file_path, "wb") as f:
    with pyorc.Writer(f, schema) as writer:
        writer.write((1, "Alice", decimal.Decimal("10.50"), datetime.datetime(2023, 1, 1, 10, 0, 0, tzinfo=datetime.timezone.utc)))
        writer.write((2, "Bob", decimal.Decimal("20.75"), datetime.datetime(2023, 1, 2, 11, 30, 0, tzinfo=datetime.timezone.utc)))
        writer.write((3, "Charlie", decimal.Decimal("30.00"), datetime.datetime(2023, 1, 3, 12, 0, 0, tzinfo=datetime.timezone.utc)))
print(f"Successfully wrote {file_path}")
# --- Reading an ORC file ---
print(f"Reading from {file_path}")
with open(file_path, "rb") as f:
    # Pass the open binary file object to the Reader constructor
    reader = pyorc.Reader(f)
    print("Schema:", reader.schema)
    print("Rows:")
    # Iterate while the file is still open; rows are read lazily
    for row in reader:
        print(row)
# Clean up
if os.path.exists(file_path):
    os.remove(file_path)
    print(f"Cleaned up {file_path}")