PyORC

0.11.0 · active · verified Thu Apr 16

PyORC is a Python module for reading and writing files in the Apache ORC (Optimized Row Columnar) format, which is commonly used in big data ecosystems such as Apache Hive, Spark, and Flink. The current version is 0.11.0, and the library is actively developed, with several releases per year.

Common errors

Warnings

Install

Imports

Quickstart

This quickstart shows how to define an ORC schema, write rows to an ORC file with `pyorc.Writer`, and read them back with `pyorc.Reader`. The schema covers the `int`, `string`, `decimal`, and `timestamp` types, and the timestamp values are timezone-aware `datetime` objects.

import pyorc
import os
import datetime
import decimal

# Define a schema for demonstration
schema_str = "struct<id:int,name:string,value:decimal(10,2),timestamp:timestamp>"
schema = pyorc.TypeDescription.from_string(schema_str)

file_path = "example.orc"

# --- Writing an ORC file ---
print(f"Writing to {file_path}")
with open(file_path, "wb") as f:
    with pyorc.Writer(f, schema) as writer:
        writer.write((1, "Alice", decimal.Decimal("10.50"), datetime.datetime(2023, 1, 1, 10, 0, 0, tzinfo=datetime.timezone.utc)))
        writer.write((2, "Bob", decimal.Decimal("20.75"), datetime.datetime(2023, 1, 2, 11, 30, 0, tzinfo=datetime.timezone.utc)))
        writer.write((3, "Charlie", decimal.Decimal("30.00"), datetime.datetime(2023, 1, 3, 12, 0, 0, tzinfo=datetime.timezone.utc)))

print(f"Successfully wrote {file_path}")

# --- Reading an ORC file ---
print(f"Reading from {file_path}")
with open(file_path, "rb") as f:
    # Reader takes a file-like object opened in binary mode
    reader = pyorc.Reader(f)
    print("Schema:", reader.schema)
    print("Rows:")
    for row in reader:
        print(row)

# Clean up
if os.path.exists(file_path):
    os.remove(file_path)
    print(f"Cleaned up {file_path}")
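The quickstart passes values that already match the schema: a `Decimal` with two fractional digits and a timezone-aware `datetime`. When input rows come from elsewhere, it can help to coerce them first. The following is a minimal sketch using only the standard library. `normalize_row` is a hypothetical helper and not part of PyORC, and treating naive datetimes as UTC is an assumption that may not hold for your data.

```python
import datetime
import decimal

# Two fractional digits, matching decimal(10,2) in the quickstart schema.
TWO_PLACES = decimal.Decimal("0.01")


def normalize_row(row):
    """Hypothetical helper (not part of PyORC): coerce an
    (id, name, value, timestamp) tuple to match
    struct<id:int,name:string,value:decimal(10,2),timestamp:timestamp>."""
    row_id, name, value, ts = row
    # Accept str, int, or Decimal input and round to two fractional digits.
    value = decimal.Decimal(value).quantize(TWO_PLACES)
    # Assumption: naive datetimes are meant as UTC; aware ones are converted to UTC.
    if ts.tzinfo is None:
        ts = ts.replace(tzinfo=datetime.timezone.utc)
    else:
        ts = ts.astimezone(datetime.timezone.utc)
    return (int(row_id), str(name), value, ts)


plus_one = datetime.timezone(datetime.timedelta(hours=1))
raw_rows = [
    (1, "Alice", "10.5", datetime.datetime(2023, 1, 1, 10, 0, 0)),
    (2, "Bob", decimal.Decimal("20.749"),
     datetime.datetime(2023, 1, 2, 12, 30, 0, tzinfo=plus_one)),
]

normalized = [normalize_row(r) for r in raw_rows]
for r in normalized:
    print(r)
```

Each normalized tuple can then be passed to `writer.write(...)` as in the quickstart. Building the `Decimal` from a string rather than a float avoids binary floating-point artifacts before rounding.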
