Petastorm
Petastorm is an open-source Python library that enables single-node or distributed training and evaluation of machine learning models directly from datasets stored in Apache Parquet format. It provides data access for popular frameworks such as TensorFlow, PyTorch, and PySpark. Releases follow a feature-driven cadence, often with release candidates preceding stable versions.
Common errors
- `ModuleNotFoundError: No module named 'petastorm.spark'`
  cause: The `petastorm.spark` module is only available if Petastorm was installed with the `spark` extra and `pyspark` is present.
  fix: Install Petastorm with the Spark extra (`pip install petastorm[spark]`) and ensure `pyspark` is installed in your environment.
- `FileNotFoundError: [Errno 2] No such file or directory: 'file:///path/to/my_dataset'`
  cause: The dataset URL or path passed to `make_reader` (or to the writing step) does not exist or is inaccessible; common causes are an incorrect path, a network-drive issue, or missing data.
  fix: Double-check that `dataset_url` points to an existing dataset directory (or a directory you intend to write to). For HDFS/S3, verify authentication and client setup.
- `pyarrow.lib.ArrowInvalid: Could not convert ...`
  cause: A data-type or schema mismatch when writing or reading: the data being processed does not conform to the `Unischema` or the expected Parquet types.
  fix: Verify that the `Unischema` definition matches the actual dtypes and shapes you are writing. When reading, ensure the schema Petastorm uses aligns with the schema of the Parquet files.
Warnings
- breaking The default `reader_pool_type` for `make_reader` changed from 'thread' to 'process' in Petastorm v0.13.0. This can cause issues if your data contains objects that are not picklable, or if you expect thread-based concurrency.
- deprecated The `PetastormDataset` class (e.g., from `petastorm.reader`) and direct instantiation of `Reader` were deprecated in favor of the `make_reader` factory function.
- gotcha Using Petastorm with TensorFlow or PyTorch requires installing the corresponding 'extras' (e.g., `pip install petastorm[tensorflow]`). Without these, you might miss framework-specific utilities or experience integration issues.
- gotcha When using `make_reader` with Spark, ensure `pyspark` is installed and the `spark` extra is included during Petastorm installation. Otherwise, Spark-specific modules will be missing.
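A process-based reader pool must pickle whatever crosses the worker boundary, so objects that work fine with `reader_pool_type='thread'` (e.g. lambda transforms) can fail with `'process'`. A stdlib-only way to check an object before switching pool types, sketched here (the `is_picklable` helper is illustrative, not part of Petastorm):

```python
import pickle

def is_picklable(obj):
    """Return True if obj can be serialized with pickle."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, TypeError, AttributeError):
        return False

print(is_picklable({'id': 1, 'value': 2.0}))  # plain data pickles fine: True
print(is_picklable(lambda row: row))          # lambdas cannot be pickled: False
```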
Install
- pip install petastorm
- pip install petastorm[tensorflow]
- pip install petastorm[pytorch]
- pip install petastorm[spark]
Imports
- make_reader
from petastorm import make_reader
- materialize_dataset (writing datasets via Spark)
from petastorm.etl.dataset_metadata import materialize_dataset
- Unischema
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row
- DataLoader (PyTorch)
from petastorm.pytorch import DataLoader
- SparkDatasetConverter
from petastorm.spark import SparkDatasetConverter, make_spark_converter
- Reader (direct import; deprecated, prefer make_reader)
from petastorm.reader import Reader
from petastorm import make_reader
Quickstart
import shutil
import numpy as np
from pyspark.sql import SparkSession
from pyspark.sql.types import IntegerType, DoubleType
from petastorm import make_reader
from petastorm.codecs import ScalarCodec, CompressedNdarrayCodec
from petastorm.etl.dataset_metadata import materialize_dataset
from petastorm.unischema import Unischema, UnischemaField, dict_to_spark_row

# 1. Define a schema for your data. ScalarCodec takes a Spark SQL type;
#    ndarray fields use a codec such as CompressedNdarrayCodec.
MySchema = Unischema('MySchema', [
    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),
    UnischemaField('value', np.float64, (), ScalarCodec(DoubleType()), False),
    UnischemaField('image', np.uint8, (10, 10, 3), CompressedNdarrayCodec(), False),
])

# 2. Define a dataset path (a temporary local directory for this example)
dataset_url = 'file:///tmp/petastorm_example_data'

def row_generator(i):
    """Produce one row as a dict matching MySchema."""
    return {
        'id': i,
        'value': float(i * 10),
        'image': np.random.randint(0, 255, size=(10, 10, 3), dtype=np.uint8),
    }

# 3. Write dummy data. Petastorm datasets are written through Spark;
#    materialize_dataset adds the metadata Petastorm needs to the Parquet store.
spark = SparkSession.builder.master('local[2]').getOrCreate()
sc = spark.sparkContext
print(f"Writing dummy data to {dataset_url}...")
with materialize_dataset(spark, dataset_url, MySchema, 256):  # 256 MB row groups
    rows_rdd = sc.parallelize(range(10)) \
        .map(row_generator) \
        .map(lambda r: dict_to_spark_row(MySchema, r))
    spark.createDataFrame(rows_rdd, MySchema.as_spark_schema()) \
        .write.mode('overwrite').parquet(dataset_url)
print("Successfully wrote 10 rows.")

# 4. Read data using make_reader.
#    reader_pool_type='thread' is often suitable for local development;
#    'process' may be preferred in production depending on data access patterns.
print("\nReading data from the dataset:")
with make_reader(dataset_url, reader_pool_type='thread', num_epochs=1) as reader:
    for i, row in enumerate(reader):
        print(f"Row {i}: id={row.id}, value={row.value}, image_shape={row.image.shape}")
        if i >= 2:  # print only a few rows for brevity
            break
print("Finished reading example data.")

# Clean up the temporary dataset
shutil.rmtree('/tmp/petastorm_example_data')