{"id":7497,"library":"petastorm","title":"Petastorm","description":"Petastorm is a Python library that enables single-node or distributed training of machine learning models directly from datasets stored in Parquet format. It provides data access for popular frameworks like TensorFlow, PyTorch, and Apache Spark. The current stable version is 0.13.1; releases are feature-driven and often preceded by release candidates.","status":"active","version":"0.13.1","language":"en","source_language":"en","source_url":"https://github.com/uber/petastorm","tags":["data processing","machine learning","tensorflow","pytorch","parquet","dataloader","spark"],"install":[{"cmd":"pip install petastorm","lang":"bash","label":"Base installation"},{"cmd":"pip install petastorm[tf]","lang":"bash","label":"For TensorFlow integration"},{"cmd":"pip install petastorm[torch]","lang":"bash","label":"For PyTorch integration"},{"cmd":"pip install petastorm","lang":"bash","label":"For Apache Spark integration (pyspark is a core dependency of the base installation)"}],"dependencies":[{"reason":"Core dependency for array handling.","package":"numpy","optional":false},{"reason":"Required for Parquet file I/O operations.","package":"pyarrow","optional":false},{"reason":"Often used for data manipulation before writing to Parquet or after reading.","package":"pandas","optional":true},{"reason":"Required for writing datasets via materialize_dataset and for the petastorm.spark module; installed automatically as a core dependency.","package":"pyspark","optional":false}],"imports":[{"symbol":"make_reader","correct":"from petastorm import make_reader"},{"note":"Petastorm has no `make_writer` factory; datasets are written through Spark using the `materialize_dataset` context manager.","wrong":"from petastorm import make_writer","symbol":"materialize_dataset (writing)","correct":"from petastorm.etl.dataset_metadata import materialize_dataset"},{"symbol":"Unischema","correct":"from petastorm.unischema import Unischema"},{"symbol":"DataLoader (PyTorch)","correct":"from petastorm.pytorch import DataLoader"},{"note":"The `spark` module was moved directly under `petastorm` for easier access as of v0.10.0.","wrong":"from petastorm.etl.spark_dataset_converter import 
SparkDatasetConverter","symbol":"SparkDatasetConverter","correct":"from petastorm.spark import SparkDatasetConverter"},{"note":"`make_reader` is the recommended factory function for creating readers and managing resources, especially since `Reader` from `petastorm.reader` was deprecated/removed in v0.13.0.","wrong":"from petastorm.reader import Reader","symbol":"Reader (direct import)","correct":"from petastorm import make_reader"}],"quickstart":{"code":"import numpy as np\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.types import DoubleType, IntegerType\n\nfrom petastorm import make_reader\nfrom petastorm.codecs import CompressedNdarrayCodec, ScalarCodec\nfrom petastorm.etl.dataset_metadata import materialize_dataset\nfrom petastorm.unischema import dict_to_spark_row, Unischema, UnischemaField\n\n# 1. Define a schema for your data.\n# ScalarCodec takes a Spark SQL type (e.g. IntegerType()), not a numpy dtype.\nMySchema = Unischema('MySchema', [\n    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),\n    UnischemaField('value', np.float64, (), ScalarCodec(DoubleType()), False),\n    UnischemaField('image', np.uint8, (10, 10, 3), CompressedNdarrayCodec(), False),\n])\n\n\ndef row_generator(i):\n    return {'id': i,\n            'value': float(i * 10),\n            'image': np.random.randint(0, 255, dtype=np.uint8, size=(10, 10, 3))}\n\n\n# 2. Write a small dataset. Petastorm writes through Spark: the\n# materialize_dataset context manager adds the metadata that make_reader needs.\ndataset_url = 'file:///tmp/petastorm_example_data'\nspark = SparkSession.builder.master('local[2]').getOrCreate()\nsc = spark.sparkContext\nrowgroup_size_mb = 2\nwith materialize_dataset(spark, dataset_url, MySchema, rowgroup_size_mb):\n    rows_rdd = sc.parallelize(range(10)) \\\n        .map(row_generator) \\\n        .map(lambda d: dict_to_spark_row(MySchema, d))\n    spark.createDataFrame(rows_rdd, MySchema.as_spark_schema()) \\\n        .write.mode('overwrite').parquet(dataset_url)\nprint('Successfully wrote 10 rows.')\n\n# 3. Read data using make_reader\n# reader_pool_type='thread' is often suitable for local development;\n# 'process' may perform better for heavy decoding workloads.\nwith make_reader(dataset_url, reader_pool_type='thread', num_epochs=1) as reader:\n    for i, row in enumerate(reader):\n        print(f'Row {i}: id={row.id}, value={row.value}, image_shape={row.image.shape}')\n        if i >= 2:  # Print only a few rows for brevity\n            break\nprint('Finished reading example data.')\n","lang":"python","description":"This quickstart defines a `Unischema`, writes a small Parquet dataset through Spark using the `materialize_dataset` context manager (Petastorm has no standalone writer API; writing always goes through Spark), and reads it back with `make_reader`. For real-world usage, tune `reader_pool_type` and `num_epochs` to your training requirements."},"warnings":[{"fix":"If you encounter `TypeError: cannot pickle ...` or unexpected performance, explicitly set `reader_pool_type='thread'` in your `make_reader` call: `make_reader(..., reader_pool_type='thread', ...)`.","message":"The default `reader_pool_type` for `make_reader` changed from 'thread' to 'process' in Petastorm v0.13.0. 
This can cause issues if your data contains objects that are not picklable, or if you expect thread-based concurrency.","severity":"breaking","affected_versions":">=0.13.0"},{"fix":"Always use `from petastorm import make_reader` and instantiate readers via `make_reader(...)` for future compatibility and resource management.","message":"Direct instantiation of the `Reader` class from `petastorm.reader` was deprecated in favor of the `make_reader` factory function, which handles resource setup and teardown.","severity":"deprecated","affected_versions":">=0.13.0"},{"fix":"Ensure you install Petastorm with the relevant extras for your ML framework: `pip install petastorm[tf]` or `pip install petastorm[torch]`.","message":"Using Petastorm with TensorFlow or PyTorch requires installing the corresponding 'extras' (e.g., `pip install petastorm[tf]`). Without these, you might miss framework-specific utilities or experience integration issues.","severity":"gotcha","affected_versions":"All"},{"fix":"Upgrade Petastorm if needed (`pip install -U petastorm`); `pyspark` is installed automatically as a core dependency.","message":"The `petastorm.spark` module (e.g., `SparkDatasetConverter`) and dataset writing via `materialize_dataset` require a working `pyspark` installation. Since `pyspark` ships as a core Petastorm dependency, a missing Spark module usually indicates a broken or outdated installation.","severity":"gotcha","affected_versions":"All"}],"env_vars":null,"last_verified":"2026-04-16T00:00:00.000Z","next_check":"2026-07-15T00:00:00.000Z","problems":[{"fix":"Upgrade Petastorm (`pip install -U petastorm`); the `petastorm.spark` module is available as of v0.10.0. Also verify that `pyspark` imports correctly in the same environment.","cause":"The installed Petastorm version predates the `petastorm.spark` module, or `pyspark` (a core dependency) is missing or broken in the environment.","error":"ModuleNotFoundError: No module named 'petastorm.spark'"},{"fix":"Double-check the `dataset_url` to ensure it's a valid path to an existing dataset directory (or a directory where you intend to write data). 
For HDFS/S3, ensure proper authentication and client setup.","cause":"The specified dataset URL or path for `make_reader` or `materialize_dataset` does not exist or is inaccessible. This often happens with incorrect paths, network drive issues, or missing data.","error":"FileNotFoundError: [Errno 2] No such file or directory: 'file:///path/to/my_dataset'"},{"fix":"Verify that the `Unischema` definition correctly matches the actual data types and shapes you are writing. For reading, ensure the schema used by Petastorm aligns with the schema of the Parquet files.","cause":"This error typically indicates a data type or schema mismatch when writing or reading data. The data being processed doesn't conform to the `Unischema` or expected Parquet types.","error":"pyarrow.lib.ArrowInvalid: Could not convert ..."}]}