Parquet
The `parquet` library (parquet-python) is a pure-Python implementation of the Apache Parquet file format. As of its last release (version 1.3.1), it offers read-only support for Parquet files, allowing users to extract data as JSON or TSV. The project states that performance has not been optimized and that many features, including writing, are not implemented. Development on GitHub appears to have stopped in 2017 and the last PyPI upload was in 2020, so the project should be treated as unmaintained.
Warnings
- breaking The `parquet` library (parquet-python) is explicitly a read-only implementation of the Parquet format; it does not support writing Parquet files.
- gotcha This library is largely unmaintained and has not seen significant development since 2017 (GitHub) / 2020 (PyPI). Many features of the Parquet format, including nested data, are not fully implemented or tested, and performance is explicitly stated as 'not yet optimized'.
- deprecated The library officially supports Python 2.7, 3.6, and 3.7. Compatibility with newer Python versions (3.8+) is not guaranteed and unlikely to be addressed due to the project's abandonment.
- gotcha The project carries the 'Development Status :: 3 - Alpha' classifier on PyPI, which marks it as unstable and experimental despite its age.
- gotcha The `fastparquet` library was forked from `parquet-python` in 2016 specifically because `parquet-python` was 'not designed for vectorised loading of big data or parallel access,' highlighting its performance limitations for large-scale data.
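The supported-version warning above can be turned into an early check at startup. The following is a minimal sketch, not part of the library: `TESTED_VERSIONS`, `is_tested_python`, and `warn_if_untested` are hypothetical names, and the version list is copied from the warning above.

```python
import sys
import warnings

# Versions listed as officially supported in the warning above.
TESTED_VERSIONS = {(2, 7), (3, 6), (3, 7)}


def is_tested_python(version_info=None):
    """Return True if the (major, minor) pair is one the library lists as supported."""
    if version_info is None:
        version_info = sys.version_info
    return (version_info[0], version_info[1]) in TESTED_VERSIONS


def warn_if_untested():
    """Emit a warning when the running interpreter is outside the listed versions."""
    if not is_tested_python():
        warnings.warn(
            "parquet-python is not tested on Python %d.%d; reads may fail."
            % (sys.version_info[0], sys.version_info[1])
        )


if __name__ == "__main__":
    warn_if_untested()
```

A warning rather than a hard failure is used here because the library may still work on newer interpreters; that is simply not guaranteed.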
Install
- pip install parquet
- pip install 'parquet[snappy]'
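After installing, it can be useful to confirm which pieces are actually importable before opening a file. This is a minimal sketch under one assumption: that the `snappy` extra pulls in the `python-snappy` package, whose importable module is named `snappy`. The `has_module` helper is a hypothetical name used for illustration.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # 'snappy' is assumed to be the module installed by the optional extra.
    for name in ("parquet", "snappy"):
        status = "available" if has_module(name) else "missing"
        print("{}: {}".format(name, status))
```

If `snappy` is reported missing, Snappy-compressed files are the ones most likely to fail to read.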
Imports
- DictReader
  from parquet import DictReader
  import parquet
  parquet.DictReader(...)
- reader
  import parquet
  parquet.reader(...)
Quickstart
import json

import parquet

# This library is read-only, so it cannot create a sample file itself.
# The example assumes an existing 'test.parquet' containing the rows:
#   {'foo': 1, 'bar': 2, 'baz': 3}
#   {'foo': 4, 'bar': 5, 'baz': 6}
# Replace 'test.parquet' with the path to your own file.

try:
    # DictReader yields one dict per row, limited to the requested columns.
    with open("test.parquet", "rb") as fo:
        print("Reading 'test.parquet' with DictReader (columns 'foo', 'bar'):")
        for row in parquet.DictReader(fo, columns=['foo', 'bar']):
            print(json.dumps(row))

    # reader yields one sequence of values per row.
    with open("test.parquet", "rb") as fo:
        print("\nReading 'test.parquet' with reader (columns 'foo', 'bar'):")
        for row in parquet.reader(fo, columns=['foo', 'bar']):
            print(",".join([str(r) for r in row]))
except FileNotFoundError:
    print("Error: 'test.parquet' not found. Point the example at an existing Parquet file.")
except Exception as e:
    print(f"An error occurred: {e}")
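
The overview mentions extracting data as TSV. The following is a minimal sketch of doing that with rows such as those yielded by `parquet.reader`, assuming each row is a sequence of values in the requested column order. `rows_to_tsv` is a hypothetical helper, and the sample rows stand in for real reader output so the block runs without a Parquet file.

```python
import csv
import io


def rows_to_tsv(rows, header=None):
    """Serialize an iterable of row sequences to a TSV string."""
    buf = io.StringIO()
    # csv.writer quotes any value that itself contains a tab or newline.
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    if header:
        writer.writerow(header)
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


if __name__ == "__main__":
    # Stand-in for: parquet.reader(fo, columns=['foo', 'bar'])
    sample_rows = [[1, 2], [4, 5]]
    print(rows_to_tsv(sample_rows, header=["foo", "bar"]), end="")
```

Using the `csv` module instead of a plain `"\t".join(...)` keeps the output well-formed when a value contains a tab or a newline.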