pgzip
pgzip is a Python library that provides a multi-threaded implementation of the standard `gzip` module. It aims to be a drop-in replacement that speeds up compression and decompression of large files by processing blocks in parallel. It does this by storing block index information in the gzip `FEXTRA` header field, which keeps the output compatible with standard gzip tools. The library is actively maintained; the recent 0.4.0 release shows ongoing development and support for newer Python versions.
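To make the `FEXTRA` mechanism more concrete, the sketch below parses a gzip header as laid out in RFC 1952 and lists any extra subfields it finds. It uses only the standard library. The function name `read_gzip_extra_subfields` and the subfield ID `ZZ` in the hand-built header are illustrative assumptions; the subfield ID and payload layout that pgzip itself writes are not shown here.

```python
import gzip
import struct

FEXTRA = 0x04  # FLG bit 2 in RFC 1952: an extra field is present


def read_gzip_extra_subfields(raw: bytes) -> list[tuple[bytes, bytes]]:
    """Return (subfield_id, payload) pairs from the first gzip member header."""
    if len(raw) < 10 or raw[:2] != b"\x1f\x8b":
        raise ValueError("not a gzip stream")
    if not raw[3] & FEXTRA:
        return []
    # XLEN (2 bytes, little-endian) follows the fixed 10-byte header.
    (xlen,) = struct.unpack_from("<H", raw, 10)
    extra = raw[12:12 + xlen]
    subfields = []
    pos = 0
    while pos + 4 <= len(extra):
        # Each subfield: SI1, SI2, LEN (little-endian), then LEN bytes of data.
        si1, si2, length = struct.unpack_from("<BBH", extra, pos)
        subfields.append((bytes([si1, si2]), extra[pos + 4:pos + 4 + length]))
        pos += 4 + length
    return subfields


# The standard library writer sets no FEXTRA flag, so nothing is listed.
plain = gzip.compress(b"hello")
print(read_gzip_extra_subfields(plain))

# Hand-built header with one hypothetical subfield "ZZ" carrying 4 bytes.
header = (
    b"\x1f\x8b\x08" + bytes([FEXTRA]) + b"\x00" * 4 + b"\x00\xff"
    + struct.pack("<H", 8) + b"ZZ" + struct.pack("<H", 4) + b"\x01\x02\x03\x04"
)
print(read_gzip_extra_subfields(header))
```

Running the same function on the first bytes of a file written by pgzip should show whichever subfield it uses for its block index, while tools that ignore `FEXTRA` simply skip those bytes.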
Common errors
- AttributeError: '_io.BufferedReader' object has no attribute '_read_exact'
  cause This error was common in `pgzip` versions prior to 0.3.5 when used with Python 3.11, due to internal changes in Python's `gzip` module that `pgzip` relied upon.
  fix Upgrade `pgzip` to version 0.3.5 or newer: `pip install --upgrade pgzip`.
- RuntimeError: Python version 3.x.y is not supported by pgzip 0.4.0. Requires >=3.10
  cause Attempting to use `pgzip` version 0.4.0 or later with an unsupported Python version (e.g., Python 3.7, 3.8, or 3.9).
  fix Upgrade your Python environment to version 3.10 or newer. Alternatively, if upgrading Python is not feasible, downgrade `pgzip` to a compatible version, such as `pip install pgzip==0.3.5`.
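The two fixes above can be summarized as a small decision rule. The sketch below is only an illustration of that rule, built from the version numbers stated in this section; `pgzip_requirement` is a hypothetical helper, not part of pgzip.

```python
import sys


def pgzip_requirement(python_version: tuple[int, int]) -> str:
    """Suggest a pip requirement string for pgzip on a given Python version."""
    if python_version >= (3, 10):
        # 0.3.5 or newer avoids the '_read_exact' AttributeError on Python 3.11.
        return "pgzip>=0.3.5"
    # pgzip 0.4.0 requires Python >= 3.10, so pin the older release.
    return "pgzip==0.3.5"


# Print the suggestion for the interpreter running this script.
print(pgzip_requirement(sys.version_info[:2]))
```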
Warnings
- breaking pgzip v0.4.0 dropped support for Python 3.7, 3.8, and 3.9. It now officially supports Python versions 3.10 through 3.14.
- gotcha While `pgzip` is designed for performance, its parallel processing overhead can make it slower than the standard `gzip` module for files or data streams smaller than approximately 1MB.
- gotcha pgzip replaces only part of the standard `gzip` module: the `open()`, `compress()`, and `decompress()` functions and the `GzipFile` class. Other `gzip` features, such as `seek()` and `tell()` on the file object, might not be fully supported or tested.
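One way to act on the small-input warning is to route small payloads to the standard `gzip` module and larger ones to pgzip. The sketch below assumes a cutoff of about 1 MB, taken from the rough figure above; `SMALL_INPUT_BYTES`, `pick_compressor`, and `compress_auto` are hypothetical names, and the best threshold should be measured on real data. It falls back to `gzip` when pgzip is not installed.

```python
import gzip

try:
    import pgzip  # optional; the sketch still runs without it
except ImportError:
    pgzip = None

SMALL_INPUT_BYTES = 2**20  # about 1 MB; an assumed cutoff, tune by measuring


def pick_compressor(size: int):
    """Return the module to use for an input of `size` bytes."""
    if pgzip is None or size < SMALL_INPUT_BYTES:
        return gzip
    return pgzip


def compress_auto(data: bytes) -> bytes:
    """Compress with gzip for small inputs and pgzip for large ones."""
    return pick_compressor(len(data)).compress(data)


small = b"tiny payload"
blob = compress_auto(small)
print(pick_compressor(len(small)).__name__, len(blob))
```

Because both modules produce gzip-compatible output, callers can decompress the result without knowing which path was taken.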
Install
- pip install pgzip
Imports
- pgzip
import pgzip
- open
with pgzip.open('file.gz', 'wb') as f: ...
- compress
compressed_data = pgzip.compress(data)
- decompress
decompressed_data = pgzip.decompress(compressed_data)
Quickstart
import os
import tempfile

import pgzip

# Create some dummy data
original_data = b"This is a test string that will be compressed using pgzip. " * 1000

with tempfile.TemporaryDirectory() as tmpdir:
    filepath_gz = os.path.join(tmpdir, "test_data.txt.gz")

    # 1. Compress data to a file using 4 threads and 1MB blocks
    print(f"Compressing data to {filepath_gz}...")
    with pgzip.open(filepath_gz, "wb", thread=4, blocksize=2**20) as f_out:
        f_out.write(original_data)
    print(f"Compressed file size: {os.path.getsize(filepath_gz)} bytes")

    # 2. Decompress data from the file using 4 threads
    print(f"Decompressing data from {filepath_gz}...")
    with pgzip.open(filepath_gz, "rb", thread=4) as f_in:
        decompressed_data_file = f_in.read()
    assert original_data == decompressed_data_file
    print("File compression and decompression successful!")

# 3. In-memory compression and decompression using default threads
print("\nPerforming in-memory compression/decompression...")
compressed_bytes = pgzip.compress(original_data, compresslevel=6)
decompressed_bytes = pgzip.decompress(compressed_bytes)
assert original_data == decompressed_bytes
print(f"In-memory compression/decompression successful! Compressed size: {len(compressed_bytes)} bytes")
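The quickstart holds the whole payload in memory, which may not suit the large files pgzip is aimed at. As a hedged sketch, an existing file can instead be streamed through `pgzip.open()` in fixed-size chunks with `shutil.copyfileobj`. The helper name `compress_file`, the 1 MB chunk size, and the file names are assumptions for illustration. The block falls back to the standard `gzip` module, without parallelism, when pgzip is not installed.

```python
import gzip
import os
import shutil
import tempfile

try:
    import pgzip
    opener, extra = pgzip.open, {"thread": 4, "blocksize": 2**20}
except ImportError:
    # Fallback so the sketch runs without pgzip; no parallelism in this case.
    opener, extra = gzip.open, {}


def compress_file(src_path: str, dst_path: str) -> None:
    """Stream src_path into a gzip file at dst_path in fixed-size chunks."""
    with open(src_path, "rb") as src, opener(dst_path, "wb", **extra) as dst:
        shutil.copyfileobj(src, dst, length=2**20)


with tempfile.TemporaryDirectory() as tmpdir:
    src_path = os.path.join(tmpdir, "input.bin")
    dst_path = src_path + ".gz"
    payload = b"streamed line\n" * 200_000
    with open(src_path, "wb") as f:
        f.write(payload)
    compress_file(src_path, dst_path)
    # Read back with the standard library to check gzip compatibility.
    with gzip.open(dst_path, "rb") as f:
        restored = f.read()
    print(restored == payload)
```

Reading the result back with the standard `gzip` module doubles as a quick check of the compatibility claim made in the introduction.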