Indexed Gzip
The `indexed-gzip` project is a Python extension providing fast random access to gzip files by building an index of seek points. It acts as a drop-in replacement for Python's built-in `gzip.GzipFile` class, significantly improving performance for `seek` operations on large gzipped files. It is currently at version 1.10.3 and is actively maintained with regular releases.
Common errors
- IOError: No write support for IndexedGzipFile
  cause: Attempting to open an `IndexedGzipFile` in write mode ('w', 'wb', 'a', etc.) or calling write methods on it.
  fix: `IndexedGzipFile` is strictly read-only. To write gzipped files, use Python's built-in `gzip` module (e.g., `gzip.GzipFile('file.gz', 'wb')`) or another appropriate library. Once written, the file can be opened with `indexed-gzip` for efficient random access.
- TypeError: argument of type 'Path' is not iterable
  cause: Passing a `pathlib.Path` object directly as the `filename` argument to `IndexedGzipFile` in an older version (<1.10.2) that did not explicitly support `pathlib.Path`.
  fix: Upgrade `indexed-gzip` to version 1.10.2 or later. Alternatively, convert the `pathlib.Path` object to a string before passing it: `igzip.IndexedGzipFile(str(my_path_obj))`.
- Extremely slow seek() operations when using `gzip.GzipFile` on large files.
  cause: The standard `gzip.GzipFile` class must decompress from the beginning of the file up to the desired seek point, which makes random access inefficient, especially for large files.
  fix: Replace `gzip.GzipFile` with `indexed_gzip.IndexedGzipFile`, which builds an internal index of seek points so that random `seek` operations are much faster. Import the module with `import indexed_gzip as igzip` and use `igzip.IndexedGzipFile` in place of `gzip.GzipFile`.
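The fixes above can be combined into one small pattern: write with the standard `gzip` module, then read back through `IndexedGzipFile`, converting any `pathlib.Path` to a string first. The following is a minimal sketch, not the library's recommended recipe. The helper name `open_for_random_read` and the file name `example.gz` are hypothetical, and the fallback to `gzip.GzipFile` is only there so the sketch still runs when `indexed-gzip` is not installed.

```python
import gzip
import os
import tempfile
from pathlib import Path

try:
    import indexed_gzip as igzip
except ImportError:
    # Assumption: fall back to the standard library if the extension is missing.
    igzip = None


def open_for_random_read(path):
    """Hypothetical helper: open a gzip file read-only for random access."""
    # str() avoids the Path-related TypeError on versions older than 1.10.2.
    name = str(path)
    if igzip is not None:
        return igzip.IndexedGzipFile(name)
    return gzip.GzipFile(name, 'rb')


payload = b"0123456789" * 1000
tmp_dir = tempfile.mkdtemp()
gz_path = Path(tmp_dir) / "example.gz"

# Writing goes through the standard library, since IndexedGzipFile is read-only.
with gzip.GzipFile(str(gz_path), 'wb') as writer:
    writer.write(payload)

# Reading uses the indexed reader when available.
with open_for_random_read(gz_path) as reader:
    reader.seek(5005)
    chunk = reader.read(10)

print(chunk)

# Clean up the temporary file and directory.
os.remove(gz_path)
os.rmdir(tmp_dir)
```

Because both classes expose the same `seek` and `read` calls, the calling code does not need to know which reader it received.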
Warnings
- breaking The `IndexedGzipFile` class currently does not support writing data. It is a read-only interface. Attempting to open in write mode or call write methods will result in an error.
- gotcha The `spacing` parameter passed when constructing an `IndexedGzipFile` controls how far apart the index seek points are placed. A smaller `spacing` means more seek points, which improves seek performance but increases the memory used by the index; a larger `spacing` does the opposite. The default is 4 MiB (4194304 bytes).
- deprecated Versions prior to 1.10.2 did not explicitly support passing `pathlib.Path` objects as the filename argument to `IndexedGzipFile`, which could lead to errors or unexpected behavior.
- gotcha Versions prior to 1.10.0 contained a bug that could be triggered when CRC validation was disabled, particularly on GZIP streams whose footer contained bytes matching the GZIP magic bytes `0x1f 0x8b`.
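To make the `spacing` trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes that `spacing` is measured in uncompressed bytes and that each seek point keeps about one 32 KiB window of data; both are simplifying assumptions, and the function name `estimate_index_size` is hypothetical. Treat the numbers as an order-of-magnitude guide rather than what the library actually allocates.

```python
def estimate_index_size(uncompressed_bytes, spacing, window_size=32 * 1024):
    """Rough estimate of (seek point count, index bytes).

    Assumption: one seek point per `spacing` uncompressed bytes, each
    holding roughly `window_size` bytes of window data.
    """
    points = max(1, -(-uncompressed_bytes // spacing))  # ceiling division
    return points, points * window_size


one_gib = 1024 ** 3
mib = 1024 ** 2

# Compare a dense index (1 MiB spacing) with a sparse one (4 MiB spacing).
dense_points, dense_bytes = estimate_index_size(one_gib, 1 * mib)
sparse_points, sparse_bytes = estimate_index_size(one_gib, 4 * mib)

print(f"1 MiB spacing: {dense_points} points, ~{dense_bytes // mib} MiB")
print(f"4 MiB spacing: {sparse_points} points, ~{sparse_bytes // mib} MiB")
```

The spacing itself is set with the `spacing` keyword, for example `IndexedGzipFile('file.gz', spacing=1048576)`.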
Install
- pip install indexed-gzip
Imports
- IndexedGzipFile
from indexed_gzip import IndexedGzipFile
Quickstart
import gzip
import os

import indexed_gzip as igzip

# Create a dummy gzip file for demonstration
dummy_data = b"This is some sample data for a gzipped file.\nRepeat this line many times to make it bigger.\n" * 10000
with gzip.open('test_file.gz', 'wb') as g:
    g.write(dummy_data)

try:
    # Open the gzip file through the indexed reader
    with igzip.IndexedGzipFile('test_file.gz') as fobj:
        print(f"Uncompressed size: {len(dummy_data)} bytes")

        # Seek to an arbitrary position in the uncompressed stream
        fobj.seek(15000)
        data = fobj.read(100)
        print(f"Read 100 bytes from offset 15000: {data.decode(errors='ignore')[:50]}...")

        # Seek backwards to another position
        fobj.seek(5000)
        data = fobj.read(50)
        print(f"Read 50 bytes from offset 5000: {data.decode(errors='ignore')[:50]}...")

        # Build a full index explicitly (optional, often done on demand)
        fobj.build_full_index()
        print("Full index built.")
finally:
    # Clean up the dummy file
    if os.path.exists('test_file.gz'):
        os.remove('test_file.gz')