Embedding Reader
`embedding-reader` is a Python library for efficiently reading embeddings from several file formats: HDF5, Parquet, JSON, TSV, and NumPy (with optional memory mapping via mmap). It can also download embeddings from Hugging Face datasets. The library focuses on performance and ease of use with large-scale embedding datasets. The current version is 1.8.1, and releases are regular, often monthly or bi-monthly.
Common errors
- ModuleNotFoundError: No module named 'embedding_reader'
  cause: The `embedding-reader` package is either not installed or installed in a different environment.
  fix: Ensure the package is installed in your current Python environment using `pip install embedding-reader`.
- FileNotFoundError: [Errno 2] No such file or directory: 'your/path/to/embeddings.h5'
  cause: The path provided to `EmbeddingReader` does not point to an existing file or directory.
  fix: Double-check the file path. Ensure it is absolute or relative to your current working directory. If it is a remote path, confirm network access and the correct URI scheme (e.g., `s3://`, `hf://`).
- ValueError: Unknown format for file: 'embeddings.dat'
  cause: The library could not infer the file format from the extension, or the extension is not supported.
  fix: Rename the file with a supported extension (e.g., `.h5`, `.parquet`, `.npy`) or explicitly specify the `file_format` parameter during `EmbeddingReader` initialization (e.g., `EmbeddingReader(path='embeddings.dat', file_format='hdf5')`).
- KeyError: 'embeddings' (when reading HDF5/Parquet)
  cause: The specified `embedding_column_name` or `id_column_name` (defaults are 'embedding' and 'id') does not exist in the source file.
  fix: Inspect your HDF5/Parquet file to confirm the actual column names. Then pass the correct names to the `EmbeddingReader` constructor: `EmbeddingReader(..., embedding_column_name='my_embeddings_key', id_column_name='doc_id')`.
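The first three errors above can often be caught before constructing a reader. The following is a minimal sketch using only the Python standard library; `preflight` and `KNOWN_EXTENSIONS` are hypothetical names introduced for illustration, and the extension set is an assumption copied from the examples on this page rather than from the library's source.

```python
import importlib.util
import sys
from pathlib import Path

# Assumption: these extensions come from the examples on this page,
# not from the library's actual list of supported formats.
KNOWN_EXTENSIONS = {".h5", ".parquet", ".npy"}


def preflight(path, module_name="embedding_reader"):
    """Return a list of human-readable problems; an empty list means no issue was found."""
    problems = []

    # ModuleNotFoundError: is the package importable from *this* interpreter?
    if importlib.util.find_spec(module_name) is None:
        problems.append(f"{module_name} is not importable from {sys.executable}")

    # FileNotFoundError: only local paths can be checked on disk.
    is_remote = "://" in str(path)
    local_path = Path(path)
    if not is_remote and not local_path.exists():
        problems.append(f"{local_path.resolve()} does not exist")

    # ValueError (unknown format): flag extensions outside the assumed set.
    if local_path.suffix.lower() not in KNOWN_EXTENSIONS:
        problems.append(f"extension {local_path.suffix!r} may need an explicit file_format")

    return problems


if __name__ == "__main__":
    for problem in preflight("your/path/to/embeddings.h5"):
        print("warning:", problem)
```

Printing `sys.executable` in the first message makes it easier to spot the case where the package was installed into a different environment than the one running the script.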
Warnings
- gotcha: When reading from Hugging Face, use the `hf://` prefix (e.g., `hf://organisation/dataset_name`) and provide the `dataset_name` and `subfolder` parameters if the dataset is not at the root or has multiple configurations.
- breaking: Since version 1.7.0, the `dataset_name` and `subfolder` parameters are available for better Hugging Face integration. Code relying on implicit path resolution for HF datasets might need adjustment.
- gotcha: The library infers the file format from the file extension. If your file has a non-standard extension or no extension, you might need to explicitly pass the `file_format` parameter to `EmbeddingReader`.
- gotcha: For very large NumPy (`.npy`) embedding files, enabling `mmap_mode='r'` in the `EmbeddingReader` constructor can significantly reduce memory usage by memory-mapping the file instead of loading it entirely into RAM.
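To illustrate what memory mapping does for `.npy` files, here is a small sketch that uses NumPy directly (`numpy.load` with `mmap_mode="r"`) rather than the `EmbeddingReader` API. The function name `demo_mmap` and the array sizes are made up for the example.

```python
import os
import tempfile

import numpy as np


def demo_mmap(rows=1000, dim=64):
    """Write a small .npy file, reopen it memory-mapped, and read one batch."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "embeddings.npy")
        np.save(path, np.random.rand(rows, dim).astype(np.float32))

        # mmap_mode="r" returns a read-only np.memmap; data is paged in on access
        # instead of being loaded into RAM up front.
        mapped = np.load(path, mmap_mode="r")
        shape = mapped.shape
        is_memmap = isinstance(mapped, np.memmap)

        # Copy only the slice that is needed into a regular in-memory array.
        batch = np.array(mapped[:10])

        # Release the mapping before the temporary directory is removed.
        del mapped

    return shape, is_memmap, batch.shape


if __name__ == "__main__":
    print(demo_mmap())
```

Slicing the mapped array and copying only that slice keeps peak memory proportional to the batch size rather than to the file size.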
Install
- pip install embedding-reader
Imports
- EmbeddingReader
from embedding_reader.embedding_reader import EmbeddingReader
from embedding_reader import EmbeddingReader
Quickstart
import os

import h5py
import numpy as np
from embedding_reader import EmbeddingReader

# Create a dummy HDF5 file for demonstration.
# In a real scenario, you would point to an existing file or HF dataset.
dummy_h5_path = "./dummy_embeddings.h5"
if not os.path.exists(dummy_h5_path):
    with h5py.File(dummy_h5_path, 'w') as f:
        f.create_dataset('embeddings', data=np.random.rand(10, 5))
        # h5py cannot store NumPy unicode ('<U') arrays, so write the IDs as byte strings.
        f.create_dataset('ids', data=np.arange(10).astype('S'))

# Initialize the EmbeddingReader.
# For Hugging Face datasets, use path='hf://organization/dataset_name'.
reader = EmbeddingReader(
    path=dummy_h5_path,
    embedding_column_name='embeddings',
    id_column_name='ids'
)

# Read embeddings and IDs.
embeddings = reader.read_embeddings()
ids = reader.read_ids()
print(f"Read {len(embeddings)} embeddings of dimension {embeddings.shape[1]}")
print(f"First 3 IDs: {ids[:3]}")

# Clean up the dummy file.
os.remove(dummy_h5_path)