Embedding Reader

1.8.1 · active · verified Fri Apr 17

embedding-reader is a Python library for efficiently reading embeddings from several file formats, including HDF5, Parquet, JSON, TSV, and NumPy files with memory mapping (mmap). It can also download embeddings from Hugging Face datasets. The library targets performance and ease of use on large-scale embedding datasets. The current version is 1.8.1, and releases are frequent, often monthly or bi-monthly.

Quickstart

This quickstart initializes an `EmbeddingReader` and reads embeddings and their corresponding IDs from a file. It first creates a small dummy HDF5 file so that the example is runnable. For actual use, replace `dummy_h5_path` with the path to your own file or with a Hugging Face dataset identifier of the form `hf://organization/dataset_name`. Pass `embedding_column_name` and `id_column_name` when your column names differ from the defaults (`embedding` and `id` respectively); the dummy file here uses `embeddings` and `ids`, so both are set explicitly.

import numpy as np
import os
from embedding_reader import EmbeddingReader

# Create a dummy HDF5 file for demonstration
# In a real scenario, you would point to an existing file or HF dataset
dummy_h5_path = "./dummy_embeddings.h5"
if not os.path.exists(dummy_h5_path):
    import h5py
    with h5py.File(dummy_h5_path, 'w') as f:
        f.create_dataset('embeddings', data=np.random.rand(10, 5))
        # h5py cannot store NumPy unicode ('<U') arrays, so save the IDs as byte strings
        f.create_dataset('ids', data=np.arange(10).astype('S'))

# Initialize the EmbeddingReader
# For Hugging Face datasets, use path='hf://organization/dataset_name'
reader = EmbeddingReader(
    path=dummy_h5_path, 
    embedding_column_name='embeddings', 
    id_column_name='ids'
)

# Read embeddings and IDs
embeddings = reader.read_embeddings()
ids = reader.read_ids()

print(f"Read {len(embeddings)} embeddings of dimension {embeddings.shape[1]}")
print(f"First 3 IDs: {ids[:3]}")

# Clean up dummy file
os.remove(dummy_h5_path)
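
The overview mentions reading NumPy files with memory mapping. As a rough illustration of what that means, the sketch below uses plain NumPy only; it does not call embedding-reader, and the file name, array size, and the `iter_batches` helper are hypothetical choices made for this example. It writes a small `.npy` file, maps it read-only, and copies one batch of rows into memory at a time.

```python
import os
import tempfile

import numpy as np

# Write a small .npy file to stand in for a large embedding file (hypothetical data).
tmp_dir = tempfile.mkdtemp()
npy_path = os.path.join(tmp_dir, "embeddings.npy")
rng = np.random.default_rng(0)
np.save(npy_path, rng.random((1000, 8), dtype=np.float32))

# mmap_mode="r" maps the file read-only instead of loading all of it into RAM.
mapped = np.load(npy_path, mmap_mode="r")


def iter_batches(array, batch_size):
    """Yield consecutive row slices, copying each slice into memory."""
    for start in range(0, array.shape[0], batch_size):
        # np.array(...) makes an in-memory copy of just this slice.
        yield np.array(array[start:start + batch_size])


batch_shapes = [batch.shape for batch in iter_batches(mapped, batch_size=256)]
print(batch_shapes)  # [(256, 8), (256, 8), (256, 8), (232, 8)]

# Release the mapping before deleting the file (required on some platforms).
del mapped
os.remove(npy_path)
os.rmdir(tmp_dir)
```

Only the rows of the current batch are resident in memory at any time, which is why memory mapping suits embedding files larger than available RAM.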
