{"id":9722,"library":"embedding-reader","title":"Embedding Reader","description":"embedding-reader is a Python library designed for efficiently reading embeddings from various file formats, including HDF5, Parquet, JSON, TSV, and NumPy with memory mapping (mmap). It also provides functionality to download embeddings from Hugging Face datasets. The library focuses on performance and ease of use for large-scale embedding datasets. The current version is 1.8.1, and it maintains a regular release cadence, often with monthly or bi-monthly updates.","status":"active","version":"1.8.1","language":"en","source_language":"en","source_url":"https://github.com/rom1504/embedding-reader","tags":["embeddings","data-loading","vector-databases","ai","machine-learning","hdf5","parquet","numpy"],"install":[{"cmd":"pip install embedding-reader","lang":"bash","label":"Install core library"}],"dependencies":[{"reason":"Core for numerical operations and array handling.","package":"numpy"},{"reason":"Required for reading HDF5 format embedding files.","package":"h5py"},{"reason":"Used for reading Parquet, JSON, and TSV formats, particularly for structured data.","package":"pandas"},{"reason":"Underpins Parquet file reading, often used with Pandas.","package":"pyarrow"},{"reason":"Abstracts away filesystem details, enabling reading from local, S3, GCS, HF Hub, etc.","package":"fsspec"},{"reason":"Provides progress bars for operations, especially during large file reads or downloads.","package":"tqdm"}],"imports":[{"note":"The main class is directly importable from the top-level package.","wrong":"from embedding_reader.embedding_reader import EmbeddingReader","symbol":"EmbeddingReader","correct":"from embedding_reader import EmbeddingReader"}],"quickstart":{"code":"import numpy as np\nimport os\nfrom embedding_reader import EmbeddingReader\n\n# Create a dummy HDF5 file for demonstration\n# In a real scenario, you would point to an existing file or HF dataset\ndummy_h5_path = 
\"./dummy_embeddings.h5\"\nif not os.path.exists(dummy_h5_path):\n    import h5py\n    with h5py.File(dummy_h5_path, 'w') as f:\n        f.create_dataset('embeddings', data=np.random.rand(10, 5))\n        # h5py cannot store NumPy unicode arrays, so store IDs as byte strings\n        f.create_dataset('ids', data=np.arange(10).astype('S'))\n\n# Initialize the EmbeddingReader\n# For Hugging Face datasets, use path='hf://organization/dataset_name'\nreader = EmbeddingReader(\n    path=dummy_h5_path,\n    embedding_column_name='embeddings',\n    id_column_name='ids'\n)\n\n# Read embeddings and IDs\nembeddings = reader.read_embeddings()\nids = reader.read_ids()\n\nprint(f\"Read {len(embeddings)} embeddings of dimension {embeddings.shape[1]}\")\nprint(f\"First 3 IDs: {ids[:3]}\")\n\n# Clean up dummy file\nos.remove(dummy_h5_path)","lang":"python","description":"This quickstart demonstrates how to initialize the `EmbeddingReader` and read embeddings and their corresponding IDs from a file. It first creates a dummy HDF5 file so that the example is runnable. For actual use, replace `dummy_h5_path` with your local file path or a Hugging Face dataset identifier (e.g., `hf://facebook/dolly-v2-7b-embeddings`). Remember to specify `embedding_column_name` and `id_column_name` if they differ from the defaults (`embedding` and `id` respectively)."},"warnings":[{"fix":"Always prepend `hf://` for Hugging Face paths. 
For multi-config datasets, specify `dataset_name` and `subfolder` during `EmbeddingReader` initialization, or ensure `path` points to a specific config.","message":"When reading from Hugging Face, use the `hf://` prefix (e.g., `hf://organization/dataset_name`) and set the `dataset_name` and `subfolder` parameters if the dataset is not at the repository root or has multiple configurations.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Explicitly pass the `dataset_name` and `subfolder` parameters to the `EmbeddingReader` constructor when dealing with Hugging Face datasets, especially those with multiple configurations or nested structures.","message":"Versions 1.7.0 and later introduce the `dataset_name` and `subfolder` parameters for better Hugging Face integration. Code relying on implicit path resolution for HF datasets might need adjustment.","severity":"breaking","affected_versions":">=1.7.0"},{"fix":"If automatic format detection fails, pass `file_format='hdf5'`, `file_format='parquet'`, etc., to the `EmbeddingReader` constructor.","message":"The library infers the file format from the file extension. 
If your file has a non-standard extension or no extension, you might need to explicitly specify the `file_format` parameter to `EmbeddingReader`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"When dealing with large .npy files, initialize `EmbeddingReader` with `mmap_mode='r'` (e.g., `EmbeddingReader(path='file.npy', mmap_mode='r')`).","message":"For very large NumPy (`.npy`) embedding files, enabling `mmap_mode='r'` in the `EmbeddingReader` constructor can significantly reduce memory usage by memory-mapping the file instead of loading it entirely into RAM.","severity":"gotcha","affected_versions":">=1.6.0"}],"env_vars":null,"last_verified":"2026-04-17T00:00:00.000Z","next_check":"2026-07-16T00:00:00.000Z","problems":[{"fix":"Ensure the package is installed in your current Python environment using `pip install embedding-reader`.","cause":"The `embedding-reader` package is either not installed or installed in a different environment.","error":"ModuleNotFoundError: No module named 'embedding_reader'"},{"fix":"Double-check the file path. Ensure it's absolute or relative to your current working directory. If it's a remote path, confirm network access and correct URI (e.g., `s3://`, `hf://`).","cause":"The path provided to `EmbeddingReader` does not point to an existing file or directory.","error":"FileNotFoundError: [Errno 2] No such file or directory: 'your/path/to/embeddings.h5'"},{"fix":"Rename the file with a supported extension (e.g., `.h5`, `.parquet`, `.npy`) or explicitly specify the `file_format` parameter during `EmbeddingReader` initialization (e.g., `EmbeddingReader(path='embeddings.dat', file_format='hdf5')`).","cause":"The library could not infer the file format from the extension, or the extension is not supported.","error":"ValueError: Unknown format for file: 'embeddings.dat'"},{"fix":"Inspect your HDF5/Parquet file to confirm the actual column names. 
Then, pass the correct names to the `EmbeddingReader` constructor: `EmbeddingReader(..., embedding_column_name='my_embeddings_key', id_column_name='doc_id')`.","cause":"The specified `embedding_column_name` or `id_column_name` (defaults are 'embedding' and 'id') does not exist in the source file.","error":"KeyError: 'embeddings' (when reading HDF5/Parquet)"}]}