Annoy (Approximate Nearest Neighbors)
Annoy (Approximate Nearest Neighbors Oh Yeah) is a C++ library with Python bindings designed for efficient similarity search in high-dimensional spaces. It's optimized for memory usage and can create large, read-only, file-based data structures that are memory-mapped, enabling multiple processes to share the same index. The library is actively maintained by Spotify with frequent minor releases.
Warnings
- gotcha Once the `build()` method is called on an `AnnoyIndex` instance, no more items can be added to that index. Annoy is designed for static, read-only indexes after creation. If you need a mutable index, consider rebuilding or using an alternative library.
- gotcha Item IDs must be non-negative integers. Annoy allocates memory for `max(id)+1` items, assuming dense integer IDs from 0 to N-1. Using sparse or very large IDs can lead to excessive memory allocation or unexpected behavior.
- gotcha The `n_trees` parameter (during build) affects build time and index size; higher values give better accuracy but larger indexes. The `search_k` parameter (during search) affects search time; higher values give better accuracy but longer search times. You must tune these parameters for your specific accuracy and performance needs.
- deprecated Older versions (prior to 1.17.2) were known to have memory leaks, especially during index building or repeated operations.
- breaking Version 1.16.1 introduced stricter checks, preventing saving an index that hasn't been built or building an index that has already been built.
- gotcha Compilation issues have occurred on specific platforms, such as OS X (fixed in 1.17.3) and certain GCC versions with AVX instructions (fixed in 1.16.1). These can prevent successful installation or lead to runtime errors.
Install
-
pip install annoy
Imports
- AnnoyIndex
from annoy import AnnoyIndex
Quickstart
import os
from annoy import AnnoyIndex
import random
f = 40 # Length of item vector that will be indexed
t = AnnoyIndex(f, 'euclidean') # or 'angular', 'manhattan', 'hamming', 'dot'
# Add items to the index
for i in range(1000):
v = [random.gauss(0, 1) for _ in range(f)]
t.add_item(i, v)
# Build the index with n_trees trees. n_jobs=-1 uses all CPU cores.
t.build(10, n_jobs=-1)
# Save and load the index
index_path = 'test.ann'
t.save(index_path)
u = AnnoyIndex(f, 'euclidean')
u.load(index_path) # super fast, will just mmap the file
# Query for nearest neighbors
query_item_id = 0
k = 10 # Number of neighbors to retrieve
nearest_neighbors = u.get_nns_by_item(query_item_id, k)
print(f"Nearest neighbors for item {query_item_id}: {nearest_neighbors}")
query_vector = [random.gauss(0, 1) for _ in range(f)]
nearest_neighbors_by_vector = u.get_nns_by_vector(query_vector, k)
print(f"Nearest neighbors for a random vector: {nearest_neighbors_by_vector}")
# Clean up the created index file
if os.path.exists(index_path):
os.remove(index_path)