Tantivy Python Bindings
Tantivy-py provides official Python bindings for Tantivy, a high-performance full-text search engine library written in Rust and inspired by Apache Lucene. It offers fast indexing and search. The current version is 0.25.1, and the project is actively developed, with minor releases typically a few months apart.
Warnings
- gotcha To install `tantivy` from source (if no pre-compiled wheel is available for your system), you must have Rust installed and configured. This is a common requirement for Python libraries with Rust bindings.
- breaking Version 0.25.0 introduced a breaking API change by removing index sorting. Users relying on this feature will need to adjust their indexing and search strategies.
- gotcha Tantivy treats document data as immutable. To 'edit' a document, you must delete the existing document (by its `DocAddress` or a specific term query) and then reindex the updated version.
- gotcha Only one `IndexWriter` can be active at a time for a given index. While the `IndexWriter` itself is multithreaded, concurrent attempts to create multiple writers will fail.
- gotcha Search operations return a result object whose `hits` attribute is a list of `(score, DocAddress)` tuples, not the documents themselves. To retrieve the actual document content, pass each `DocAddress` to the `Searcher`'s `doc()` method.
- gotcha For incremental indexing and efficient document deletion, the field used to identify documents for deletion (e.g., a unique ID) must be an integer field, set to `indexed=True` and `fast=True` in the schema.
Install
pip install tantivy
Imports
- tantivy
import tantivy
- SchemaBuilder
from tantivy import SchemaBuilder
Quickstart
import tantivy
# 1. Declare the schema
schema_builder = tantivy.SchemaBuilder()
schema_builder.add_text_field("title", stored=True, tokenizer_name="default")
schema_builder.add_text_field("body", stored=True, tokenizer_name="default")
schema_builder.add_integer_field("doc_id", stored=True, indexed=True)
schema = schema_builder.build()
# 2. Create an in-memory index (for persistent, specify a path)
# To use a persistent index, use: index = tantivy.Index(schema, path="/tmp/my_index")
index = tantivy.Index(schema)
# 3. Get an index writer and add documents
writer = index.writer(50_000_000) # 50MB memory arena
writer.add_document(tantivy.Document(title=["The Old Man and the Sea"], body=["He was an old man who fished alone in a skiff."], doc_id=[1]))
writer.add_document(tantivy.Document(title=["The Great Gatsby"], body=["In my younger and more vulnerable years my father gave me some advice."], doc_id=[2]))
writer.commit()
# 4. Reload the index so the latest commit is visible, then get a searcher
index.reload()
searcher = index.searcher()
# 5. Build and execute a query (parsed against the default fields "title" and "body")
query = index.parse_query("old man", ["title", "body"])
hits = searcher.search(query, 10).hits  # list of (score, DocAddress) tuples
# 6. Retrieve documents
print("Search results:")
for score, doc_address in hits:
    retrieved_doc = searcher.doc(doc_address)
    print(f" Score: {score:.2f}, Doc ID: {retrieved_doc['doc_id'][0]}, Title: {retrieved_doc['title'][0]}")
# Example of retrieving a non-existent field (will be an empty list)
missing_field = retrieved_doc.get_all('non_existent_field')
print(f" Non-existent field for last doc: {missing_field}") # Expected: []