Whoosh
Whoosh is a fast, pure-Python library for full-text indexing, searching, and spell checking. It allows developers to add search functionality to applications and websites without external compilers or binary dependencies. The library is highly customizable and currently stable at version 2.7.4, maintained by the whoosh-community.
Warnings
- gotcha When adding documents, ensure text fields are passed as Unicode strings (e.g., `u"my text"` in Python 2 or regular strings in Python 3). Non-text fields that are stored but not indexed (STORED type) can be any pickle-able object.
- gotcha Whoosh does not inherently enforce uniqueness for documents. Calling `add_document` multiple times with identical data will result in multiple duplicate documents in the index. Use `update_document` with a `unique=True` field in your schema to overwrite existing documents.
- gotcha The `whoosh.index.create_in()` function requires the directory to exist before it's called. If the directory does not exist, a `FileNotFoundError` will occur.
- deprecated Direct manipulation of index files or relying on undocumented internal structures can lead to issues with future updates. Always use the public API for index management. Some older examples might show direct `FileStorage` usage without `index.create_in` or `index.open_dir` convenience functions.
Install
pip install whoosh
Imports
- create_in
from whoosh.index import create_in
- Schema
from whoosh.fields import Schema, TEXT, ID, STORED
- QueryParser
from whoosh.qparser import QueryParser
- index
from whoosh import index
Quickstart
import os
from whoosh.index import create_in, open_dir
from whoosh.fields import Schema, TEXT, ID
from whoosh.qparser import QueryParser
# 1. Define schema
schema = Schema(title=TEXT(stored=True), path=ID(stored=True), content=TEXT)
# 2. Create or open index directory
indexdir = "indexdir"
if not os.path.exists(indexdir):
    os.mkdir(indexdir)
    ix = create_in(indexdir, schema)
else:
    ix = open_dir(indexdir)
# 3. Add documents
writer = ix.writer()
writer.add_document(title=u"First document", path=u"/a",
                    content=u"This is the first document we've added!")
writer.add_document(title=u"Second document", path=u"/b",
                    content=u"The second one is even more interesting!")
writer.commit()
# 4. Search documents
with ix.searcher() as searcher:
    query_parser = QueryParser("content", ix.schema)
    query = query_parser.parse("first")
    results = searcher.search(query)
    for hit in results:
        print(f"Found: {hit['title']} at {hit['path']}")
# Clean up (optional: remove the index directory)
# import shutil
# shutil.rmtree(indexdir)