N-gram Fuzzy Search
The `ngram` library provides a `set` subclass for efficient fuzzy searching of members based on N-gram string similarity. It extends Python's built-in `set` class and offers static methods to compare string pairs. The N-grams are character-based, not word-based, focusing on string similarity rather than language modeling. The library is actively maintained, with the current version being 4.0.3, and updates are released as needed.
Common errors
- ModuleNotFoundError: No module named 'ngram'
  cause The `ngram` package is not installed in the current Python environment.
  fix Run `pip install ngram` to install the library.
- AttributeError: module 'ngram' has no attribute 'NGram'
  cause Your own script (or another file on the import path) is named `ngram.py`, so `import ngram` loads that file instead of the installed library, and that file does not define the `NGram` class.
  fix Rename the script to something other than `ngram.py` (e.g., `my_ngram_app.py`) and run it again.
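To tell the two errors above apart, it can help to check which file Python would actually load for `ngram`. The following is a minimal stdlib-only sketch; the helper name `module_origin` is hypothetical and not part of the `ngram` library. A path inside your own project rather than your environment's package directory suggests a local `ngram.py` is shadowing the installed package.

```python
import importlib.util


def module_origin(name):
    """Return the file path Python would load for `name`, or None if it cannot be found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # find_spec can raise if a parent package is missing or the module has no spec
        return None
    return spec.origin if spec is not None else None


origin = module_origin("ngram")
if origin is None:
    print("ngram was not found in this environment: run `pip install ngram`")
else:
    print(f"`import ngram` would load: {origin}")
```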
Warnings
- gotcha The `ngram` library is designed for character-based N-grams by default, not word-based. This means it splits strings into sequences of characters, not words. If you require word N-grams for natural language processing tasks, you will need to pre-process your text or use a different library (e.g., NLTK).
- gotcha `NGram` accepts a `key` function that converts items to strings (e.g., `NGram(items, key=str)`). Passing an anonymous (lambda) function, such as `key=lambda x: x.name`, prevents the resulting `NGram` object from being pickled (serialized). Use a builtin or a named function if the object needs to be pickled.
- gotcha In Python 2, `NGram` could behave unexpectedly with non-ASCII byte-strings because they were split on byte boundaries. In Python 3, pass Unicode `str` values to `NGram` (decode any `bytes` first) so that multi-byte characters are handled as whole characters.
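The pickling gotcha above comes from a restriction in Python's own `pickle` module, which can be checked without `ngram` at all. The sketch below tests three candidate `key` functions; the helper `can_pickle` is hypothetical. That `operator.attrgetter("name")` is a usable drop-in for `lambda x: x.name` as an `NGram` key is an assumption here, not something this snippet verifies against the library.

```python
import pickle
from operator import attrgetter

# Candidate key functions: a builtin, a picklable stdlib callable, and a lambda.
candidates = {
    "str": str,
    "attrgetter('name')": attrgetter("name"),
    "lambda x: x.name": lambda x: x.name,
}


def can_pickle(obj):
    """Return True if pickle can serialize obj, False otherwise."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


picklable = {label: can_pickle(func) for label, func in candidates.items()}
for label, ok in picklable.items():
    print(f"{label}: {'picklable' if ok else 'not picklable'}")
```

Only the lambda fails, because `pickle` stores functions by importable name and a lambda has none.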
Install
- pip install ngram
Imports
- NGram
import ngram; ngram.NGram()
from ngram import NGram
Quickstart
from ngram import NGram
# Initialize an NGram object with a list of items
# N (default 3) is the size of n-grams to use for comparison
fuzzy_set = NGram(N=2, items=['apple', 'apricot', 'banana', 'orange', 'grape'])
# Add more items to the set
fuzzy_set.add('apply')
# Search for items similar to a query string
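# threshold is the minimum similarity score a match must reach; if omitted,
# the threshold given to the constructor (default 0.0) is used
results = fuzzy_set.search('appl', threshold=0.5)
print(f"Searching for 'appl': {results}")
# Returns a list of (item, similarity) tuples, sorted by decreasing similarity.
# With N=2, 'apple' and 'apply' should each score about 0.57 against 'appl',
# so a threshold of 0.7 would filter them out; exact scores depend on N and padding.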
# Directly compare two strings
similarity = NGram.compare('apple', 'apply', N=2)
print(f"Similarity between 'apple' and 'apply': {similarity}")
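The scores in the quickstart can be approximated by hand. The sketch below is a plain-Python illustration, not the library's implementation. It assumes each string is padded with N-1 `$` characters on both ends, split into overlapping character n-grams, and scored as shared n-grams divided by all distinct n-grams (the unwarped measure). The function names `char_ngrams` and `similarity` are hypothetical, and the set-based count ignores repeated n-grams, so results may differ from the library for strings with repeats.

```python
def char_ngrams(text, n=3, pad_char="$"):
    """Split text into overlapping character n-grams after padding both ends."""
    padding = pad_char * (n - 1)
    padded = padding + text + padding
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def similarity(a, b, n=3):
    """Shared n-grams divided by all distinct n-grams (set-based approximation)."""
    grams_a, grams_b = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    union = grams_a | grams_b
    return len(grams_a & grams_b) / len(union) if union else 1.0


print(char_ngrams("apple", n=2))                     # ['$a', 'ap', 'pp', 'pl', 'le', 'e$']
print(round(similarity("appl", "apple", n=2), 3))    # 0.571 (4 shared of 7 distinct)
print(round(similarity("apple", "apply", n=2), 3))   # 0.5   (4 shared of 8 distinct)
```

Because the n-grams are built from characters, a multi-word string is treated as one character sequence, spaces included. Word-level n-grams would need the text tokenized first.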