fuzzyset2: Fuzzy String Matching
fuzzyset2 is a Python library that provides a data structure for fuzzy string matching, similar in spirit to full-text search. It finds likely misspellings and approximate matches by breaking strings into n-grams, storing them in a reverse index, and ranking candidates by cosine similarity. It is a maintained fork of the original `fuzzyset` package that addresses that package's installation and maintenance issues. The current version is 0.2.5, and recent releases suggest it is actively maintained.
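To make the n-gram, reverse index, and cosine similarity idea concrete, here is a minimal standard-library sketch. It is not fuzzyset2's implementation: the helper names (`ngrams`, `build_index`, `cosine`, `lookup`) and the `-` padding around each string are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def ngrams(text, n):
    """Return the overlapping character n-grams of text (padded with '-', an assumption)."""
    padded = f"-{text.lower()}-"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_index(words, n):
    """Reverse index: map each n-gram to the set of words containing it."""
    index = defaultdict(set)
    for word in words:
        for gram in ngrams(word, n):
            index[gram].add(word)
    return index


def cosine(a, b, n):
    """Cosine similarity between the n-gram count vectors of a and b."""
    ca, cb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


def lookup(index, query, n):
    """Collect words sharing at least one n-gram with query, best score first."""
    candidates = set()
    for gram in ngrams(query, n):
        candidates |= index.get(gram, set())
    return sorted(((cosine(query, w, n), w) for w in candidates), reverse=True)


words = ["apple", "banana", "orange"]
index = build_index(words, 2)
results = lookup(index, "appel", 2)
print(results)  # only 'apple' shares bigrams with 'appel'
```

The reverse index keeps the lookup cheap: only words that share at least one n-gram with the query are scored, rather than every stored string.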
Common errors
-
error: command 'gcc' failed with exit status 1 (or similar C compilation error referencing `cfuzzyset.c`)
cause Installing the original `fuzzyset` package, which has known Cython compilation issues on various systems or ships without `cfuzzyset.c` in the distribution.
fix Use the maintained fork: `pip install fuzzyset2`. This version aims to resolve the underlying C compilation problems and provides wheels.
-
ImportError: cannot import name 'cFuzzySet' from 'fuzzyset'
cause Importing `cFuzzySet` from `fuzzyset`, or importing it when the Cython-optimized `cfuzzyset` module is not compiled or available in the environment.
fix Use a conditional import: first try `from cfuzzyset import cFuzzySet as FuzzySet` and, on `ImportError`, fall back to `from fuzzyset import FuzzySet`. Ensure Cython is installed (`pip install Cython`) so that `cfuzzyset` can be built.
-
fuzzy_set.get('query') returns no matches or unexpectedly low-scoring results for visually similar strings.
cause The `gram_size_lower` or `gram_size_upper` parameters may be too restrictive for the length of your strings, or `use_levenshtein` may be set to `False`, which gives less accurate scores for transpositions and minor edits.
fix When initializing `FuzzySet`, experiment with `gram_size_lower` and `gram_size_upper` (defaults are 2 and 3). Keep `use_levenshtein=True` (the default) for better accuracy with common misspellings. Also check the input strings for leading/trailing whitespace or unexpected characters that might be removed during normalization.
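To illustrate why edit-distance scoring helps with transpositions and minor edits, the sketch below computes a length-normalized Levenshtein similarity in pure Python. The function names and the formula `1 - distance / longest_length` are assumptions for illustration and are not confirmed to match fuzzyset2's internal scoring.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Hypothetical score in [0, 1]: 1 minus distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


# A transposition ('el' vs 'le') costs two substitutions here.
print(edit_similarity("appel", "apple"))
# A single missing character costs one edit.
print(edit_similarity("banan", "banana"))
```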
Warnings
- breaking Users migrating from the original `fuzzyset` package might encounter import errors or C compilation issues if they don't explicitly install `fuzzyset2`.
- gotcha For performance-critical applications, consider using the Cython-optimized `cFuzzySet` if available, which can offer a roughly 15% performance increase.
- gotcha fuzzyset2 normalizes input strings by removing non-word characters (except spaces and commas) and converting them to lowercase before processing. This can lead to unexpected matches if case-sensitivity or special characters are critical for your matching logic.
- gotcha Adding a large number of words to a FuzzySet sequentially can be slow. Parallelization is not directly supported by the FuzzySet object itself.
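The normalization gotcha above can be approximated with a short sketch. The regular expression and the `normalize` helper are assumptions based on the description (lowercase, drop non-word characters except spaces and commas), not a copy of the library's code.

```python
import re

# Assumed pattern: anything that is not a word character, comma, or space.
_NON_WORD = re.compile(r"[^\w, ]+")


def normalize(value):
    """Lowercase the string and strip characters outside \\w, comma, and space."""
    return _NON_WORD.sub("", value.lower())


print(normalize("O'Neil-Smith, JR."))      # punctuation is dropped, comma and space kept
print(normalize("C++") == normalize("C#"))  # distinct inputs can collapse to the same key
```

If case or punctuation carries meaning in your data, it may be safer to keep the original strings alongside the set and compare them yourself after a fuzzy lookup.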
Install
-
pip install fuzzyset2
Imports
- FuzzySet
from fuzzyset import FuzzySet
- cFuzzySet
wrong: from fuzzyset import cFuzzySet
correct: from cfuzzyset import cFuzzySet
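A conditional import lets code prefer the Cython-optimized class and fall back to the pure-Python one. The sketch below wraps that pattern in a hypothetical helper, `load_fuzzyset_class`, which is not part of the library; it returns `None` when neither module is installed so the caller can decide how to react.

```python
def load_fuzzyset_class():
    """Return cFuzzySet if available, else FuzzySet, else None (hypothetical helper)."""
    try:
        # Cython-optimized implementation, present only if it was built.
        from cfuzzyset import cFuzzySet as FuzzySet
    except ImportError:
        try:
            # Pure-Python fallback.
            from fuzzyset import FuzzySet
        except ImportError:
            return None
    return FuzzySet


FuzzySet = load_fuzzyset_class()
if FuzzySet is None:
    print("fuzzyset2 is not installed; run: pip install fuzzyset2")
```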
Quickstart
from fuzzyset import FuzzySet
# Initialize with an iterable or add strings later
a = FuzzySet(['apple', 'banana', 'orange'])
# Add a new string
a.add('aple')
# Get fuzzy matches
matches = a.get('appel')
print(f"Matches for 'appel': {matches}")
matches = a.get('banan')
print(f"Matches for 'banan': {matches}")
# Index-style access is also possible and returns a list of (score, value) tuples.
# Note: .get() is generally preferred for fuzzy matching.
# matches = a['apple']
# print(matches)
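Callers often want a single best candidate above some confidence threshold rather than the raw list. The helper below is a hypothetical sketch (`best_match` and `min_score` are not part of fuzzyset2); it assumes matches arrive as a list of (score, value) tuples, as described in the quickstart, and treats a missing result as "no match". The sample scores are made up for illustration.

```python
def best_match(matches, min_score=0.5):
    """Return the highest-scoring value at or above min_score, else None."""
    if not matches:  # covers both None and an empty list
        return None
    score, value = max(matches, key=lambda pair: pair[0])
    return value if score >= min_score else None


# Illustrative data shaped like a lookup result; scores are invented.
sample = [(0.6, "apple"), (0.8, "aple")]
print(best_match(sample))                  # picks the value with the highest score
print(best_match(sample, min_score=0.9))   # nothing clears the threshold
```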