JaroWinkler String Similarity
JaroWinkler is a Python library for approximate string matching that implements the Jaro and Jaro-Winkler similarity algorithms. As of version 2.0.1, it relies on the `rapidfuzz` library for its core implementations, which gives it significant speed advantages over alternatives. The project is actively developed, with a focus on optimization and ease of integration.
Common errors
-
ModuleNotFoundError: No module named 'jarowinkler'
cause The `jarowinkler` library is not installed in the active Python environment.
fix Run `pip install jarowinkler` to install the library.
-
AttributeError: module 'jarowinkler' has no attribute 'jaro_winkler_metric'
cause Attempting to use an API call (`jaro_winkler_metric`) from a different Jaro-Winkler library (e.g., `jaro-winkler` or `pyjarowinkler`) that is not part of this specific `jarowinkler` package.
fix The correct function in this library is `jarowinkler_similarity`. Update your code to `from jarowinkler import jarowinkler_similarity` and use `jarowinkler_similarity(str1, str2)`.
-
TypeError: 'float' object cannot be interpreted as an integer (when passing non-string/non-sequence to similarity function)
cause One of the input arguments to `jaro_similarity` or `jarowinkler_similarity` is not a string or a sequence of hashable objects.
fix Ensure both arguments passed to `jaro_similarity` or `jarowinkler_similarity` are strings or sequences of hashable objects (e.g., lists of strings or numbers). For example, `jarowinkler_similarity('test', 123)` fails; use `jarowinkler_similarity('test', '123')` or `jarowinkler_similarity('test', ['1','2','3'])` instead.
-
ValueError: prefix_weight has to be between 0 and 0.25 (inclusive)
cause The `prefix_weight` parameter, when used with `jarowinkler_similarity` (or underlying `rapidfuzz` calls), was provided with a value outside its valid range.
fix Ensure `prefix_weight` is set to a float between 0.0 and 0.25, inclusive. For example: `jarowinkler_similarity('foo', 'bar', prefix_weight=0.15)`.
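The last two errors above can be caught before the library is called. The following is a minimal sketch, not part of the `jarowinkler` API: `validate_inputs` and `checked_similarity` are hypothetical helper names, and the `Sequence` check is an assumption that may be stricter than what the library itself accepts.

```python
from collections.abc import Sequence


def validate_inputs(s1, s2, prefix_weight=0.1):
    """Hypothetical pre-flight check; not part of the jarowinkler API."""
    for name, value in (("s1", s1), ("s2", s2)):
        # str, list and tuple are Sequences; int and float are not.
        if not isinstance(value, Sequence):
            raise TypeError(
                f"{name} must be a string or sequence, got {type(value).__name__}"
            )
    # Same range as described in the ValueError entry above.
    if not 0.0 <= prefix_weight <= 0.25:
        raise ValueError(
            f"prefix_weight must be between 0 and 0.25 (inclusive), got {prefix_weight}"
        )


def checked_similarity(s1, s2, prefix_weight=0.1):
    """Validate first, then delegate to the library."""
    validate_inputs(s1, s2, prefix_weight)
    # Imported lazily so the validation above works even without the library.
    from jarowinkler import jarowinkler_similarity

    return jarowinkler_similarity(s1, s2, prefix_weight=prefix_weight)
```

With this in place, `checked_similarity('test', 123)` raises a descriptive `TypeError` from the helper rather than from inside the library call.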
Warnings
- breaking Version 2.0.0 dropped support for Python 3.6 and Python 3.7. Users on these Python versions must either upgrade Python or pin `jarowinkler` to `<2.0.0`.
- breaking Since v2.0.0, the library's internal implementations are deduplicated and now rely on `rapidfuzz`. While the API aims to be consistent, `rapidfuzz` is effectively a required runtime dependency. This change might subtly alter behavior or performance characteristics from pre-2.0.0 versions which used standalone C++ implementations.
- gotcha Jaro-Winkler similarity, by design, gives a higher weight to matching prefixes. This can sometimes lead to unexpectedly high similarity scores for strings that share a long common prefix but are otherwise quite different, or lower scores if there's no common prefix, even if the strings are otherwise similar.
- gotcha The functions `jaro_similarity` and `jarowinkler_similarity` can operate on any sequence of hashable objects, not just strings. While powerful, comparing sequences of mixed types or non-comparable hashables can yield unexpected results or `TypeError`s if `__hash__` or `__eq__` methods are not consistently defined.
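To make the prefix gotcha concrete, here is a pure-Python sketch of the textbook Jaro and Jaro-Winkler formulas. It is for illustration only and is not the library's implementation; the 4-element prefix cap and the 0.7 boost threshold are the classic Winkler conventions and are assumptions here, so results from `jarowinkler` may differ in edge cases.

```python
def jaro(s1, s2):
    """Textbook Jaro similarity for two indexable sequences."""
    len1, len2 = len(s1), len(s2)
    if len1 == 0 and len2 == 0:
        return 1.0
    if len1 == 0 or len2 == 0:
        return 0.0
    # Elements only match if they are at most `window` positions apart.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i in range(len1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s1[i] == s2[j]:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count matched elements that appear in a different order.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order // 2
    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_weight=0.1):
    """Jaro score plus a boost for a shared prefix of up to 4 elements."""
    score = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    # Classic convention (assumed here): boost only reasonably similar pairs.
    if score > 0.7:
        score += prefix * prefix_weight * (1.0 - score)
    return score


if __name__ == "__main__":
    for a, b in [("MARTHA", "MARHTA"), ("DIXON", "DICKSONX"), ("CRATE", "TRACE")]:
        print(a, b, round(jaro(a, b), 4), round(jaro_winkler(a, b), 4))
```

In this sketch, "MARTHA" and "MARHTA" share the prefix "MAR", so the Jaro-Winkler score is higher than the plain Jaro score, while "CRATE" and "TRACE" differ in the first character and receive no boost at all.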
Install
-
pip install jarowinkler
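After installing, it can be useful to confirm which version ended up in the active environment, for example to check whether the 2.0.0 changes apply. A small sketch using only the standard library (`importlib.metadata`, available from Python 3.8); `installed_version` is a hypothetical helper name:

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("jarowinkler")
    if version is None:
        print("jarowinkler is not installed; run: pip install jarowinkler")
    else:
        print(f"jarowinkler {version} is installed")
```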
Imports
- jarowinkler_similarity
from jarowinkler import jarowinkler_similarity
- jaro_similarity
from jarowinkler import jaro_similarity
- jarowinkler_metric
wrong from jarowinkler import jarowinkler_metric
correct from jarowinkler import jarowinkler_similarity
Quickstart
from jarowinkler import jaro_similarity, jarowinkler_similarity
# Calculate Jaro Similarity
sim_jaro = jaro_similarity("Johnathan", "Jonathan")
print(f"Jaro Similarity: {sim_jaro:.4f}")
# Calculate Jaro-Winkler Similarity
sim_jw = jarowinkler_similarity("Johnathan", "Jonathan")
print(f"Jaro-Winkler Similarity: {sim_jw:.4f}")
# Using with a score cutoff
sim_jw_cutoff = jarowinkler_similarity("apple", "aple", score_cutoff=0.9)
print(f"Jaro-Winkler with cutoff (0.9): {sim_jw_cutoff:.4f}")
# Can also be used with sequences of hashable objects
list1 = ["this", "is", "an", "example"]
list2 = ["this", "is", "a", "example"]
sim_list = jarowinkler_similarity(list1, list2)
print(f"Similarity of lists: {sim_list:.4f}")
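A common next step after the Quickstart is picking the best candidate from a list. The sketch below is one possible way to do that; `best_match` and `default_scorer` are hypothetical names, not part of the library. The threshold is applied by the helper itself so that it works with any scorer, and a `difflib` fallback is included only so the example runs when `jarowinkler` is not installed.

```python
import difflib


def best_match(query, choices, scorer, score_cutoff=0.0):
    """Return (choice, score) for the highest-scoring choice at or above
    score_cutoff, or (None, 0.0) if no choice qualifies."""
    best_choice, best_score = None, 0.0
    for choice in choices:
        score = scorer(query, choice)
        if score >= score_cutoff and score > best_score:
            best_choice, best_score = choice, score
    return best_choice, best_score


try:
    from jarowinkler import jarowinkler_similarity as default_scorer
except ImportError:
    # Stand-in scorer (a different metric) so the sketch still runs.
    def default_scorer(a, b):
        return difflib.SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    candidates = ["apple", "maple", "grape"]
    print(best_match("aple", candidates, default_scorer, score_cutoff=0.8))
```

Passing the scorer as an argument keeps the helper independent of any particular metric, so `jaro_similarity` can be swapped in without other changes.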