Jaro-Winkler String Similarity
The `jaro-winkler` library provides Python implementations of the Jaro and Jaro-Winkler string similarity metrics. It compares two strings and returns a score from 0 (no match) to 1 (perfect match). The current version is 2.0.3, which offers both standard and customizable versions of the functions. The project does not publish a release schedule; releases appear infrequent, with major updates spaced several years apart.
Warnings
- gotcha There are two distinct Python packages with very similar names: `jaro-winkler` (this library, which imports as `jaro`) and `jarowinkler` (a different, often faster implementation by maxbachmann, which imports as `jarowinkler`). Users often confuse them, which leads to `ModuleNotFoundError` or unexpected behavior if you `pip install` one but `import` the other.
- gotcha The Jaro-Winkler algorithm, by design, gives a higher weight to matching prefixes. This means strings with a common beginning will naturally score higher, even if other parts of the strings are very different. This behavior is usually desirable for name matching but can be unexpected in other contexts.
- gotcha Although it is often called a distance metric, the Jaro-Winkler 'distance' (1 - similarity) is not a metric in the strict mathematical sense, because it can violate the triangle inequality.
- gotcha If performance matters, for example on very large datasets or when integrating with tools like RapidFuzz, the `jarowinkler` (no hyphen) package by maxbachmann (which implements the RapidFuzz C-API) may be significantly faster than this `jaro-winkler` library.
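The prefix weighting and the triangle-inequality caveat above can be illustrated with a small pure-Python sketch. This is not the `jaro-winkler` library's code: `jaro_similarity` and `jaro_winkler_similarity` are hypothetical names, and the sketch assumes the common textbook formulation (match window of half the longer length minus one, prefix capped at 4 characters, scaling factor 0.1, no boost threshold), so its scores may differ slightly from the library's.

```python
def jaro_similarity(s1: str, s2: str) -> float:
    """Textbook Jaro similarity (illustrative sketch, not the library's code)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters match only if they are equal and close enough in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler_similarity(s1: str, s2: str,
                            prefix_scale: float = 0.1,
                            max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the shared prefix."""
    base = jaro_similarity(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


# Prefix boost: a shared prefix lifts the Winkler score above plain Jaro.
print(round(jaro_similarity("MARTHA", "MARHTA"), 4))
print(round(jaro_winkler_similarity("MARTHA", "MARHTA"), 4))

# Triangle inequality: with distance = 1 - similarity, going from a to c
# through b can be "shorter" than going from a to c directly.
a, b, c = "ABC", "ADBECF", "DEF"
d_ab = 1 - jaro_winkler_similarity(a, b)
d_bc = 1 - jaro_winkler_similarity(b, c)
d_ac = 1 - jaro_winkler_similarity(a, c)
print(d_ac > d_ab + d_bc)
```

In this sketch `ABC` and `DEF` share no characters, so their distance is 1, while the interleaved string `ADBECF` is fairly close to both. That is the kind of triple for which the triangle inequality fails.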
Install
-
pip install jaro-winkler
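Because the two similarly named packages are easy to mix up, it can help to check which module names are importable after installing. The sketch below uses only the standard library; the helper name `installed_similarity_modules` is hypothetical.

```python
from importlib.util import find_spec


def installed_similarity_modules() -> dict:
    """Report which of the two similarly named modules can be imported."""
    # `jaro` comes from `pip install jaro-winkler`;
    # `jarowinkler` comes from `pip install jarowinkler`.
    names = ("jaro", "jarowinkler")
    return {name: find_spec(name) is not None for name in names}


status = installed_similarity_modules()
for module_name, available in status.items():
    print(f"{module_name}: {'available' if available else 'not installed'}")
```

If `jaro` is reported as not installed while `jarowinkler` is available, the other package was probably installed by mistake.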
Imports
- jaro_winkler_metric
from jaro import jaro_winkler_metric
- jaro_metric
from jaro import jaro_metric
- original_metric
from jaro import original_metric
Quickstart
import jaro
# Calculate Jaro-Winkler similarity
score_winkler = jaro.jaro_winkler_metric('SHACKLEFORD', 'SHACKELFORD')
print(f"Jaro-Winkler Similarity: {score_winkler}")
# Calculate Jaro similarity
score_jaro = jaro.jaro_metric('MARTHA', 'MARHTA')
print(f"Jaro Similarity: {score_jaro}")
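A common use of these scores is picking the closest candidate above a cutoff, for example when matching names. The sketch below is one way to do that, not part of the library: `best_match` and its `threshold` default of 0.9 are hypothetical, and when `jaro` is not installed the code falls back to `difflib.SequenceMatcher`, which is a different similarity measure used here only so the example still runs.

```python
from difflib import SequenceMatcher

try:
    # Provided by `pip install jaro-winkler`.
    from jaro import jaro_winkler_metric as similarity
except ImportError:
    # Fallback assumption: difflib's ratio is NOT Jaro-Winkler; it is a
    # stand-in so the sketch runs without the library.
    def similarity(a, b):
        return SequenceMatcher(None, a, b).ratio()


def best_match(query, candidates, threshold=0.9):
    """Return (candidate, score) for the best score >= threshold, else (None, 0.0)."""
    best_name, best_score = None, 0.0
    for candidate in candidates:
        score = similarity(query, candidate)
        if score >= threshold and score > best_score:
            best_name, best_score = candidate, score
    return best_name, best_score


names = ["SHACKELFORD", "SHACKLEFORD", "JONES"]
print(best_match("SHACKLEFORD", names))
print(best_match("MARTHA", ["ZZZZ"]))
```

The right threshold depends on the data; 0.9 is only a placeholder to tune against known matches and non-matches.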