String Similarity and Distance Measures
Strsimpy is a Python library that provides implementations for various string similarity and distance measures, including popular algorithms like Levenshtein, Jaro-Winkler, N-Gram, Cosine Similarity, and Jaccard Index. It's designed to be straightforward to use for text analysis and data matching tasks. The current version is 0.2.1. Releases are infrequent, typically addressing bug fixes or adding new algorithms.
Warnings
- breaking The package name was changed from `similarity` to `strsimpy`. Old import paths will fail.
- breaking The `WeightedLevenshtein` API changed significantly in v0.1.7. It now takes plain functions for weight calculation instead of class instances, which simplifies its usage.
- gotcha The `numpy` dependency was removed in v0.1.5 to lighten the package. If you were implicitly relying on `numpy` being installed alongside `strsimpy`, it may no longer be present; install it explicitly if your own code needs it.
- gotcha A `ZeroDivisionError` was fixed in ShingleBased algorithms (e.g., Jaccard, Cosine Similarity) for certain edge cases.
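To illustrate how a shingle-based measure can hit a division by zero, here is a minimal standalone sketch of a Jaccard index over character shingles. It is not strsimpy's implementation, and the exact edge cases the library fixed are not stated above. The function names, the default `k=2`, and the value returned for an empty union are all assumptions made for this example.

```python
def shingles(text, k=2):
    """Return the set of k-character substrings (shingles) of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def jaccard_similarity(s0, s1, k=2):
    """Jaccard index of the two shingle sets, guarding the empty-union case."""
    a, b = shingles(s0, k), shingles(s1, k)
    union = a | b
    if not union:
        # Both strings are shorter than k, so neither has any shingles.
        # Without this guard the division below would raise ZeroDivisionError.
        # Returning 1.0 for equal inputs and 0.0 otherwise is this sketch's choice.
        return 1.0 if s0 == s1 else 0.0
    return len(a & b) / len(union)


print(jaccard_similarity("night", "nacht"))  # one shared shingle ("ht") out of seven
print(jaccard_similarity("a", "b"))          # both shorter than k: guard applies
```

If your inputs can be very short or empty, it is worth testing those cases against the installed strsimpy version rather than assuming a particular result.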
Install
pip install strsimpy
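Because several of the warnings above depend on the installed version, it can help to check it at runtime. The sketch below uses only the standard library (`importlib.metadata`, Python 3.8+). The helper name `installed_version` is made up for this example, and it assumes the distribution is registered under the name `strsimpy`, the same name used with `pip`.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(package):
    """Return the installed version string of a distribution, or None if it is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Prints the version string if strsimpy is installed, otherwise None.
print(installed_version("strsimpy"))
```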
Imports
- Levenshtein
from strsimpy.levenshtein import Levenshtein
- JaroWinkler
from strsimpy.jaro_winkler import JaroWinkler
- NGram
from strsimpy.ngram import NGram
- WeightedLevenshtein
from strsimpy.weighted_levenshtein import WeightedLevenshtein
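The function-based weighting described in the warnings can be sketched without the library. The following is a standalone illustration of a weighted edit distance whose costs come from caller-supplied functions. It is not strsimpy's code, and the function and parameter names here are assumptions; the library's own constructor arguments may be named differently. The cheap `t`/`r` substitution is a hypothetical weighting chosen for the example.

```python
def weighted_levenshtein(s0, s1, substitution_cost, insertion_cost, deletion_cost):
    """Edit distance where each operation's cost comes from a caller-supplied function."""
    # prev[j] is the cheapest way to turn the processed prefix of s0 into s1[:j].
    prev = [0.0] * (len(s1) + 1)
    for j, ch in enumerate(s1):
        prev[j + 1] = prev[j] + insertion_cost(ch)
    for a in s0:
        curr = [prev[0] + deletion_cost(a)]
        for j, b in enumerate(s1):
            substitute = prev[j] + (0.0 if a == b else substitution_cost(a, b))
            delete = prev[j + 1] + deletion_cost(a)
            insert = curr[j] + insertion_cost(b)
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]


def substitution_cost(a, b):
    # Hypothetical weighting: 't' and 'r' are adjacent keys, so confusing them is cheap.
    if {a, b} == {"t", "r"}:
        return 0.5
    return 1.0


def unit_cost(_char):
    # Insertions and deletions keep the usual cost of 1.
    return 1.0


typo = weighted_levenshtein("String1", "Srring1", substitution_cost, unit_cost, unit_cost)
other = weighted_levenshtein("String1", "Spring1", substitution_cost, unit_cost, unit_cost)
print(typo, other)  # the t/r typo scores lower than the t/p substitution
```

Passing functions rather than objects keeps custom weighting to a few lines: any callable that maps characters to a number will do.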
Quickstart
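from strsimpy.jaro_winkler import JaroWinkler
from strsimpy.levenshtein import Levenshtein

# Levenshtein distance counts single-character insertions, deletions, and substitutions.
levenshtein = Levenshtein()

# The two Korean strings differ only in their final character.
s0 = "안녕하세요"
s1 = "안녕하세유"
distance = levenshtein.distance(s0, s1)
print(f"Levenshtein distance between '{s0}' and '{s1}': {distance}")

# One missing "p" separates these two strings.
s2 = "apple"
s3 = "aple"
distance2 = levenshtein.distance(s2, s3)
print(f"Levenshtein distance between '{s2}' and '{s3}': {distance2}")

# Jaro-Winkler returns a similarity score between 0 and 1; higher means more similar.
jaro_winkler = JaroWinkler()
similarity = jaro_winkler.similarity(s2, s3)
print(f"Jaro-Winkler similarity between '{s2}' and '{s3}': {similarity}")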
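For the data matching tasks mentioned in the introduction, a distance function is often used to pick the closest candidate from a list. The sketch below shows that pattern with a small self-contained Levenshtein implementation, so it runs without the library. `edit_distance` and `closest_match` are names invented for this example, not part of strsimpy. With the library installed, `Levenshtein().distance` could presumably be passed as the `distance` argument instead.

```python
def edit_distance(s0, s1):
    """Plain Levenshtein distance with unit costs (two-row dynamic programming)."""
    prev = list(range(len(s1) + 1))
    for i, a in enumerate(s0, start=1):
        curr = [i]
        for j, b in enumerate(s1, start=1):
            curr.append(min(prev[j] + 1,               # delete a
                            curr[j - 1] + 1,           # insert b
                            prev[j - 1] + (a != b)))   # substitute, free on a match
        prev = curr
    return prev[-1]


def closest_match(query, candidates, distance=edit_distance):
    """Return the candidate with the smallest distance to query (first one wins ties)."""
    return min(candidates, key=lambda candidate: distance(query, candidate))


print(edit_distance("apple", "aple"))
print(closest_match("aple", ["banana", "apple", "grape"]))
```

Ties are resolved by list order here; a real matching pipeline may also want a maximum distance threshold so that unrelated strings are not matched at all.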