jellyfish - Approximate and Phonetic String Matching
Jellyfish is a Python library for approximate and phonetic matching of strings. It provides string-comparison algorithms, including Levenshtein and Damerau-Levenshtein distances and Jaro and Jaro-Winkler similarities, alongside phonetic encodings such as American Soundex, Metaphone, NYSIIS, and Match Rating Codex. This makes it useful for tasks like data cleaning, typo correction, and record linkage. The library is actively maintained; the current version is 1.2.1, and releases typically focus on bug fixes and performance improvements.
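As a rough illustration of the typo-correction use case, the sketch below picks the closest entry from a list of known strings. The helper name `best_match` and the 0.8 threshold are hypothetical choices made for this example, not part of jellyfish; the similarity function is passed in as an argument so that `jellyfish.jaro_winkler_similarity` or `jellyfish.jaro_similarity` can be supplied by the caller.

```python
def best_match(query, candidates, similarity, threshold=0.8):
    """Return the candidate most similar to `query`, or None.

    `similarity` is any callable (a, b) -> float in [0, 1], for example
    jellyfish.jaro_winkler_similarity. `threshold` is an assumed cut-off
    below which no suggestion is returned; tune it for your data.
    """
    scored = [(similarity(query, candidate), candidate) for candidate in candidates]
    if not scored:
        return None
    # max() keeps the first candidate when several share the top score.
    score, candidate = max(scored, key=lambda pair: pair[0])
    return candidate if score >= threshold else None


# Possible usage, assuming jellyfish is installed:
#   import jellyfish
#   best_match("jelyfish", ["jellyfish", "starfish", "crayfish"],
#              jellyfish.jaro_winkler_similarity)
```

Taking the similarity function as a parameter keeps the helper independent of any one metric, so different scorers can be compared on the same data.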
Warnings
- breaking The functions `jellyfish.jaro_distance` and `jellyfish.jaro_winkler` were deprecated in the 0.8.x releases and removed entirely in version 1.0.1. Use `jellyfish.jaro_similarity` and `jellyfish.jaro_winkler_similarity` instead.
- breaking The `jellyfish.porter_stem` function was removed in version 0.10.0 (March 2023).
- breaking As of version 1.0.3, the `jellyfish.match_rating_codex` function raises a `ValueError` if passed non-alphabetic characters.
- gotcha Versions of `jellyfish` prior to 0.7 supported Python 2.x. All versions from 0.7 onwards require Python 3, and the current release (1.2.1) explicitly requires Python >=3.9.
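Code that has to run across several jellyfish versions can guard against the changes listed above. The following is a minimal sketch under stated assumptions: `resolve_function` and `letters_only` are hypothetical helpers written for this example, not jellyfish APIs. The first looks up a function under its current name and falls back to an older one; the second strips non-alphabetic characters before a name is handed to `jellyfish.match_rating_codex`.

```python
def resolve_function(module, *names):
    """Return the first callable attribute of `module` found among `names`."""
    for name in names:
        fn = getattr(module, name, None)
        if callable(fn):
            return fn
    raise AttributeError(f"{module!r} exposes none of: {', '.join(names)}")


def letters_only(text):
    """Drop every non-alphabetic character (digits, spaces, punctuation)."""
    return "".join(ch for ch in text if ch.isalpha())


# Possible usage, assuming jellyfish is installed:
#   import jellyfish
#   jaro = resolve_function(jellyfish, "jaro_similarity", "jaro_distance")
#   codex = jellyfish.match_rating_codex(letters_only("O'Brien 3rd"))
```

Whether dropping characters is acceptable depends on the data; for some record-linkage work it may be better to catch the `ValueError` and flag the record for review instead.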
Install
pip install jellyfish
Imports
- jellyfish
import jellyfish
- levenshtein_distance
jellyfish.levenshtein_distance(s1, s2)
- jaro_similarity
jellyfish.jaro_similarity(s1, s2)
- jaro_winkler_similarity
jellyfish.jaro_winkler_similarity(s1, s2)
Quickstart
import jellyfish
# String comparison
s1 = "jellyfish"
s2 = "smellyfish"
lev_dist = jellyfish.levenshtein_distance(s1, s2)
jaro_sim = jellyfish.jaro_similarity(s1, s2)
dam_lev_dist = jellyfish.damerau_levenshtein_distance("jellyfihs", "jellyfish")
print(f"Levenshtein Distance: {lev_dist}")
print(f"Jaro Similarity: {jaro_sim}")
print(f"Damerau-Levenshtein Distance: {dam_lev_dist}")
# Phonetic encoding
metaphone_code = jellyfish.metaphone("Jellyfish")
soundex_code = jellyfish.soundex("Jellyfish")
nysiis_code = jellyfish.nysiis("Jellyfish")
print(f"Metaphone: {metaphone_code}")
print(f"Soundex: {soundex_code}")
print(f"NYSIIS: {nysiis_code}")
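To make the distance values in the Quickstart easier to interpret, here is a pure-Python reference for plain Levenshtein distance using the standard dynamic-programming (Wagner-Fischer) recurrence. It is an illustrative sketch only, not jellyfish's implementation, and the name `levenshtein_reference` is made up for this example. It counts insertions, deletions, and substitutions, each at cost 1.

```python
def levenshtein_reference(s1, s2):
    """Edit distance counting insertions, deletions and substitutions."""
    # Iterate over the longer string so the stored rows stay as short as possible.
    if len(s1) < len(s2):
        s1, s2 = s2, s1
    # previous[j] = distance between the processed prefix of s1 and s2[:j]
    previous = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, start=1):
        current = [i]  # turning s1[:i] into "" takes i deletions
        for j, c2 in enumerate(s2, start=1):
            cost = 0 if c1 == c2 else 1
            current.append(min(
                previous[j] + 1,         # delete c1
                current[j - 1] + 1,      # insert c2
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


print(levenshtein_reference("jellyfish", "smellyfish"))  # 2
print(levenshtein_reference("jellyfihs", "jellyfish"))   # 2
```

The second call shows why the Quickstart uses Damerau-Levenshtein for the "jellyfihs" example: plain Levenshtein treats the swapped "hs" as two edits, whereas Damerau-Levenshtein counts a transposition of adjacent characters as a single edit.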