TextDistance
TextDistance is a Python library offering over 30 algorithms to compute the similarity or distance between two or more sequences. It provides a common interface for various string metrics, including edit-based, token-based, and phonetic algorithms. The library is actively maintained; the current version is 4.6.3. [2, 5, 8]
Warnings
- breaking Support for the `abydos` library was dropped in version 4.6.0. Code that relied on the `textdistance` integration with `abydos` will break.
- breaking Python 2 support was dropped in version 4.2.0. The library now explicitly supports Python 3.6+.
- gotcha For best performance, especially in production, install `textdistance` with the `[extras]` option (e.g., `pip install textdistance[extras]`). Without these optional dependencies (such as `rapidfuzz` and `numpy`), the library falls back to pure Python implementations, which are significantly slower. [5, 7, 10]
- gotcha The `Levenshtein` algorithm was fixed in version 4.6.2 so that its return type is consistently `int`. If your application relied on non-integer return values for Levenshtein distance before this version, its behavior may change subtly.
- gotcha By default, `textdistance` may try to use external libraries (like `rapidfuzz`) if they are installed and provide faster implementations for a given algorithm. This behavior is controlled by an internal `libraries.json` file. If you need to explicitly control which implementation is used or troubleshoot performance, be aware of this mechanism and the `external` argument. [4, 14]
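The `external` argument mentioned above can be sketched as follows. This is a minimal example, assuming the `Levenshtein` class accepts `external` as a constructor keyword and that `textdistance` is installed; the variable names are illustrative only.

```python
import textdistance

# external=False asks the algorithm to skip third-party backends
# and use the library's own pure Python implementation.
pure = textdistance.Levenshtein(external=False)

# The default instance may delegate to a faster backend
# (for example rapidfuzz) when one is installed.
default = textdistance.Levenshtein()

a, b = "kitten", "sitting"
pure_distance = pure.distance(a, b)
default_distance = default.distance(a, b)

# Both paths should agree on the result; only the speed should differ.
print(f"pure Python: {pure_distance}, default: {default_distance}")
```

Comparing the two instances on the same inputs is a quick way to rule out a backend as the cause of an unexpected result before looking at performance.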
Install
- pip install textdistance
- pip install textdistance[extras]
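To confirm that the optional accelerators are actually present in the current environment, a small standard-library check can help. This is a sketch only: the package names `rapidfuzz` and `numpy` come from the warnings above, and the helper `available_backends` is a hypothetical name, not part of `textdistance`.

```python
from importlib.util import find_spec

# Package names taken from the extras mentioned in the warnings;
# adjust the tuple for your own setup.
OPTIONAL_BACKENDS = ("rapidfuzz", "numpy")


def available_backends(names=OPTIONAL_BACKENDS) -> dict:
    """Map each optional package name to whether it can be imported."""
    return {name: find_spec(name) is not None for name in names}


status = available_backends()
for name, installed in status.items():
    print(f"{name}: {'installed' if installed else 'missing'}")
```

If a package is reported as missing, the library should still work, but it will likely run the slower pure Python code path for the affected algorithms.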
Imports
- levenshtein
import textdistance
distance = textdistance.levenshtein.distance('text', 'test')
- JaroWinkler
from textdistance import JaroWinkler
jw = JaroWinkler()
distance = jw.distance('martha', 'marhta')
Quickstart
import textdistance
# Calculate Levenshtein distance
str1 = "kitten"
str2 = "sitting"
distance = textdistance.levenshtein.distance(str1, str2)
similarity = textdistance.levenshtein.similarity(str1, str2)
normalized_distance = textdistance.levenshtein.normalized_distance(str1, str2)
normalized_similarity = textdistance.levenshtein.normalized_similarity(str1, str2)
print(f"Strings: '{str1}', '{str2}'")
print(f"Levenshtein Distance: {distance}")
print(f"Levenshtein Similarity: {similarity}")
print(f"Levenshtein Normalized Distance: {normalized_distance:.2f}")
print(f"Levenshtein Normalized Similarity: {normalized_similarity:.2f}")
# Example with another algorithm (Jaro-Winkler)
str3 = "martha"
str4 = "marhta"
# Jaro-Winkler is a similarity measure; .similarity() makes that explicit
jaro_winkler_similarity = textdistance.jaro_winkler.similarity(str3, str4)
print(f"\nJaro-Winkler Similarity between '{str3}' and '{str4}': {jaro_winkler_similarity:.2f}")
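The four Levenshtein values printed above are related by simple arithmetic. The sketch below reproduces them in plain Python without importing the library, assuming that similarity is the longer input length minus the distance and that normalization divides by that same length. The functions `levenshtein` and `summarize` are illustrative names, not the library's internals.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def summarize(a: str, b: str) -> dict:
    # Assumption: the longer input length is the maximum possible distance.
    longest = max(len(a), len(b))
    distance = levenshtein(a, b)
    normalized_distance = distance / longest if longest else 0.0
    return {
        "distance": distance,
        "similarity": longest - distance,
        "normalized_distance": normalized_distance,
        "normalized_similarity": 1.0 - normalized_distance,
    }


result = summarize("kitten", "sitting")
print(result)
```

For "kitten" and "sitting" this gives a distance of 3 and a similarity of 4, with the normalized values computed against the longer length of 7.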