PySimString

1.3.0 verified Mon Apr 27 auth: no python

PySimString is a Python implementation of the simstring fast string similarity search library. Version 1.3.0 supports Python 3.7-3.13 across multiple platforms. It provides efficient approximate string matching using various similarity measures (cosine, dice, jaccard, overlap, exact) with configurable feature sizes.
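To make the measure names above concrete, here is a minimal pure-Python sketch of the usual set-based definitions over character n-grams. It does not call the library; `char_ngrams` and `similarity` are hypothetical helper names, and treating features as unpadded, de-duplicated n-grams is a simplifying assumption that may differ from how the library extracts features.

```python
import math

def char_ngrams(text, n=3):
    """Return the set of character n-grams of `text` (no padding; assumption)."""
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def similarity(a, b, measure="cosine", n=3):
    """Score two strings with one of the textbook set-based measures."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    shared = len(x & y)
    if measure == "exact":
        return 1.0 if x == y else 0.0
    if measure == "cosine":
        return shared / math.sqrt(len(x) * len(y))
    if measure == "dice":
        return 2 * shared / (len(x) + len(y))
    if measure == "jaccard":
        return shared / len(x | y)
    if measure == "overlap":
        return shared / min(len(x), len(y))
    raise ValueError(f"unknown measure: {measure!r}")

# Compare two near-identical strings under each measure.
for name in ("cosine", "dice", "jaccard", "overlap", "exact"):
    print(name, round(similarity("hallo", "hello", name), 3))
```

With trigram features the two strings share only `llo`, so every fuzzy measure scores them well below 1.0; a smaller feature size makes the measures more forgiving.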

pip install pysimstring
error AttributeError: module 'simstring' has no attribute 'SimString'
cause Incorrect class import; the library exposes a `reader()` function, not a class.
fix
Replace `from simstring import SimString` with `import simstring; db = simstring.reader()`.
error TypeError: retrieve() got an unexpected keyword argument 'measure'
cause The installed version uses an outdated API in which `measure` was positional or not accepted at all.
fix
Ensure the installed pysimstring version is >= 1.0.0, then call `db.retrieve('query', measure='cosine', threshold=0.7)`.
gotcha The default similarity measure is 'exact' (not cosine). Ensure you specify the desired measure via the `measure` parameter if you need fuzzy matching.
fix Explicitly set `measure` to 'cosine', 'dice', 'jaccard', or 'overlap'.
gotcha The reader must be created via `simstring.reader()`, not `simstring.SimString()` or any other constructor. The API changed from the original simstring C++ library.
fix Use `db = simstring.reader()`.
gotcha The database is stored in memory. Adding a large number of strings can consume significant memory.
fix Insert strings in batches, or, if your installed version provides `simstring.writer()`, write the database to disk. The API is currently reader/writer-based only.

Create a simstring database, add strings, and retrieve similar strings using cosine similarity.

import simstring

# Build a database by adding strings one at a time
db = simstring.reader()
db.add('hello')
db.add('hallo')
db.add('hullo')
db.add('world')

# Use cosine similarity with threshold 0.7
results = db.retrieve('hallo', measure='cosine', threshold=0.7)
print(results)  # e.g. ['hello', 'hallo', 'hullo']; the exact matches depend on the configured feature size
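
To sanity-check which strings a given threshold should admit, a brute-force reference can be handy. The sketch below is independent of the library: `ngrams`, `cosine`, and `brute_force_retrieve` are hypothetical helpers, and the unpadded, set-based n-gram features are an assumption rather than the library's confirmed behavior.

```python
import math

def ngrams(text, n):
    """Set of character n-grams; strings shorter than n yield themselves."""
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def cosine(a, b, n):
    """Set-based cosine similarity over character n-grams."""
    x, y = ngrams(a, n), ngrams(b, n)
    return len(x & y) / math.sqrt(len(x) * len(y))

def brute_force_retrieve(query, strings, threshold, n):
    """Linear scan: keep every string whose score meets the threshold."""
    return [s for s in strings if cosine(query, s, n) >= threshold]

words = ["hello", "hallo", "hullo", "world"]

# Show how the feature size changes what passes a 0.7 threshold.
for n in (1, 2, 3):
    print(n, brute_force_retrieve("hallo", words, 0.7, n))
```

Under these assumptions, unigram features admit 'hello', 'hallo', and 'hullo' at threshold 0.7, while bigram and trigram features admit only 'hallo'. If the library returns fewer matches than expected, lowering the threshold or the feature size is the first thing to try.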