PySimString
1.3.0 verified Mon Apr 27 auth: no python
PySimString is a Python implementation of the simstring fast string similarity search library. Version 1.3.0 supports Python 3.7-3.13 across multiple platforms. It provides efficient approximate string matching using various similarity measures (cosine, dice, jaccard, overlap, exact) with configurable feature sizes.
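To give a feel for what the listed measures compute, here is a minimal pure-Python sketch using the textbook set-based definitions over character n-grams. It does not call PySimString, and the helper names `char_ngrams` and `similarity` are made up for illustration; the library's own feature extraction (padding, duplicate handling, default n-gram size) may differ.

```python
import math


def char_ngrams(text, n=3):
    """Return the set of character n-grams of text, without padding.

    Strings shorter than n are treated as a single feature (an assumption
    made here so that the feature set is never empty).
    """
    if len(text) < n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def similarity(a, b, measure="cosine", n=3):
    """Score two strings with one of the set-based similarity measures."""
    x, y = char_ngrams(a, n), char_ngrams(b, n)
    shared = len(x & y)
    if measure == "exact":
        return 1.0 if x == y else 0.0
    if measure == "cosine":
        return shared / math.sqrt(len(x) * len(y))
    if measure == "dice":
        return 2 * shared / (len(x) + len(y))
    if measure == "jaccard":
        return shared / len(x | y)
    if measure == "overlap":
        return shared / min(len(x), len(y))
    raise ValueError(f"unknown measure: {measure!r}")


for name in ("exact", "cosine", "dice", "jaccard", "overlap"):
    print(name, round(similarity("hallo", "hello", name), 3))
```

Smaller feature sizes make near-misses score higher: with `n=2`, "hallo" and "hello" share two of four bigrams each, so their cosine similarity rises from 1/3 to 0.5.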
pip install pysimstring
Common errors
error AttributeError: module 'simstring' has no attribute 'SimString'
cause Incorrect class import; the library exposes a `reader()` function, not a class.
fix Replace `from simstring import SimString` with `import simstring; db = simstring.reader()`.
error TypeError: retrieve() got an unexpected keyword argument 'measure'
cause The installed version uses an outdated API in which `measure` was positional or not accepted at all.
fix Ensure pysimstring version >= 1.0.0; call `db.retrieve('query', measure='cosine', threshold=0.7)`.
Warnings
gotcha The default similarity measure is 'exact', not cosine. Fuzzy matching requires passing the desired measure explicitly via the `measure` parameter.
fix Explicitly set `measure` to 'cosine', 'dice', 'jaccard', or 'overlap'.
gotcha The reader must be created via `simstring.reader()`, not `simstring.SimString()` or any other constructor. The API differs from that of the original simstring C++ library.
fix Use `db = simstring.reader()`.
gotcha The database is stored in memory, so adding a large number of strings can consume significant memory.
fix Insert in batches, or, if your installed version provides `simstring.writer()`, use it to write the database to disk; the library currently offers only a reader/writer-based API.
Imports
- simstring
import simstring
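Since the keyword-argument fix under Common errors depends on the installed version, it can help to check that version before importing. The sketch below uses only the standard library's `importlib.metadata` (Python 3.8+; on 3.7 the `importlib_metadata` backport would be needed). It assumes the distribution name is `pysimstring`, as in the install command, and the helpers `installed_version` and `meets_minimum` are illustrative names, not part of the library.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def meets_minimum(version_string, minimum):
    """Compare the leading numeric components of a version against a tuple.

    This is a deliberately naive parser: suffixes such as 'rc1' are ignored.
    """
    parts = []
    for piece in version_string.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    # Pad so that '1' compares equal to (1, 0, 0).
    while len(parts) < len(minimum):
        parts.append(0)
    return tuple(parts) >= tuple(minimum)


found = installed_version("pysimstring")  # distribution name is an assumption
if found is None:
    print("pysimstring is not installed")
elif not meets_minimum(found, (1, 0, 0)):
    print(f"pysimstring {found} is older than 1.0.0; consider upgrading")
else:
    print(f"pysimstring {found} is installed")
```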
Quickstart
import simstring
# Build a database from a list of strings
db = simstring.reader()
db.add('hello')
db.add('hallo')
db.add('hullo')
db.add('world')
# Use cosine similarity with threshold 0.7
results = db.retrieve('hallo', measure='cosine', threshold=0.7)
print(results)  # stored strings whose cosine similarity to 'hallo' is at least 0.7
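When choosing a threshold, a brute-force reference can be handy for sanity-checking what a query ought to return. The following sketch does not use PySimString: it scores every stored string with set-based cosine similarity over unpadded character trigrams, which is an assumption about the feature settings. `trigrams`, `cosine`, and `brute_force_retrieve` are hypothetical helper names.

```python
import math


def trigrams(text):
    """Character 3-grams without padding; shorter strings become one feature."""
    return {text[i:i + 3] for i in range(len(text) - 2)} or {text}


def cosine(a, b):
    """Set-based cosine similarity between the trigram sets of two strings."""
    x, y = trigrams(a), trigrams(b)
    return len(x & y) / math.sqrt(len(x) * len(y))


def brute_force_retrieve(strings, query, threshold):
    """Return every string whose cosine similarity to query reaches threshold."""
    return [s for s in strings if cosine(s, query) >= threshold]


words = ["hello", "hallo", "hullo", "world"]
for threshold in (0.3, 0.7):
    print(threshold, brute_force_retrieve(words, "hallo", threshold))
```

Under these assumptions, "hello" and "hullo" each share only one of three trigrams with "hallo" (cosine 1/3), so they pass a threshold of 0.3 but not 0.7, where only the exact match remains. A different n-gram size or padding scheme would shift these scores.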