cdifflib
cdifflib is a Python library that provides a C implementation of parts of the standard library's `difflib` module, specifically `SequenceMatcher`. It provides a `CSequenceMatcher` type that inherits most methods from `difflib.SequenceMatcher` and can be up to 4x faster when diffing large streams. The current version is 1.2.9; maintenance releases are irregular but ongoing, mainly to support newer Python versions and fix issues.
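The actual speed-up depends on input size and content, so it is worth measuring on your own data. The following is a minimal timing sketch, not an official benchmark: the synthetic input, the `run` helper, and the repeat count are illustrative assumptions. The `try`/`except` fallback only keeps the script runnable when `cdifflib` is not installed, in which case both timings measure `difflib`.

```python
import difflib
import random
import timeit

# Fall back to the pure-Python matcher if cdifflib is unavailable (assumption:
# in that case the two timings below simply measure the same class).
try:
    from cdifflib import CSequenceMatcher
except ImportError:
    CSequenceMatcher = difflib.SequenceMatcher

# Synthetic input: 2000 "lines" with every 50th line changed in the second copy.
random.seed(0)
a = [f"line {random.randrange(500)}" for _ in range(2000)]
b = list(a)
for i in range(0, len(b), 50):
    b[i] = "changed"


def run(matcher_cls):
    """Build a fresh matcher and compute the similarity ratio."""
    return matcher_cls(None, a, b).ratio()


py_time = timeit.timeit(lambda: run(difflib.SequenceMatcher), number=3)
c_time = timeit.timeit(lambda: run(CSequenceMatcher), number=3)

print(f"difflib.SequenceMatcher: {py_time:.4f}s")
print(f"CSequenceMatcher:        {c_time:.4f}s")
```

Both classes should report the same ratio for the same input; only the elapsed time is expected to differ.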
Common errors
- AttributeError: module 'cdifflib' has no attribute 'CSequenceMatcher'
cause This typically occurs with older `cdifflib` versions (pre-1.2.0) on Python 3, where the C extension was not built and linked correctly during installation, so `CSequenceMatcher` was never exposed.
fix Use `cdifflib>=1.2.0` (`pip install --upgrade cdifflib`). If the problem persists and no pre-compiled wheel is available for your platform, make sure a C compiler is installed (e.g., `build-essential` on Linux, Xcode Command Line Tools on macOS, Microsoft C++ Build Tools on Windows), then reinstall.
- TypeError: unhashable type: 'list'
cause The C implementation of `CSequenceMatcher` requires every element of the `a` and `b` sequences to be hashable. This error is raised when a sequence contains mutable elements such as lists or dictionaries.
fix Modify your input sequences so that all their elements are hashable. For example, convert lists to tuples: `sm = CSequenceMatcher(None, [('a',), ('b',)], [('a',), ('c',)])`.
- error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/
cause On Windows, if no pre-compiled wheel is available for your specific Python version and architecture, `pip` attempts to compile `cdifflib` from source, which requires a compatible C/C++ compiler.
fix Install the "Build Tools for Visual Studio" from the Microsoft link above and select the 'Desktop development with C++' workload. Alternatively, use a Python version for which `cdifflib` provides pre-compiled wheels.
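For the `unhashable type` error, nested lists can be converted to tuples before matching. The sketch below assumes row-like data made of (possibly nested) lists; `to_hashable` and the sample rows are hypothetical names introduced for illustration, and the import fallback to `difflib` is only there so the snippet runs without `cdifflib` installed.

```python
# Fall back to the standard library matcher if cdifflib is not installed.
try:
    from cdifflib import CSequenceMatcher
except ImportError:
    from difflib import SequenceMatcher as CSequenceMatcher


def to_hashable(item):
    """Hypothetical helper: recursively turn lists into tuples."""
    if isinstance(item, (list, tuple)):
        return tuple(to_hashable(x) for x in item)
    return item


# Each element is a list, which is unhashable and would raise TypeError.
rows_a = [["a", 1], ["b", 2], ["c", 3]]
rows_b = [["a", 1], ["c", 3]]

# Convert every element to a tuple before handing the sequences to the matcher.
seq_a = [to_hashable(row) for row in rows_a]
seq_b = [to_hashable(row) for row in rows_b]

sm = CSequenceMatcher(None, seq_a, seq_b)
ratio = sm.ratio()
print(f"Similarity ratio: {ratio:.2f}")
```

Dictionaries need a different conversion (for example a tuple of sorted key/value pairs), which is left out here to keep the sketch short.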
Warnings
- gotcha The C implementation of `CSequenceMatcher` internally converts input sequences (`a` and `b`) to `list` type if they are not already lists. While convenient, this implicit conversion can incur performance or memory overhead for very large iterables if not anticipated.
- gotcha Elements within the input sequences (`a` and `b`) must be hashable. The underlying C implementation performs hashing on sequence items. Passing sequences containing unhashable types (e.g., mutable lists or dictionaries) will result in a `TypeError: unhashable type: ...`.
- breaking Prior to version 1.2.0, `cdifflib` had installation issues on Python 3, particularly when installing from source or without pre-compiled wheels. This often led to `AttributeError: module 'cdifflib' has no attribute 'CSequenceMatcher'` as the C extension failed to build correctly.
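One way to keep the list conversion explicit is to build the lists yourself before constructing the matcher, so the memory cost is paid once at a point you control. This is a minimal sketch under that assumption; the `tokens` generator and the sample texts are hypothetical, and the `difflib` fallback exists only to keep the snippet runnable without `cdifflib`.

```python
# Fall back to the standard library matcher if cdifflib is not installed.
try:
    from cdifflib import CSequenceMatcher
except ImportError:
    from difflib import SequenceMatcher as CSequenceMatcher


def tokens(text):
    """Hypothetical helper: lazily yield stripped, non-empty lines."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            yield line


old_text = "alpha\nbeta\ngamma\n"
new_text = "alpha\ndelta\ngamma\n"

# Materialize the generators once, up front, instead of relying on an
# implicit conversion inside the matcher.
old = list(tokens(old_text))
new = list(tokens(new_text))

sm = CSequenceMatcher(None, old, new)
opcodes = sm.get_opcodes()
for tag, i1, i2, j1, j2 in opcodes:
    print(tag, old[i1:i2], new[j1:j2])
```

Because `old` and `new` are plain lists of strings, every element is also hashable, which satisfies the second warning above.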
Install
- pip install cdifflib
Imports
- CSequenceMatcher
import cdifflib
from cdifflib import CSequenceMatcher
- SequenceMatcher_monkey_patch
import difflib
from cdifflib import CSequenceMatcher
difflib.SequenceMatcher = CSequenceMatcher
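The monkey patch replaces `difflib.SequenceMatcher` for the whole process. If that is too broad, the patch can be scoped and undone afterwards. The following is a sketch of one possible approach, not something `cdifflib` ships: `patched_sequence_matcher` is a hypothetical helper, and the fallback assignment only keeps the snippet runnable when `cdifflib` is missing.

```python
import contextlib
import difflib

# Keep a reference to the standard class so it can be restored later.
stdlib_matcher = difflib.SequenceMatcher

try:
    from cdifflib import CSequenceMatcher
except ImportError:
    CSequenceMatcher = stdlib_matcher  # assumption: fallback for illustration


@contextlib.contextmanager
def patched_sequence_matcher():
    """Hypothetical helper: swap in CSequenceMatcher, then restore the original."""
    original = difflib.SequenceMatcher
    difflib.SequenceMatcher = CSequenceMatcher
    try:
        yield
    finally:
        difflib.SequenceMatcher = original


with patched_sequence_matcher():
    # unified_diff is a generator, so consume it inside the block while the
    # patched class is still in place.
    diff = list(difflib.unified_diff(["a\n", "b\n"], ["a\n", "c\n"], "old", "new"))

print("".join(diff))
```

Code that imported the class directly with `from difflib import SequenceMatcher` before the patch keeps its own reference and is not affected by either form of the patch.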
Quickstart
from cdifflib import CSequenceMatcher
# Example 1: Basic sequence matching
s = CSequenceMatcher(None, ' abcd', 'abcd abcd')
match = s.find_longest_match(0, 5, 0, 9)
print(f"Longest match: {match}")
# Example 2: With custom junk filter
s2 = CSequenceMatcher(lambda x: x == " ",
                      "private Thread currentThread;",
                      "private volatile Thread currentThread;")
ratio = round(s2.ratio(), 3)
print(f"Similarity ratio: {ratio}")