python-crfsuite
python-crfsuite is a Python binding for CRFsuite, a fast implementation of Conditional Random Fields (CRFs) for labeling sequential data. It's widely used in Natural Language Processing (NLP) for tasks like Named Entity Recognition (NER), Part-of-Speech (POS) tagging, and other sequence labeling problems. The current version is 0.9.12, and releases primarily focus on Python version compatibility and stability.
Warnings
- breaking Version 0.9.12 dropped support for Python 3.6, 3.7, 3.8, and 3.9. Users on these older Python versions must either upgrade their Python environment or pin to an older `python-crfsuite` version.
- gotcha The PyPI package name is `python-crfsuite`, but the module to import in your Python code is `pycrfsuite`.
- gotcha The `xseq` argument of `Trainer.append()` and `Tagger.tag()` must be a list with one feature list per item in the sequence. Each feature list is typically a list of strings (e.g., `[['feature1', 'feature2'], ['feature3']]` for a two-item sequence). `Trainer.append()` also takes a label sequence with one label per item. Incorrectly formatted input will lead to errors.
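The format described in the last warning can be checked before any data reaches the library. The following is a minimal sketch, not part of `pycrfsuite`: `check_sequence` is a hypothetical helper, and it only accepts the list-of-strings shape shown above, so it may be stricter than the library itself.

```python
def check_sequence(xseq, yseq=None):
    """Hypothetical helper: validate the list-of-feature-lists shape.

    xseq: one list of feature strings per item in the sequence.
    yseq: optional labels, one per item (as passed to Trainer.append()).
    """
    # A bare string would otherwise be iterated character by character.
    if isinstance(xseq, (str, bytes)) or not isinstance(xseq, (list, tuple)):
        raise TypeError("xseq must be a list with one feature list per item")
    for pos, features in enumerate(xseq):
        if isinstance(features, (str, bytes)) or not isinstance(features, (list, tuple)):
            raise TypeError(f"item {pos}: expected a list of feature strings")
        for feature in features:
            if not isinstance(feature, str):
                raise TypeError(f"item {pos}: feature {feature!r} is not a string")
    # Labels, when given, must line up one-to-one with the items.
    if yseq is not None and len(yseq) != len(xseq):
        raise ValueError(f"{len(xseq)} items but {len(yseq)} labels")
    return True


# A two-item sequence with matching labels passes the check.
check_sequence([['walk', 'big'], ['dog']], ['VERB', 'NOUN'])
```

A common mistake this catches is passing a flat list of strings such as `['walk', 'dog']`, which describes features but not items.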
Install
pip install python-crfsuite
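After installing, a quick environment check can confirm both points from the warnings: the interpreter version and the import name. This is a sketch using only the standard library. The `(3, 10)` minimum is an assumption derived from the 0.9.12 warning above, and `environment_report` is a hypothetical helper name.

```python
import importlib.util
import sys

# Assumption: 0.9.12 dropped Python 3.6-3.9, so 3.10 is taken as the minimum.
MIN_PYTHON = (3, 10)


def environment_report():
    """Return whether the interpreter is new enough and the module is importable."""
    return {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        # The import name is `pycrfsuite`, not the PyPI name `python-crfsuite`.
        "module_found": importlib.util.find_spec("pycrfsuite") is not None,
    }


print(environment_report())
```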
Imports
- Trainer
import pycrfsuite
trainer = pycrfsuite.Trainer(...)
- Tagger
import pycrfsuite
tagger = pycrfsuite.Tagger(...)
Quickstart
import pycrfsuite
import os
# Sample data (features, labels)
X_train = [
    [['walk', 'big'], ['dog']],
    [['eat', 'apple'], ['red', 'apple']],
    [['run', 'fast'], ['cat']]
]
y_train = [
    ['VERB', 'NOUN'],
    ['VERB', 'NOUN'],
    ['VERB', 'NOUN']
]
# 1. Train a CRF model
trainer = pycrfsuite.Trainer(verbose=False)
for xseq, yseq in zip(X_train, y_train):
    trainer.append(xseq, yseq)
trainer.set_params({
    'c1': 1.0,  # coefficient for L1 penalty
    'c2': 1e-3,  # coefficient for L2 penalty
    'max_iterations': 50,  # cap on training iterations, to stop earlier
    # also generate transition features for label pairs not seen in training
    'feature.possible_transitions': True
})
model_filename = 'model.crfsuite'
trainer.train(model_filename)
print(f"Model trained and saved to '{model_filename}'")
# 2. Use the trained model for tagging
tagger = pycrfsuite.Tagger()
tagger.open(model_filename)
X_test = [
    [['see', 'small'], ['dog']]
]
predicted_tags = [tagger.tag(xseq) for xseq in X_test]
print(f"Test sequence: {X_test}")
print(f"Predicted tags: {predicted_tags}")
# Clean up the model file
os.remove(model_filename)
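The Quickstart writes its feature lists by hand. In practice they are usually derived from raw tokens. The sketch below shows one way to do that with the standard library only. The function names and the feature set (`bias`, lowercase form, title case, three-character suffix, neighbouring words) are illustrative assumptions, not something `pycrfsuite` prescribes.

```python
def token_features(tokens, i):
    """Build a list of feature strings for the token at position i (illustrative)."""
    token = tokens[i]
    features = [
        "bias",
        "word.lower=" + token.lower(),
        "word.istitle=" + str(token.istitle()),
        "suffix3=" + token[-3:],
    ]
    # Context from the previous token, or a marker at the start of the sequence.
    if i > 0:
        features.append("prev.word.lower=" + tokens[i - 1].lower())
    else:
        features.append("BOS")
    # Context from the next token, or a marker at the end of the sequence.
    if i < len(tokens) - 1:
        features.append("next.word.lower=" + tokens[i + 1].lower())
    else:
        features.append("EOS")
    return features


def sent2features(tokens):
    """Return one feature list per token: the shape used for xseq above."""
    return [token_features(tokens, i) for i in range(len(tokens))]


xseq = sent2features(["See", "dog"])
print(xseq)
```

The resulting `xseq` has the same list-of-string-lists shape as the entries of `X_train` and `X_test`, so it can be passed to `trainer.append(xseq, yseq)` or `tagger.tag(xseq)` in place of the hand-written features.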