seqeval: Sequence Labeling Evaluation
seqeval is a Python framework for sequence labeling evaluation. It provides metrics such as F1 score, precision, and recall, plus a detailed classification report, for tasks such as named-entity recognition and part-of-speech tagging. The most recent release at the time of writing is 1.2.2.
Warnings
- breaking The behavior of `classification_report` changed significantly in v1.0.0: it now accepts an explicit evaluation scheme (e.g., IOB1, IOB2, BILOU), which can alter how entities are counted. Older code that relied on the implicit scheme may report different results.
- gotcha seqeval's metrics (especially F1, precision, recall) are computed at the *entity* level, not the token level, so they differ from `scikit-learn`'s. An entity counts as a true positive only when it is fully recovered, and correctly predicted 'O' (Outside) tags contribute nothing to the score (misclassified 'O' tags still count as errors). `scikit-learn`'s metrics, applied naively to flattened token-level tags, count every token, including correct 'O' tags, which can inflate scores when 'O' tags are abundant and mostly predicted correctly.
- gotcha When using `mode='strict'` (e.g., in `f1_score` or `classification_report`), only exact matches for entity spans, including both boundaries and type, are considered correct; strict mode also requires passing an explicit `scheme` (e.g., `scheme=IOB2`). This is often the desired behavior for robust evaluation, but it can result in lower scores than the default mode, which is more lenient about scheme violations.
- gotcha Input `y_true` and `y_pred` for all `seqeval.metrics` functions must be lists of lists of *strings* (e.g., `[['B-PER', 'I-PER', 'O']]`), representing the sequence tags. Providing numerical label IDs instead of string tags will result in errors.
- gotcha Performance of evaluation, particularly in `strict` mode, was significantly improved in version 1.2.1. Older versions might exhibit slower computation times for large datasets or extensive evaluations.
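To make the entity-vs-token distinction concrete, here is a minimal sketch in pure Python (no seqeval dependency, and a simplification of its actual logic) of how entity-level counting ignores 'O' tags and punishes boundary mismatches that token-level accuracy barely notices:

```python
# Sketch: extract (type, start, end) entity spans from IOB-style tags.
# 'O' tags never form entities, so they cannot be true positives.
def extract_entities(tags):
    entities, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel flushes the last span
        if tag.startswith("B-") or tag == "O":
            if etype is not None:
                entities.append((etype, start, i))
                etype = None
            if tag.startswith("B-"):
                start, etype = i, tag[2:]
        elif tag.startswith("I-") and etype == tag[2:]:
            continue  # same entity continues
        else:  # stray or type-switching I- tag: start a new span (lenient)
            if etype is not None:
                entities.append((etype, start, i))
            start, etype = i, tag[2:]
    return set(entities)

y_true = ["O", "O", "B-PER", "I-PER", "O"]
y_pred = ["O", "O", "B-PER", "O", "O"]

true_ents = extract_entities(y_true)  # {('PER', 2, 4)}
pred_ents = extract_entities(y_pred)  # {('PER', 2, 3)}

# Boundary mismatch: zero entity-level true positives...
tp = len(true_ents & pred_ents)
# ...yet token-level accuracy looks healthy because 'O' tags match.
token_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(tp, token_acc)  # 0 0.8
```

The gap between `tp = 0` and `token_acc = 0.8` is exactly why flattening tags into `scikit-learn` overstates performance on entity tasks.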
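If your model emits numeric label IDs, convert them to string tags before calling any seqeval metric. A minimal sketch (the `id2tag` mapping below is hypothetical; substitute your model's actual label map):

```python
# Hypothetical label map: adapt to your model's actual id-to-tag mapping.
id2tag = {0: "O", 1: "B-PER", 2: "I-PER"}

pred_ids = [[1, 2, 0]]  # e.g. argmax output from a tagger
# Convert to the list-of-lists-of-strings format seqeval expects.
y_pred = [[id2tag[i] for i in seq] for seq in pred_ids]
print(y_pred)  # [['B-PER', 'I-PER', 'O']]
```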
Install
pip install seqeval
Imports
- accuracy_score
from seqeval.metrics import accuracy_score
- precision_score
from seqeval.metrics import precision_score
- recall_score
from seqeval.metrics import recall_score
- f1_score
from seqeval.metrics import f1_score
- classification_report
from seqeval.metrics import classification_report
- IOB2
from seqeval.scheme import IOB2
Quickstart
from seqeval.metrics import f1_score
from seqeval.metrics import classification_report
y_true = [['O', 'O', 'B-MISC', 'I-MISC', 'B-MISC', 'O', 'O'], ['B-PER', 'I-PER', 'O']]
y_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'B-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]
# Compute F1-score (micro average by default)
print(f"F1 Score (micro): {f1_score(y_true, y_pred):.2f}")
# Compute F1-score with different averaging methods
print(f"F1 Score (average=None): {f1_score(y_true, y_pred, average=None)}")
# Generate a full classification report
report = classification_report(y_true, y_pred, digits=2)
print("\nClassification Report:\n", report)