{"id":4760,"library":"seqeval","title":"seqeval: Sequence Labeling Evaluation","description":"seqeval is a Python framework for sequence labeling evaluation. It provides metrics such as F1 score, precision, recall, and a detailed classification report for tasks like named-entity recognition and part-of-speech tagging. It is currently at version 1.2.2, with updates often focusing on performance improvements and additional evaluation schemes.","status":"active","version":"1.2.2","language":"en","source_language":"en","source_url":"https://github.com/chakki-works/seqeval","tags":["nlp","metrics","sequence-labeling","named-entity-recognition","pos-tagging","evaluation"],"install":[{"cmd":"pip install seqeval","lang":"bash","label":"Install stable version"}],"dependencies":[],"imports":[{"symbol":"accuracy_score","correct":"from seqeval.metrics import accuracy_score"},{"symbol":"precision_score","correct":"from seqeval.metrics import precision_score"},{"symbol":"recall_score","correct":"from seqeval.metrics import recall_score"},{"symbol":"f1_score","correct":"from seqeval.metrics import f1_score"},{"symbol":"classification_report","correct":"from seqeval.metrics import classification_report"},{"note":"Used to specify a tagging scheme for metrics when needed.","symbol":"IOB2","correct":"from seqeval.scheme import IOB2"}],"quickstart":{"code":"from seqeval.metrics import f1_score\nfrom seqeval.metrics import classification_report\nfrom seqeval.scheme import IOB2\n\ny_true = [['O', 'O', 'B-MISC', 'I-MISC', 'B-MISC', 'O', 'O'], ['B-PER', 'I-PER', 'O']]\ny_pred = [['O', 'O', 'B-MISC', 'I-MISC', 'B-MISC', 'I-MISC', 'O'], ['B-PER', 'I-PER', 'O']]\n\n# Compute the F1-score (micro average by default)\nprint(f\"F1 Score (micro): {f1_score(y_true, y_pred):.2f}\")\n\n# Compute per-entity-type F1-scores instead of a single average\nprint(f\"F1 Score (average=None): {f1_score(y_true, y_pred, average=None)}\")\n\n# Generate a full classification report\nreport = classification_report(y_true, y_pred, digits=2)\nprint(\"\\nClassification Report:\\n\", report)\n\n# Strict entity-level evaluation with an explicit tagging scheme\nprint(classification_report(y_true, y_pred, mode='strict', scheme=IOB2, digits=2))\n","lang":"python","description":"Calculate F1-scores and generate classification reports for sequence labeling predictions, including a strict entity-level report with an explicit tagging scheme."},"warnings":[{"fix":"Review calls to `classification_report` and explicitly pass the desired `scheme` and `mode` parameters (e.g., `classification_report(y_true, y_pred, scheme=IOB2, mode='strict')`) to ensure correct evaluation behavior. Refer to the documentation for supported schemes.","message":"The `classification_report` behavior changed significantly in v1.0.0. It now allows explicit specification of the evaluation scheme (e.g., IOB1, IOB2, BILOU), which can alter how entities are counted. Older code relying on implicit scheme assumptions might produce different results.","severity":"breaking","affected_versions":">=1.0.0"},{"fix":"Be aware of this fundamental difference when comparing evaluation results. For true entity-level evaluation in NLP sequence labeling, `seqeval` is generally preferred. If comparing with `scikit-learn`, be clear about what each metric is actually measuring.","message":"seqeval's metrics (F1, precision, recall) are calculated differently from `scikit-learn`'s for sequence labeling tasks. `seqeval` evaluates *entities*: a prediction counts as a true positive only when a whole entity span matches, and correctly predicted 'O' (Outside) tags are never counted as true positives. `scikit-learn`'s metrics, applied naively to token-level tags, include every 'O' tag in the calculation, which can inflate scores when 'O' tags are abundant and mostly predicted correctly.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If evaluating with `mode='strict'`, ensure your `y_true` and `y_pred` follow a well-defined tagging scheme (such as IOB2 or BILOU), pass that scheme explicitly (e.g., `scheme=IOB2`, which strict mode requires), and confirm that this level of strictness is appropriate for your task.","message":"When using `mode='strict'` (e.g., in `f1_score` or `classification_report`), only exact matches for entity spans (both boundaries and type) are considered correct. This is often the desired behavior for robust evaluation but can result in lower scores than the default, more lenient mode.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Ensure that your true and predicted labels are converted from numerical IDs to their corresponding string tags (e.g., 'O', 'B-PER', 'I-PER') before passing them to `seqeval` metric functions.","message":"Input `y_true` and `y_pred` for all `seqeval.metrics` functions must be lists of lists of *strings* (e.g., `[['B-PER', 'I-PER', 'O']]`) representing the sequence tags. Providing numerical label IDs instead of string tags will result in errors.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Upgrade to `seqeval` 1.2.1 or newer to benefit from performance enhancements, especially if you evaluate in `strict` mode or on large datasets.","message":"Evaluation performance, particularly in `strict` mode, was significantly improved in version 1.2.1. Older versions may exhibit slower computation times for large datasets or extensive evaluations.","severity":"gotcha","affected_versions":"<1.2.1"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}