{"id":2210,"library":"pyannote-metrics","title":"pyannote-metrics","description":"pyannote.metrics is an open-source Python library, currently at version 4.0.0, designed for reproducible evaluation, diagnosis, and error analysis of speaker diarization systems. It provides a comprehensive set of evaluation metrics and a command-line interface, making it a valuable tool for researchers in speech processing. The library follows a steady release cadence; major version bumps occasionally introduce breaking changes.","status":"active","version":"4.0.0","language":"en","source_language":"en","source_url":"https://github.com/pyannote/pyannote-metrics","tags":["audio","speech","diarization","metrics","evaluation","speaker-diarization"],"install":[{"cmd":"pip install pyannote-metrics","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Core data structures for handling annotations and segments, fundamental for defining reference and hypothesis inputs to metrics.","package":"pyannote.core"},{"reason":"Provides reproducible experimental protocols for multimedia databases, often used in conjunction with metrics for standardized evaluation.","package":"pyannote.database","optional":true}],"imports":[{"note":"Primary import for computing the Diarization Error Rate.","symbol":"DiarizationErrorRate","correct":"from pyannote.metrics.diarization import DiarizationErrorRate"},{"note":"Required from 'pyannote.core' to define ground truth and hypothesis segments for evaluation.","symbol":"Annotation","correct":"from pyannote.core import Annotation"},{"note":"Required from 'pyannote.core' to define temporal segments within annotations.","symbol":"Segment","correct":"from pyannote.core import Segment"}],"quickstart":{"code":"from pyannote.core import Segment, Annotation\nfrom pyannote.metrics.diarization import DiarizationErrorRate\n\n# Define a reference (ground truth) annotation\nreference = 
Annotation(uri='file1')\nreference[Segment(0, 10)] = 'A'\nreference[Segment(12, 20)] = 'B'\nreference[Segment(24, 27)] = 'A'\nreference[Segment(30, 40)] = 'C'\n\n# Define a hypothesis (system output) annotation\nhypothesis = Annotation(uri='file1')\nhypothesis[Segment(2, 13)] = 'a'\nhypothesis[Segment(13, 14)] = 'd'\nhypothesis[Segment(14, 20)] = 'b'\nhypothesis[Segment(22, 38)] = 'c'\nhypothesis[Segment(38, 40)] = 'd'\n\n# Instantiate the Diarization Error Rate metric\nmetric = DiarizationErrorRate()\n\n# Compute the DER\nder_value = metric(reference, hypothesis)\nprint(f\"Diarization Error Rate: {der_value:.3f}\")","lang":"python","description":"This quickstart demonstrates how to compute the Diarization Error Rate (DER) using `pyannote.metrics`. It involves creating `Annotation` objects for both the reference and hypothesis, defining temporal `Segment`s with speaker labels, and then instantiating and calling the `DiarizationErrorRate` class."},"warnings":[{"fix":"Re-evaluate existing systems with the new metric behavior. Understand how overlapping speech is handled by your diarization system and pyannote.metrics to correctly interpret results.","message":"Version 3.3.0 introduced a breaking change by improving diarization purity and coverage to explicitly account for overlapping regions, which might alter previously obtained metric values for systems that handle overlap differently.","severity":"breaking","affected_versions":">=3.3.0"},{"fix":"Always use the evaluation tool and parameters (e.g., `collar`, `skip_overlap`) specified by the benchmark you are targeting, and explicitly report all settings. Avoid direct comparisons of scores obtained from different tools.","message":"Comparison of evaluation scores across different diarization evaluation tools (e.g., `pyannote.metrics` vs. 
`md-eval`) is not recommended due to varying design choices, default parameters (like collar size), and handling of speaker mapping and overlapping speech.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Explicitly define and report the `collar` setting in your experiments. Be aware that different collar values can change DER by several percentage points, making results incomparable if not standardized.","message":"The `collar` parameter excludes a region centered on each reference boundary from evaluation; in pyannote.metrics its value is the total excluded duration (default `collar=0.0`), so `collar=0.5` corresponds to the ±250 ms forgiveness collar used by NIST md-eval. Manual annotations rarely have audio sample-level precision, so a collar is common practice, but strict benchmarks may require `collar=0.0`.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Migrate to Python 3.10 or newer and convert non-RTTM annotation files to RTTM format for compatibility.","message":"Version 2.0.1 dropped support for Python 2.7 and for all evaluation file formats except RTTM. Ensure your environment uses a supported Python 3 release and RTTM input annotations.","severity":"breaking","affected_versions":"<2.0.1"}],"env_vars":null,"last_verified":"2026-04-09T00:00:00.000Z","next_check":"2026-07-08T00:00:00.000Z"}