{"id":2756,"library":"rouge-score","title":"ROUGE Score","description":"The `rouge-score` library is a pure Python implementation of the ROUGE-1.5.5 evaluation metric, designed to closely replicate the results of the original Perl script. It provides functionalities for calculating ROUGE-N, ROUGE-L (sentence-level and summary-level), text normalization, and optional Porter stemming. The library is currently at version 0.1.2, released in July 2022, and while the version updates are infrequent, it remains an actively used and stable package maintained by Google for evaluating text generation tasks like summarization.","status":"active","version":"0.1.2","language":"en","source_language":"en","source_url":"https://github.com/google-research/google-research/tree/master/rouge","tags":["nlp","text-evaluation","summarization","machine-translation","metrics","natural-language-processing"],"install":[{"cmd":"pip install rouge-score","lang":"bash","label":"Install stable version"}],"dependencies":[{"reason":"Compatibility utilities.","package":"six","optional":false},{"reason":"Used for tokenization and Porter stemming when `use_stemmer=True`.","package":"nltk","optional":false},{"reason":"Abseil Python Common Libraries, for logging and utilities.","package":"absl-py","optional":false},{"reason":"Numerical operations.","package":"numpy","optional":false}],"imports":[{"symbol":"RougeScorer","correct":"from rouge_score import rouge_scorer"}],"quickstart":{"code":"from rouge_score import rouge_scorer\n\n# Initialize the scorer with desired ROUGE types and optional stemming\nscorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], use_stemmer=True)\n\n# Define the reference (target) and candidate (prediction) summaries\nreference_summary = \"The quick brown fox jumps over the lazy dog. It's a sunny day.\"\ncandidate_summary = \"A quick brown fox leaps over a sleeping dog. The weather is nice.\"\n\n# Calculate scores\nscores = scorer.score(reference_summary, candidate_summary)\n\n# Print the results for each ROUGE type (precision, recall, f-measure)\nfor key, value in scores.items():\n    print(f\"{key}:\")\n    print(f\"  Precision: {value.precision:.4f}\")\n    print(f\"  Recall: {value.recall:.4f}\")\n    print(f\"  F1 Score: {value.fmeasure:.4f}\")","lang":"python","description":"This example demonstrates how to initialize `RougeScorer` for common ROUGE types (ROUGE-1, ROUGE-2, ROUGE-L, ROUGE-Lsum) with stemming enabled. It then computes and prints the precision, recall, and F1-score for a given reference and candidate text. Ensure `nltk` data (like 'punkt' and 'wordnet') is available if `use_stemmer=True`."},"warnings":[{"fix":"Always specify the exact ROUGE package and version used in research or production. When comparing results, ensure the same ROUGE implementation, configuration, and preprocessing steps are applied.","message":"Beware of inconsistent ROUGE implementations across different Python packages. Many ROUGE libraries exist, and not all adhere strictly to the ROUGE-1.5.5 standard or produce identical results, leading to irreproducible and incomparable evaluation scores. 
"warnings":[{"fix":"Always specify the exact ROUGE package and version used in research or production. When comparing results, ensure the same ROUGE implementation, configuration, and preprocessing steps are applied.","message":"Beware of inconsistent ROUGE implementations across Python packages. Many ROUGE libraries exist, and not all adhere strictly to the ROUGE-1.5.5 standard or produce identical results, which leads to irreproducible and incomparable evaluation scores. `rouge-score` aims to replicate the Perl script's behavior, but results may still differ from other Python wrappers or custom implementations.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Complement ROUGE with metrics that assess semantic similarity (e.g., BERTScore), with human evaluation, and with qualitative analysis to gain a comprehensive picture of text generation quality.","message":"ROUGE metrics measure lexical overlap (n-gram matching) and are therefore semantically blind: they do not capture meaning, logical coherence, factual correctness, or fluency. Systems can score highly by repeating phrases or reusing vocabulary without understanding the content, while accurate, well-phrased paraphrases can score poorly.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Select between `rougeL` and `rougeLsum` based on your data's structure and the desired evaluation behavior; `rougeLsum` is usually preferred for multi-sentence summaries where newlines delimit sentences (see the first example above).","message":"The library distinguishes two flavors of ROUGE-L: `rougeL` (sentence-level LCS) and `rougeLsum` (summary-level union-LCS). `rougeLsum` treats newlines in the input as sentence boundaries for the LCS computation, while `rougeL` does not. Confusing the two can produce different, and potentially misleading, results for multi-sentence summaries.","severity":"gotcha","affected_versions":"All versions"},{"fix":"If stopword removal is desired, implement it as a preprocessing step on both reference and candidate texts before passing them to `RougeScorer` (see the stopword-removal example above), and be mindful of how this affects comparability with other ROUGE setups.","message":"The library supports optional Porter stemming via `use_stemmer=True` but explicitly *does not* perform stopword removal. This differs from some configurations of the original Perl ROUGE script and from other Python ROUGE implementations, so any stopword removal your task relies on must be handled externally.","severity":"gotcha","affected_versions":"All versions"},{"fix":"Use multiple, diverse human reference summaries where possible, report the dependency on references, and consider a sensitivity analysis of how scores change with different reference sets. Focus on relative improvements rather than absolute scores; the aggregation example above shows how to report bootstrap confidence intervals.","message":"ROUGE scores depend heavily on the quality and number of human-written reference summaries. Different references for the same source text can yield significantly different ROUGE scores even when all references are of high quality, reflecting the subjective nature of summarization, and this variability makes it difficult to compare models objectively.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}