ROUGE Score
The `rouge-score` library is a pure-Python implementation of the ROUGE-1.5.5 evaluation metric, designed to closely replicate the results of the original Perl script. It computes ROUGE-N and ROUGE-L (both sentence-level and summary-level), with built-in text normalization and optional Porter stemming. The library is currently at version 0.1.2, released in July 2022; although releases are infrequent, it remains a stable, widely used package maintained by Google for evaluating text-generation tasks such as summarization.
Warnings
- gotcha Beware of inconsistent ROUGE implementations across different Python packages. Many ROUGE libraries exist, and not all adhere strictly to the ROUGE-1.5.5 standard or produce identical results, leading to irreproducible and incomparable evaluation scores. `rouge-score` aims to replicate the Perl script's behavior, but results might differ from other Python wrappers or custom implementations.
- gotcha ROUGE metrics primarily evaluate lexical overlap (n-gram matching) and inherently suffer from 'semantic blindness'. They do not fully capture semantic meaning, logical coherence, factual correctness, or fluency. Systems can achieve high ROUGE scores by repeating phrases or using similar vocabulary without truly understanding the content, and conversely, well-phrased paraphrases might get lower scores.
- gotcha The `rouge-score` library distinguishes between two flavors of ROUGE-L: `rougeL` (sentence-level LCS) and `rougeLsum` (summary-level union-LCS). The choice depends on whether newlines in your text should be treated as sentence boundaries for LCS computation. Misunderstanding this distinction can lead to different and potentially incorrect evaluation results for multi-sentence summaries.
- gotcha The `rouge-score` library supports optional Porter stemming via `use_stemmer=True` but explicitly *does not* include stopword removal. This differs from some configurations of the original Perl ROUGE script and other Python ROUGE implementations. If you rely on stopword removal for specific tasks, this needs to be handled externally.
- gotcha ROUGE scores are highly dependent on the quality and number of human-written reference summaries. Different reference summaries for the same source text can lead to significantly varying ROUGE scores, even if all references are of high quality, reflecting the subjective nature of summarization. This variability can make it difficult to objectively compare models.
Install
-
pip install rouge-score
Imports
- RougeScorer
from rouge_score import rouge_scorer
Quickstart
from rouge_score import rouge_scorer
# Initialize the scorer with desired ROUGE types and optional stemming
scorer = rouge_scorer.RougeScorer(['rouge1', 'rouge2', 'rougeL', 'rougeLsum'], use_stemmer=True)
# Define the reference (target) and candidate (prediction) summaries
reference_summary = "The quick brown fox jumps over the lazy dog. It's a sunny day."
candidate_summary = "A quick brown fox leaps over a sleeping dog. The weather is nice."
# Calculate scores; note that score() takes (target, prediction) in that order
scores = scorer.score(reference_summary, candidate_summary)
# Print the results for each ROUGE type (precision, recall, f-measure)
for key, value in scores.items():
    print(f"{key}:")
    print(f"  Precision: {value.precision:.4f}")
    print(f"  Recall:    {value.recall:.4f}")
    print(f"  F1 Score:  {value.fmeasure:.4f}")