ROUGE Metric
A fast Python implementation of full ROUGE metrics for automatic summarization evaluation, also providing a Python wrapper for the official ROUGE-1.5.5.pl Perl script. It supports various ROUGE variants (N, L, W, S, SU) and multi-reference evaluation. The library is actively maintained with periodic updates.
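As background for the variants listed above, ROUGE-N is recall/precision/F1 over overlapping n-grams between a hypothesis and a reference. A minimal pure-Python sketch of the idea (standalone, not this library's code; `ngrams` and `rouge_n` are illustrative names):

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(hyp_tokens, ref_tokens, n=1):
    """ROUGE-N recall, precision, and F1 via clipped n-gram counts."""
    hyp_counts = Counter(ngrams(hyp_tokens, n))
    ref_counts = Counter(ngrams(ref_tokens, n))
    overlap = sum((hyp_counts & ref_counts).values())  # clipped overlap
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(hyp_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {'r': recall, 'p': precision, 'f': f1}

hyp = 'the cat sat on the mat'.split()
ref = 'the cat was on the mat'.split()
print(rouge_n(hyp, ref, n=1))  # 5 of 6 unigrams overlap, so r = p ≈ 0.833
```

Higher `n` tightens the match: with `n=2`, only 3 of 5 bigrams overlap for the same pair.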
Common errors
- `FileNotFoundError: [Errno 2] No such file or directory: 'ROUGE-1.5.5.pl'`
  - Cause: Using `rouge_metric.PerlRouge` on a system (especially Windows) where the ROUGE-1.5.5.pl script or its Perl interpreter (e.g., Strawberry Perl) is not installed or not added to the system's PATH.
  - Fix: Install Strawberry Perl (on Windows) and ensure its binary folder is added to your PATH environment variable. Alternatively, use the pure Python implementation `rouge_metric.Rouge` if Perl-script compatibility is not strictly required.
- `ModuleNotFoundError: No module named 'rouge_metric'`
  - Cause: The `rouge-metric` package is not installed in your current Python environment.
  - Fix: Run `pip install rouge-metric` in your terminal to install the library.
- `TypeError: evaluate() missing 1 required positional argument: 'reference'`
  - Cause: The `evaluate` method of the `Rouge` class requires both a hypothesis and a reference string as input.
  - Fix: Pass both the `hypothesis` and `reference` strings (or tokenized lists for `evaluate_from_tokens`) to the evaluation method, e.g., `rouge.evaluate(hypothesis_text, reference_text)`.
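One way to avoid the `FileNotFoundError` at runtime is to check for a Perl interpreter on PATH before choosing the Perl wrapper. A small sketch; the selection line is commented out because it assumes `rouge-metric` is installed, and `perl_available` is an illustrative helper name:

```python
import shutil

def perl_available():
    """True if a Perl interpreter is on PATH (required by PerlRouge)."""
    return shutil.which('perl') is not None

# Hypothetical selection logic, assuming rouge-metric is installed:
# from rouge_metric import PerlRouge, Rouge
# scorer = PerlRouge() if perl_available() else Rouge()
print(perl_available())
```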
Warnings
- gotcha ROUGE metrics primarily rely on n-gram overlap and do not capture semantic meaning or contextual understanding. This can lead to high scores for syntactically similar but semantically divergent texts. It is recommended to complement ROUGE with other metrics (e.g., BERTScore) or human evaluation for a comprehensive assessment.
- gotcha Multi-document evaluation results from the pure Python implementation (rouge_metric.Rouge) may differ slightly from those produced by the official ROUGE-1.5.5.pl Perl script (accessed via rouge_metric.PerlRouge), because the Python implementation does not use bootstrap resampling.
- gotcha The pure Python implementation (rouge_metric.Rouge) expects pre-tokenized sentences (lists of tokens). Preprocessing steps like tokenization, stemming, and stopword removal are left to the client, which can impact scores if not handled consistently.
- gotcha The `PerlRouge` wrapper, which calls the official ROUGE-1.5.5.pl script, is primarily intended for English corpora. For non-English summaries, use the pure Python implementation (`rouge_metric.Rouge`) instead.
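The first gotcha above can be demonstrated without the library: a hypothesis that shares surface tokens with the reference but inverts its meaning still scores high on unigram overlap, while a faithful paraphrase scores low. A standalone sketch (`rouge_1_f` is an illustrative helper, not this library's API):

```python
from collections import Counter

def rouge_1_f(hyp, ref):
    """Unigram-overlap F1 (ROUGE-1 style) over whitespace tokens."""
    h, r = Counter(hyp.split()), Counter(ref.split())
    overlap = sum((h & r).values())
    if overlap == 0:
        return 0.0
    p, rec = overlap / sum(h.values()), overlap / sum(r.values())
    return 2 * p * rec / (p + rec)

ref = 'the drug was safe and effective'
divergent = 'the drug was not safe and effective'          # opposite meaning
paraphrase = 'the medication proved harmless and worked well'  # similar meaning

print(rouge_1_f(divergent, ref))   # high score despite the meaning flip
print(rouge_1_f(paraphrase, ref))  # low score despite similar meaning
```

This is why pairing ROUGE with an embedding-based metric (e.g., BERTScore) or human judgment is advisable.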
Install
- `pip install rouge-metric`
- `pip install git+https://github.com/li-plus/rouge-metric.git@master`
Imports
- Rouge
from rouge_metric import Rouge
- PerlRouge
from rouge_metric import PerlRouge
Quickstart
from rouge_metric import Rouge

hypothesis = 'The cat sat on the mat.'
reference = 'The cat was on the mat.'

rouge = Rouge()
scores = rouge.evaluate(hypothesis, reference)
print(scores)

# Example with multiple references (list of lists of tokens)
hyp_tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
ref1_tokens = ['the', 'cat', 'was', 'on', 'the', 'mat']
ref2_tokens = ['a', 'feline', 'was', 'resting', 'on', 'the', 'rug']
scores_multi_ref = rouge.evaluate_from_tokens(hyp_tokens, [ref1_tokens, ref2_tokens])
print(scores_multi_ref)
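Among the variants listed in the intro, ROUGE-L scores by longest common subsequence (LCS) rather than fixed n-grams, so it rewards in-order matches without requiring them to be contiguous. A minimal standalone sketch of the LCS computation (not the library's implementation; `lcs_len` is an illustrative name):

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

hyp = 'the cat sat on the mat'.split()
ref = 'the cat was on the mat'.split()
lcs = lcs_len(hyp, ref)            # 'the cat ... on the mat' in order
recall, precision = lcs / len(ref), lcs / len(hyp)
f1 = 2 * precision * recall / (precision + recall)
print(lcs, round(f1, 4))  # prints: 5 0.8333
```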