sacreBLEU
sacreBLEU is a Python library for hassle-free computation of shareable, comparable, and reproducible BLEU, chrF, and TER scores for machine translation evaluation. It is actively maintained (current version 2.6.0), with releases several times a year adding new test sets, tokenizers, and minor features.
Warnings
- breaking The default output format for the CLI utility changed from single-line to JSON. Tools or scripts parsing sacrebleu's standard output will need to adapt.
- breaking Python 3.5, 3.6, and 3.8 support has been dropped in recent major versions. The library now requires Python >=3.9.
- breaking The default smoothing method and floor value for `corpus_bleu()` and `sentence_bleu()` changed, potentially yielding different scores compared to earlier versions.
- gotcha Some language-specific tokenizers require extra dependencies that are not installed by default. For example, Japanese (`-tok ja`) and Korean (`-tok ko-mecab`) tokenizers need additional packages.
Install
-
pip install sacrebleu
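The extra dependencies for the Japanese and Korean tokenizers can be pulled in via pip extras (extra names as defined in sacrebleu's packaging):

```shell
# Japanese tokenizer support (MeCab-based)
pip install "sacrebleu[ja]"
# Korean tokenizer support
pip install "sacrebleu[ko]"
```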
Imports
- corpus_bleu
import sacrebleu
score = sacrebleu.corpus_bleu(hyp, [ref])
- BLEU, CHRF, TER
from sacrebleu.metrics import BLEU, CHRF, TER
Quickstart
import sacrebleu
# Example hypothesis and reference sentences
hypothesis = "The cat sat on the mat."
references = [
"The cat is on the mat.",
"A cat sat on the mat."
]
# Calculate corpus BLEU score
# Note: sacrebleu expects lists of sentences, even for a single hypothesis/reference
bleu_score = sacrebleu.corpus_bleu([hypothesis], [references])
print(f"BLEU score: {bleu_score.score:.2f}")
print(f"BLEU string: {bleu_score.format()}")