BERTScore
BERTScore is a Python library that provides a PyTorch implementation of the BERTScore metric, an automatic evaluation metric for text generation. It computes token-level similarity between candidate and reference texts using contextual embeddings from pre-trained models such as BERT and RoBERTa, addressing limitations of n-gram-overlap metrics like BLEU. The library is actively maintained and is currently at version 0.3.13.
Warnings
- breaking BERTScore often requires specific versions of the `transformers` library. Incompatible versions can lead to `KeyError`, `AttributeError`, or incorrect scores due to changes in the underlying Hugging Face API.
- breaking The command-line option for rescaling with a baseline was changed from `--rescale-with-baseline` to `--rescale_with_baseline` for consistency with other options.
- gotcha Using the `--use_fast_tokenizer` option (or `use_fast_tokenizer=True` in Python) with Hugging Face transformers can lead to different BERTScore results due to subtle differences in tokenizer implementations compared to default slow tokenizers.
- gotcha The default or recommended model for BERTScore has evolved. The default for English is RoBERTa (`roberta-large`), but models like DeBERTa (e.g., `microsoft/deberta-xlarge-mnli`) have shown higher correlation with human judgments on some datasets.
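Given the version-compatibility warning above, it can help to pin both packages explicitly. The version bounds below are illustrative assumptions, not tested guarantees; check the bert-score release notes for the range your version actually supports.

```
# requirements.txt — illustrative pins (verify against bert-score's own requirements)
bert-score==0.3.13
transformers>=4.0,<5.0
torch>=1.0
```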
Install
-
pip install bert-score
Imports
- score
from bert_score import score
Quickstart
from bert_score import score
cands = ["The cat sat on the mat.", "The dog ate the food."]
# Each candidate may have multiple references; here, one reference each.
refs = [["The cat was on the mat."], ["A dog consumed the meal."]]
# lang="en" selects the default English model; verbose=True prints progress.
P, R, F1 = score(cands, refs, lang="en", verbose=True)
# P, R, and F1 are torch tensors with one score per candidate.
print(f"Precision: {P.mean().item():.3f}")
print(f"Recall: {R.mean().item():.3f}")
print(f"F1 Score: {F1.mean().item():.3f}")