{"id":3907,"library":"bert-score","title":"BERTScore","description":"BERTScore is a Python library that provides a PyTorch implementation of the BERTScore metric, a robust evaluation metric for text generation tasks. It leverages pre-trained BERT embeddings to compute a similarity score between generated and reference texts, addressing limitations of traditional metrics like BLEU. The library is actively maintained with regular updates and is currently at version 0.3.13.","status":"active","version":"0.3.13","language":"en","source_language":"en","source_url":"https://github.com/Tiiiger/bert_score","tags":["NLP","text-generation","evaluation","metric","BERT","PyTorch","HuggingFace"],"install":[{"cmd":"pip install bert-score","lang":"bash","label":"Install latest version"}],"dependencies":[{"reason":"Core deep learning framework for the BERTScore computations.","package":"torch"},{"reason":"Provides access to pre-trained BERT models and tokenizers.","package":"transformers"},{"reason":"Fundamental package for numerical computing.","package":"numpy"},{"reason":"Scientific computing library, potentially used for statistical operations.","package":"scipy"},{"reason":"For progress bar visualization.","package":"tqdm"},{"reason":"For downloading models.","package":"requests"},{"reason":"For handling version comparisons.","package":"packaging"}],"imports":[{"symbol":"score","correct":"from bert_score import score"}],"quickstart":{"code":"from bert_score import score\n\ncands = [\"The cat sat on the mat.\", \"The dog ate the food.\"]\nrefs = [[\"The cat was on the mat.\"], [\"A dog consumed the meal.\"]]\n\nP, R, F1 = score(cands, refs, lang=\"en\", verbose=True)\n\nprint(f\"Precision: {P.mean().item():.3f}\")\nprint(f\"Recall: {R.mean().item():.3f}\")\nprint(f\"F1 Score: {F1.mean().item():.3f}\")","lang":"python","description":"This example calculates BERTScore (Precision, Recall, and F1) between a list of candidate sentences and reference sentences. The `lang` parameter specifies the language model to use, and `verbose=True` provides detailed output during computation."},"warnings":[{"fix":"Refer to the `bert-score` release notes and test with the `transformers` version specified as compatible. If encountering issues, try pinning `transformers` to a known working version (e.g., `pip install 'transformers<4.17.0'` for `bert-score==0.3.12` or `pip install 'transformers>=4.17.0'` for `bert-score==0.3.13`; the quotes prevent the shell from treating `<` and `>` as redirection operators).","message":"BERTScore often requires specific versions of the `transformers` library. Incompatible versions can lead to `KeyError`, `AttributeError`, or incorrect scores due to changes in the underlying Hugging Face API.","severity":"breaking","affected_versions":"All versions, especially during `transformers` major updates. For instance, v0.3.7 fixed compatibility with transformers >=4.0.0, and v0.3.13 fixed issues with transformers > 4.17.0."},{"fix":"Update CLI calls to use `--rescale_with_baseline` instead of `--rescale-with-baseline`.","message":"The command-line option for rescaling with a baseline was changed from `--rescale-with-baseline` to `--rescale_with_baseline` for consistency with other options.","severity":"breaking","affected_versions":"CLI usage in versions v0.3.6 and later."},{"fix":"Be aware that enabling fast tokenizers may alter scores. For reproducibility, explicitly set `use_fast_tokenizer` to `True` or `False` based on your desired behavior and consistently use the same setting. Compare scores with and without this option if consistency with previous results is critical.","message":"Using the `--use_fast_tokenizer` option (or `use_fast_tokenizer=True` in Python) with Hugging Face transformers can lead to different BERTScore results due to subtle differences in tokenizer implementations compared to default slow tokenizers.","severity":"gotcha","affected_versions":"v0.3.10 and later when `--use_fast_tokenizer` is enabled."},{"fix":"Consider specifying a `model_type` like `microsoft/deberta-xlarge-mnli` (or other recommended models from the Hugging Face Model Hub) explicitly to potentially achieve better evaluation correlations. Refer to the official documentation or release notes for the latest recommendations and model performance benchmarks.","message":"The default or recommended model for BERTScore has evolved. While RoBERTa is often the default, models like DeBERTa (e.g., `microsoft/deberta-xlarge-mnli`) have shown higher correlation with human scores on some datasets.","severity":"gotcha","affected_versions":"v0.3.8 and later."}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}