Hugging Face Evaluate
Evaluate is a community-driven open-source library from Hugging Face that provides a standardized interface for loading and comparing more than 80 evaluation modules (metrics, comparisons, and measurements). It simplifies evaluating machine learning models by offering a consistent API across tasks. The library is actively maintained with frequent patch releases; at the time of writing, the latest version is 0.4.6.
Warnings
- breaking The `use_auth_token` parameter has been deprecated across the Hugging Face ecosystem, including `evaluate`. It has been replaced by `token` for authentication.
- breaking As of v0.4.6, `evaluate` removed support for the deprecated `HfFolder` class from `huggingface_hub`. This change adds support for `huggingface_hub>=1.0`.
- gotcha Many evaluation metrics and evaluators within the `evaluate` library have external dependencies that are not installed by default. These include `nltk`, `scikit-learn`, `transformers`, `datasets`, `jiwer`, etc.
- gotcha The `evaluate` library uses caching mechanisms for loaded metrics and sometimes for computation results. While beneficial for performance, this can lead to unexpected behavior if you modify metric parameters or source data without properly clearing or understanding the cache.
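The `use_auth_token` → `token` rename above is purely a keyword change at call sites. A minimal migration sketch (the `HF_TOKEN` environment variable is a common convention, not something the library requires):

```python
import os

# Read a token from the environment; fall back to anonymous access.
token = os.environ.get("HF_TOKEN")

# Old (deprecated):  some_hub_call(..., use_auth_token=token)
# New:               some_hub_call(..., token=token)
auth_kwargs = {"token": token} if token else {}
print(auth_kwargs)
```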
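A missing optional dependency typically surfaces as an ImportError only when the metric is loaded or computed. A small pre-flight check can catch this earlier (the module names below are illustrative; match them to the metrics you actually use, e.g. `jiwer` for WER):

```python
import importlib.util

# Modules that some evaluate metrics import lazily (illustrative list)
optional_deps = ["nltk", "sklearn", "jiwer"]

# Collect the ones not importable in the current environment
missing = [d for d in optional_deps if importlib.util.find_spec(d) is None]
for dep in missing:
    print(f"missing optional dependency: {dep}")
```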
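When a cached metric script seems stale, it helps to know where the cache lives and how to bypass it. A sketch, assuming the standard Hugging Face cache layout (`HF_HOME` overrides the root) and the `download_mode` parameter of `evaluate.load`:

```python
import os
from pathlib import Path

# Default Hugging Face cache root; evaluate stores downloaded modules under it.
hf_home = Path(os.environ.get("HF_HOME", str(Path.home() / ".cache" / "huggingface")))
evaluate_cache = hf_home / "evaluate"
print(f"evaluate cache dir: {evaluate_cache} (exists: {evaluate_cache.exists()})")

# To bypass a stale cached module script when loading:
#   metric = evaluate.load("accuracy", download_mode="force_redownload")
```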
Install
- `pip install evaluate`
- `pip install evaluate[full]`
Imports
- load
import evaluate
metric = evaluate.load("accuracy")
Quickstart
import evaluate
# Load an evaluation metric
accuracy_metric = evaluate.load("accuracy")
# Prepare dummy predictions and references
predictions = [0, 1, 0, 1, 0]
references = [0, 1, 1, 0, 0]
# Compute the metric
results = accuracy_metric.compute(predictions=predictions, references=references)
print(f"Accuracy results: {results}")
# Loading a private module from the Hub requires authentication; the `token`
# field replaces the deprecated `use_auth_token` ('accuracy' itself is public
# and needs no token).
# import os
# from datasets import DownloadConfig
# TOKEN = os.environ.get("HF_TOKEN", "")
# if TOKEN:
#     metric = evaluate.load(
#         "username/my-private-metric",  # placeholder path
#         download_config=DownloadConfig(token=TOKEN),
#     )
# Example with metric-specific arguments (BERTScore; requires the `bert_score` package)
# bertscore_metric = evaluate.load("bertscore")
# predictions_text = ["The cat sat on the mat.", "The dog ate the bone."]
# references_text = [["A cat was on the mat."], ["A dog consumed the bone."]]
# bertscore_results = bertscore_metric.compute(predictions=predictions_text, references=references_text, lang="en")
# print(f"BERTScore results (first example): {bertscore_results['f1'][0]}")
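The quickstart's accuracy value can be sanity-checked without the library, since accuracy is just the fraction of positions where prediction and reference agree:

```python
predictions = [0, 1, 0, 1, 0]
references = [0, 1, 1, 0, 0]

# Count matching positions and divide by the total
correct = sum(p == r for p, r in zip(predictions, references))
accuracy = correct / len(references)
print(accuracy)  # 0.6 — matches {'accuracy': 0.6} from evaluate.load("accuracy")
```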