Hugging Face Evaluate

0.4.6 · active · verified Thu Apr 09

Evaluate is a community-driven open-source library from Hugging Face that provides a standardized interface for loading and comparing over 80 evaluation metrics, comparisons, and measurements. It simplifies evaluating machine learning models by offering a consistent API across tasks. The library is actively maintained with frequent patch releases, currently at version 0.4.6.

Warnings

Install
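The package installs from PyPI; individual metrics may pull in extra dependencies on first use (the `bert-score` package below is one such example, needed only for the BERTScore metric):

```shell
pip install evaluate
# optional, only for metrics with extra dependencies, e.g. BERTScore:
# pip install bert-score
```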

Imports

Quickstart

This quickstart shows how to load a metric (e.g., 'accuracy') with `evaluate.load()` and compute it over sample predictions and references. While 'accuracy' requires no authentication, the commented section sketches how an `HF_TOKEN` environment variable would be passed via the `token` parameter for metrics or evaluators that access private resources on the Hugging Face Hub. Note that many metrics expect specific input formats (e.g., raw text for BERTScore) and may need extra `pip install` commands for their dependencies.

import evaluate

# Load an evaluation metric
accuracy_metric = evaluate.load("accuracy")

# Prepare dummy predictions and references
predictions = [0, 1, 0, 1, 0]
references = [0, 1, 1, 0, 0]

# Compute the metric
results = accuracy_metric.compute(predictions=predictions, references=references)
print(f"Accuracy results: {results}")

# Loading private resources from the Hugging Face Hub requires a token;
# plain metrics such as 'accuracy' do not.
# import os
# TOKEN = os.environ.get("HF_TOKEN", "")
# if TOKEN:
#     # Evaluators are created with evaluate.evaluator(); a private model or
#     # dataset would need token=TOKEN passed to the call that loads it, e.g.:
#     # task_evaluator = evaluate.evaluator("text-classification")
#     pass  # 'accuracy' itself takes no token for compute()

# Example with specific configuration (e.g., BERTScore; requires `pip install bert-score`)
# bertscore_metric = evaluate.load("bertscore")
# predictions_text = ["The cat sat on the mat.", "The dog ate the bone."]
# references_text = [["A cat was on the mat."], ["A dog consumed the bone."]]
# bertscore_results = bertscore_metric.compute(predictions=predictions_text, references=references_text, lang="en")
# print(f"BERTScore results (first example): {bertscore_results['f1'][0]}")
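As a sanity check, the accuracy value from the quickstart can be reproduced with plain Python arithmetic. This is not an evaluate API call, just a hand computation mirroring what the metric returns:

```python
# Same dummy data as the quickstart above
predictions = [0, 1, 0, 1, 0]
references = [0, 1, 1, 0, 0]

# Accuracy is the fraction of positions where prediction == reference
matches = sum(p == r for p, r in zip(predictions, references))
accuracy = matches / len(references)
print(accuracy)  # 0.6 — matches results["accuracy"] from compute()
```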
