LM Evaluation Harness
LM Evaluation Harness (lm-eval) is a comprehensive framework for evaluating language models on a wide range of benchmarks and tasks. It supports various model backends (HuggingFace, vLLM, SGLang, etc.) and provides a standardized way to compare model performance. The current version is 0.4.11, and it maintains a rapid release cadence with frequent minor updates and occasional breaking changes.
Warnings
- breaking The base `pip install lm_eval` no longer includes model backends (e.g., HuggingFace/PyTorch stack) by default. These must now be installed explicitly.
- breaking Python 3.10 or newer is now the minimum required version.
- breaking Chat template delimiter handling changed, particularly affecting multiple-choice tasks. This might alter how prompts are constructed for models expecting specific chat formats.
- gotcha Task versions can change between releases. Results from a previous task version may not be directly comparable with results from an updated version.
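The task-version gotcha above can be checked programmatically: result dumps from recent lm-eval releases include a `versions` mapping (task name → task version). A minimal sketch for flagging version drift between two runs, using hypothetical sample data rather than real output:

```python
# Sketch: flag task-version drift between two lm-eval result dumps.
# Assumes each results dict contains a "versions" mapping of
# task name -> task version, as in recent lm-eval releases.
def version_drift(old_results: dict, new_results: dict) -> dict:
    old_v = old_results.get("versions", {})
    new_v = new_results.get("versions", {})
    return {
        task: (old_v.get(task), new_v.get(task))
        for task in set(old_v) | set(new_v)
        if old_v.get(task) != new_v.get(task)
    }

# Hypothetical sample data for illustration.
run_a = {"versions": {"hellaswag": 1, "arc_easy": 1}}
run_b = {"versions": {"hellaswag": 2, "arc_easy": 1}}
print(version_drift(run_a, run_b))  # -> {'hellaswag': (1, 2)}
```

Any task appearing in the drift dict should not be compared score-for-score across the two runs.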
Install
pip install "lm-eval[main]"
pip install lm-eval            # Core only
pip install "lm-eval[hf]"      # Add HuggingFace backend
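Given the rapid release cadence and occasional breaking changes noted above, pinning the exact version is worth considering. A sketch combining an extra with a version pin (0.4.11 is the version current per the overview; substitute the release you validated against):

```shell
# Pin both the package version and the backend extra so upgrades are
# deliberate; extras and version specifiers combine with standard pip syntax.
pip install "lm-eval[hf]==0.4.11"
```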
Imports
- simple_evaluate (high-level entry point)
from lm_eval import simple_evaluate
- tasks.get_task_dict
from lm_eval import tasks
- evaluator.evaluate (lower-level; expects a constructed LM and a task dict)
from lm_eval import evaluator
- models.huggingface.HFLM (HuggingFace backend model class)
from lm_eval.models.huggingface import HFLM
Quickstart
import lm_eval

# NOTE: This quickstart uses a tiny model on CPU so it runs quickly.
# For real evaluations, use a GPU and a larger model.
# Requires the HuggingFace backend: pip install "lm-eval[hf]"
model_name = "sshleifer/tiny-gpt2"  # replace with your model
# simple_evaluate constructs the model, loads the tasks, and evaluates.
results = lm_eval.simple_evaluate(
    model="hf",                          # HuggingFace backend
    model_args=f"pretrained={model_name}",
    tasks=["hellaswag"],
    num_fewshot=0,                       # zero-shot
    batch_size=1,
    device="cpu",                        # or "cuda:0" for GPU
    limit=10,                            # cap samples for a quick smoke test
)
print(results["results"]["hellaswag"])
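Persisting the returned dict to disk is a common next step. A minimal sketch using a stand-in results dict (the real object nests per-task metrics under results["results"], and may contain values the json module cannot serialize directly, hence default=str):

```python
import json

# Stand-in for an lm-eval results dict; metric keys like "acc,none"
# mirror the 0.4.x naming but the exact schema may vary by release.
results = {"results": {"hellaswag": {"acc,none": 0.25, "acc_norm,none": 0.3}}}

with open("results.json", "w") as f:
    # default=str coerces any non-serializable values (dtypes, Paths, ...)
    json.dump(results, f, indent=2, default=str)

print(json.dumps(results["results"]["hellaswag"], sort_keys=True))
```

Loading the file back with json.load gives a plain dict suitable for diffing or plotting across runs.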