LM Evaluation Harness

0.4.11 · active · verified Sat Apr 11

LM Evaluation Harness (lm-eval) is a comprehensive framework for evaluating language models on a wide range of benchmarks and tasks. It supports various model backends (HuggingFace, vLLM, SGLang, etc.) and provides a standardized way to compare model performance. The current version is 0.4.11, and it maintains a rapid release cadence with frequent minor updates and occasional breaking changes.
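Most users drive the harness from the command line rather than the Python API. A minimal invocation might look like the following (assuming the 0.4.x CLI; flag names can change between minor releases):

```shell
# Evaluate a tiny HuggingFace model on hellaswag, capped at 10 samples.
# Swap in a real model and a CUDA device for meaningful numbers.
lm_eval --model hf \
    --model_args pretrained=sshleifer/tiny-gpt2 \
    --tasks hellaswag \
    --device cpu \
    --limit 10
```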

Warnings

Install
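The package is published on PyPI; a typical install (extras names may vary by version):

```shell
pip install lm-eval
# Optional backends ship as extras, e.g.:
# pip install "lm-eval[vllm]"
```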

Imports

Quickstart

This quickstart demonstrates how to load a HuggingFace model, select evaluation tasks, and run the evaluation through the Python API. It uses the tiny `sshleifer/tiny-gpt2` model on CPU for fast execution; substitute a larger model and a GPU for meaningful results.

import lm_eval

# NOTE: For the quickstart we use a tiny model on CPU.
# For real evaluations, use a larger model on a GPU.

# simple_evaluate builds the model, loads the tasks, and runs the
# evaluation in one call. (The lower-level evaluator.evaluate API does
# not accept num_fewshot/device/batch_size arguments directly; those
# are handled here.)
results = lm_eval.simple_evaluate(
    model="hf",  # HuggingFace backend
    model_args="pretrained=sshleifer/tiny-gpt2",  # tiny model for speed
    tasks=["hellaswag"],
    num_fewshot=0,  # number of few-shot examples (0 for zero-shot)
    device="cpu",   # or "cuda:0" for GPU
    limit=10,       # limit samples per task for quick testing
)

print(results["results"])
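The returned object is a nested dict, with per-task metrics under the `"results"` key. A minimal sketch of pulling out individual metrics, using a mocked-up dict shaped like one possible 0.4.x output (the exact metric keys vary by task, and real runs include extra metadata such as `"configs"` and `"versions"`):

```python
# Mock of the results shape lm-eval returns; values here are illustrative,
# not from a real run.
results = {
    "results": {
        "hellaswag": {"acc,none": 0.25, "acc_norm,none": 0.25},
    },
}

# Flatten the nested dict into one line per (task, metric) pair.
for task_name, metrics in results["results"].items():
    for metric_name, value in metrics.items():
        print(f"{task_name} {metric_name}: {value:.4f}")
```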
