{"id":4089,"library":"lm-eval","title":"LM Evaluation Harness","description":"LM Evaluation Harness (lm-eval) is a comprehensive framework for evaluating language models on a wide range of benchmarks and tasks. It supports multiple model backends (HuggingFace, vLLM, SGLang, and others) and provides a standardized way to compare model performance. The current version is 0.4.11; the project maintains a rapid release cadence with frequent minor updates and occasional breaking changes.","status":"active","version":"0.4.11","language":"en","source_language":"en","source_url":"https://github.com/EleutherAI/lm-evaluation-harness","tags":["LLM","evaluation","NLP","machine-learning","benchmark"],"install":[{"cmd":"pip install \"lm-eval[main]\"","lang":"bash","label":"Install core with common backends (HF, PyTorch, vLLM)"},{"cmd":"pip install lm-eval # Core only\npip install \"lm-eval[hf]\" # Add HuggingFace backend","lang":"bash","label":"Install core, then a specific backend"}],"dependencies":[{"reason":"Required for HuggingFace backend models. Part of the `[hf]` extra.","package":"torch","optional":true},{"reason":"Required for HuggingFace backend models. Part of the `[hf]` extra.","package":"transformers","optional":true},{"reason":"Required for vLLM backend models. Part of the `[vllm]` extra.","package":"vllm","optional":true}],"imports":[{"note":"Use `models.get_model` for a unified interface; it avoids importing backend-specific classes directly, which may change between releases.","wrong":"from lm_eval.models.huggingface import HFLM","symbol":"models.get_model","correct":"from lm_eval import models"},{"note":"Use `tasks.get_task_dict` to load tasks dynamically by name; it is more robust to internal structure changes than importing task modules directly.","wrong":"from lm_eval.tasks.hellaswag import HellaSwag","symbol":"tasks.get_task_dict","correct":"from lm_eval import tasks"},{"note":"`evaluator.evaluate` expects an already-instantiated model and task dict; for end-to-end runs, prefer `evaluator.simple_evaluate`, which handles both.","symbol":"evaluator.evaluate","correct":"from lm_eval import evaluator"}],"quickstart":{"code":"import lm_eval\n\n# NOTE: This quickstart uses a tiny model on CPU for fast execution.\n# For real evaluations, use a GPU and a larger model.\n# You may need to install 'lm-eval[hf]' or 'lm-eval[main]' first.\n\n# `simple_evaluate` instantiates the model, loads the tasks, and runs the\n# evaluation in a single call.\nresults = lm_eval.simple_evaluate(\n    model=\"hf\",  # HuggingFace backend\n    model_args=\"pretrained=sshleifer/tiny-gpt2\",  # replace with your model\n    tasks=[\"hellaswag\"],\n    num_fewshot=0,  # number of few-shot examples (0 for zero-shot)\n    batch_size=1,\n    device=\"cpu\",  # or \"cuda:0\" for GPU\n    limit=10,  # limit samples per task for quick testing\n)\n\nprint(results[\"results\"])","lang":"python","description":"This quickstart runs an evaluation through `lm_eval.simple_evaluate`, the top-level Python entry point, which instantiates the model, loads the tasks, and runs the evaluation in a single call. It uses the tiny `sshleifer/tiny-gpt2` model on CPU for fast execution; replace it with a more capable model and a GPU for meaningful results."},"warnings":[{"fix":"Install with extras: `pip install \"lm-eval[main]\"` for common backends, or `pip install \"lm-eval[hf]\"` for HuggingFace, `pip install \"lm-eval[vllm]\"` for vLLM, etc.","message":"The base `pip install lm-eval` no longer includes model backends (e.g., the HuggingFace/PyTorch stack) by default; they must now be installed explicitly via extras.","severity":"breaking","affected_versions":">=0.4.10"},{"fix":"Upgrade your Python environment to 3.10 or later.","message":"Python 3.10 or newer is now the minimum required version.","severity":"breaking","affected_versions":">=0.4.9.2 (Python 3.8 support was dropped in v0.4.8)"},{"fix":"Review your chat template configurations and model expectations, especially for tasks sensitive to prompt formatting. Test against previous versions if possible.","message":"Chat template delimiter handling changed, particularly for multiple-choice tasks. This can alter how prompts are constructed for models that expect specific chat formats.","severity":"breaking","affected_versions":">=0.4.6"},{"fix":"Always record the `lm-eval` version and the specific task versions when reporting or comparing results. Consult release notes or task definitions for version changes.","message":"Task versions can change between releases; results from a previous task version may not be directly comparable with results from an updated one.","severity":"gotcha","affected_versions":"All versions (ongoing concern)"}],"env_vars":null,"last_verified":"2026-04-11T00:00:00.000Z","next_check":"2026-07-10T00:00:00.000Z"}