{"id":4568,"library":"human-eval","title":"HumanEval Benchmark for Code Generation","description":"HumanEval is a benchmark developed by OpenAI for assessing the code generation capabilities of Large Language Models (LLMs). It comprises 164 hand-written Python programming problems, each with a function signature, docstring, and comprehensive unit tests, designed to evaluate functional correctness. The library uses the `pass@k` metric for evaluation. The current version is 1.0.3, released on July 24, 2023. As a benchmark dataset and evaluation harness, it has an infrequent release cadence, with updates typically driven by new research or significant improvements to the benchmark itself.","status":"active","version":"1.0.3","language":"en","source_language":"en","source_url":"https://github.com/openai/human-eval","tags":["LLM evaluation","code generation","benchmark","Python","functional correctness","pass@k","AI","machine learning"],"install":[{"cmd":"pip install human-eval","lang":"bash","label":"Install from PyPI"},{"cmd":"git clone https://github.com/openai/human-eval.git\ncd human-eval\npip install -e .","lang":"bash","label":"Install for development (after cloning)"}],"dependencies":[],"imports":[{"symbol":"read_problems","correct":"from human_eval.data import read_problems"},{"symbol":"write_jsonl","correct":"from human_eval.data import write_jsonl"},{"note":"This is a command-line utility, not a Python import.","symbol":"evaluate_functional_correctness (command-line)","correct":"evaluate_functional_correctness samples.jsonl"}],"quickstart":{"code":"import os\nimport json\nfrom human_eval.data import write_jsonl, read_problems\n\ndef generate_one_completion(prompt: str) -> str:\n    \"\"\"A placeholder for your LLM's code generation function.\n    Replace this with actual API calls to your LLM.\n    It should take a problem prompt and return a generated code string.\n    \"\"\"\n    # Example: A simple dummy completion for demonstration\n    if \"def 
multiply\" in prompt:\n        return \"def multiply(a, b):\\n    return a * b\"\n    elif \"def add\" in prompt:\n        return \"def add(a, b):\\n    return a + b\"\n    else:\n        return \"def solution():\\n    pass # Your LLM generated code here\"\n\n\n# 1. Read the HumanEval problems\nproblems = read_problems()\n\n# 2. Generate completions for each problem\nnum_samples_per_task = 1 # For quick demonstration, typically >100 for robust eval\nsamples = []\nfor task_id in problems:\n    prompt = problems[task_id][\"prompt\"]\n    for _ in range(num_samples_per_task):\n        completion = generate_one_completion(prompt)\n        samples.append(dict(task_id=task_id, completion=completion))\n\n# 3. Save the generated samples to a JSON Lines file\nsamples_filepath = \"samples.jsonl\"\nwrite_jsonl(samples_filepath, samples)\nprint(f\"Generated samples saved to {samples_filepath}\")\n\n# 4. Evaluate functional correctness (typically run as a separate command-line step)\nprint(\"\\nTo evaluate, run the following from your terminal (after installing human-eval):\")\nprint(f\"$ evaluate_functional_correctness {samples_filepath}\")\nprint(\"\\nWARNING: This command executes untrusted model-generated code. 
Ensure you are in a robust security sandbox.\")\n\n# Example of output from evaluate_functional_correctness (if run separately):\n# {'pass@1': ..., 'pass@10': ..., 'pass@100': ...}","lang":"python","description":"This quickstart demonstrates the core workflow: reading the HumanEval problems, generating a code completion for each with a placeholder function (replace it with your real LLM integration), saving the completions to a JSON Lines file, and showing how to invoke the `evaluate_functional_correctness` command-line tool to score the generated code for functional correctness."},"warnings":[{"fix":"Users are *strongly encouraged* to run the `evaluate_functional_correctness` tool within a robust security sandbox (e.g., Docker, a dedicated VM, or an isolated environment like Riza). Review `human_eval/execution.py` and uncomment the execution line only after understanding the risks and implementing proper sandboxing.","message":"Executing model-generated code carries significant security risks. The `execution.py` module in the `human-eval` library deliberately comments out the actual code execution call.","severity":"breaking","affected_versions":"All versions"},{"fix":"For optimal results, use instruction-tuned models. If using base models, a post-generation filtering step (such as a custom `filter_code` helper) might be necessary to clean up outputs before evaluation.","message":"Base language models (not instruction-tuned) might produce repetitive or malformed outputs that can break benchmark scores. 
Malformed output is particularly common when sampling through chat completion APIs, which often wrap code in markdown fences or explanatory prose that must be stripped before evaluation.","severity":"gotcha","affected_versions":"All versions"},{"fix":"To mitigate contamination, use time-split datasets, rigorously audit potential overlap sources, and compare `pass@k` scores on fresh, internal tasks.","message":"The HumanEval benchmark can be susceptible to 'contamination': test problems or near-identical solutions may have been part of an LLM's training data, leading to artificially inflated scores.","severity":"gotcha","affected_versions":"All versions"},{"fix":"To speed up evaluation: increase worker parallelism (`evaluate_functional_correctness` accepts `--n_workers`, e.g., `--n_workers $(nproc)`; the EvalPlus harness uses `--parallel`), avoid EvalPlus's `--test-details` flag if only `pass@k` scores are needed (it runs all tests instead of stopping at the first failure), and consider reduced variants such as HumanEval+ Mini for faster checks.","message":"Evaluating a large number of samples, or requesting full per-test details (e.g., via EvalPlus's `--test-details` flag), can be computationally intensive and slow.","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-12T00:00:00.000Z","next_check":"2026-07-11T00:00:00.000Z"}