HumanEval Benchmark for Code Generation
HumanEval is a benchmark developed by OpenAI for assessing the code generation capabilities of Large Language Models (LLMs). It comprises 164 hand-written Python programming problems, each with a function signature, docstring, and comprehensive unit tests, designed to evaluate functional correctness. Evaluation uses the `pass@k` metric: the probability that at least one of k sampled completions per problem passes the unit tests. The current version is 1.0.3, released on July 24, 2023. As a benchmark dataset and evaluation harness, it has an infrequent release cadence, with updates typically driven by new research or significant improvements to the benchmark itself.
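The unbiased `pass@k` estimator from the HumanEval paper can be computed with the standard library alone. A minimal sketch (the `pass_at_k` name is my own; the library exposes an equivalent NumPy-based estimator):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples,
    drawn without replacement from n generated samples of which c are
    correct, passes the unit tests."""
    if n - c < k:
        # Every size-k subset must contain at least one correct sample
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 200 samples per task, 10 of which pass the tests
print(pass_at_k(200, 10, 1))  # 0.05 (the raw fraction correct)
```

With k=1 the estimator reduces to the plain fraction of correct samples, which is why single-sample runs only report `pass@1`.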
Warnings
- breaking Executing model-generated code carries significant security risks. The `execution.py` module in the `human-eval` library therefore ships with the actual code-execution call commented out; you must read the safety notice in that file and uncomment the call yourself, and you should only run evaluation inside a robust sandbox (e.g. a container or VM).
- gotcha Base language models (not instruction-tuned) may produce repetitive or malformed completions that distort benchmark scores. This is particularly common when sampling through chat completion APIs, which also tend to wrap code in markdown fences and explanatory prose that must be stripped before evaluation.
- gotcha The HumanEval benchmark is susceptible to contamination: its problems or near-identical solutions may have been part of an LLM's training data, leading to artificially inflated scores.
- gotcha Evaluating a large number of samples or using the `--test-details` flag for `evaluate_functional_correctness` can be computationally intensive and slow.
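For the chat-API gotcha above, a common mitigation is to post-process completions before evaluation: unwrap markdown fences and truncate at tokens that begin unrelated top-level code. A minimal sketch (the `STOP_SEQUENCES` list and `clean_completion` helper are illustrative, not part of `human-eval`):

```python
import re

# Tokens that typically mark the start of code beyond the requested function
STOP_SEQUENCES = ["\nclass ", "\ndef ", "\nif __name__", "\nprint("]

def clean_completion(text: str) -> str:
    # If the model wrapped its answer in a markdown fence, keep only the code
    match = re.search(r"```(?:python)?\n(.*?)```", text, re.DOTALL)
    if match:
        text = match.group(1)
    # Truncate at the first stop sequence (start of a new top-level block)
    for stop in STOP_SEQUENCES:
        idx = text.find(stop)
        if idx != -1:
            text = text[:idx]
    return text.rstrip("\n")

raw = "Here you go:\n```python\n    return a + b\n\ndef unrelated():\n    pass\n```"
print(repr(clean_completion(raw)))  # '    return a + b'
```

Which stop sequences are appropriate depends on whether your model emits the function body alone or restates the whole function definition.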
Install
- From PyPI:
pip install human-eval
- From source (editable install):
git clone https://github.com/openai/human-eval.git
cd human-eval
pip install -e .
Imports
- read_problems
from human_eval.data import read_problems
- write_jsonl
from human_eval.data import write_jsonl
- evaluate_functional_correctness (command-line)
evaluate_functional_correctness samples.jsonl
Quickstart
from human_eval.data import write_jsonl, read_problems

def generate_one_completion(prompt: str) -> str:
    """Placeholder for your LLM's code generation function.

    Replace this with actual API calls to your LLM. It receives a problem
    prompt (function signature plus docstring) and should return a string
    of generated code that completes it.
    """
    # Example: simple dummy completions for demonstration
    if "def multiply" in prompt:
        return "def multiply(a, b):\n    return a * b"
    elif "def add" in prompt:
        return "def add(a, b):\n    return a + b"
    else:
        return "def solution():\n    pass  # Your LLM-generated code here"

# 1. Read the 164 HumanEval problems
problems = read_problems()

# 2. Generate completions for each problem
num_samples_per_task = 1  # quick demonstration; typically 100+ for a robust pass@k estimate
samples = []
for task_id in problems:
    prompt = problems[task_id]["prompt"]
    for _ in range(num_samples_per_task):
        completion = generate_one_completion(prompt)
        samples.append(dict(task_id=task_id, completion=completion))

# 3. Save the generated samples to a JSON Lines file
samples_filepath = "samples.jsonl"
write_jsonl(samples_filepath, samples)
print(f"Generated samples saved to {samples_filepath}")

# 4. Evaluate functional correctness (run as a separate command-line step)
print("\nTo evaluate, run the following from your terminal (after installing human-eval):")
print(f"$ evaluate_functional_correctness {samples_filepath}")
print("\nWARNING: This command executes untrusted model-generated code. Run it only inside a robust security sandbox.")

# Example output from evaluate_functional_correctness (run separately):
# {'pass@1': ...}  # pass@10 / pass@100 require at least 10 / 100 samples per task
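Besides printing the aggregate scores, `evaluate_functional_correctness` writes a per-sample results file alongside the input (e.g. `samples.jsonl_results.jsonl`). A sketch of inspecting it with the standard library, using synthetic records whose field names are assumed from the library's output format (notably the per-sample `passed` flag and `result` message):

```python
import json
import os
import tempfile

# Synthetic records mimicking the per-sample results file (illustrative only)
records = [
    {"task_id": "HumanEval/0", "completion": "...", "result": "passed", "passed": True},
    {"task_id": "HumanEval/1", "completion": "...", "result": "failed: AssertionError", "passed": False},
]

path = os.path.join(tempfile.mkdtemp(), "samples.jsonl_results.jsonl")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read the results back and tally which samples passed their unit tests
with open(path) as f:
    results = [json.loads(line) for line in f]

passed = sum(r["passed"] for r in results)
print(f"{passed}/{len(results)} samples passed")  # 1/2 samples passed
```

Grouping the per-sample records by `task_id` lets you see which specific problems a model fails, which the aggregate pass@k numbers hide.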