HumanEval Benchmark for Code Generation

1.0.3 · active · verified Sun Apr 12

HumanEval is a benchmark developed by OpenAI for assessing the code generation capabilities of Large Language Models (LLMs). It comprises 164 hand-written Python programming problems, each with a function signature, docstring, and unit tests, and it evaluates functional correctness (does the generated code pass the tests?) rather than textual similarity, reporting results with the `pass@k` metric. The current version is 1.0.3, released on July 24, 2023. As a benchmark dataset and evaluation harness, it has an infrequent release cadence, with updates typically driven by new research or significant improvements to the benchmark itself.
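The `pass@k` value is an unbiased estimator: for each problem, generate n ≥ k samples, count the c that pass the unit tests, compute 1 − C(n−c, k)/C(n, k), and average over problems. A minimal sketch of the per-problem estimator (the `pass_at_k` helper name here is illustrative, not necessarily the library's own API):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples generated, c of them correct."""
    if n - c < k:
        # Fewer than k failing samples, so every size-k draw contains a success.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# With 10 samples of which 3 pass, pass@1 reduces to c/n = 0.3.
p1 = pass_at_k(10, 3, 1)
```

The complementary form avoids enumerating all size-k subsets: C(n−c, k)/C(n, k) is exactly the probability that a random draw of k samples contains no correct one.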

Warnings

The evaluation step executes untrusted, model-generated code. Run `evaluate_functional_correctness` only inside a strong security sandbox; the harness ships with program execution disabled until you explicitly enable it after reading the safety notice in the source.

Install
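The package is typically installed from source, following the repository's README (assumes Python 3.7+ and `git` available):

```shell
# Clone the official repository and install it in editable mode.
git clone https://github.com/openai/human-eval
pip install -e human-eval
```

Installing in editable mode (`-e`) makes it easy to apply the one-line source edit that enables sandboxed code execution (see Warnings).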

Imports

from human_eval.data import read_problems, write_jsonl

Quickstart

This quickstart demonstrates the core workflow: read the HumanEval problems, generate a code completion for each one with a placeholder function (replace it with your own LLM call), save the completions to a JSON Lines file, and finally evaluate them with the `evaluate_functional_correctness` command-line tool.

from human_eval.data import write_jsonl, read_problems

def generate_one_completion(prompt: str) -> str:
    """A placeholder for your LLM's code generation function.
    Replace this with actual API calls to your LLM.
    It should take a problem prompt and return a generated code string.
    """
    # Example: A simple dummy completion for demonstration
    if "def multiply" in prompt:
        return "def multiply(a, b):\n    return a * b"
    elif "def add" in prompt:
        return "def add(a, b):\n    return a + b"
    else:
        return "def solution():\n    pass # Your LLM generated code here"


# 1. Read the HumanEval problems
problems = read_problems()

# 2. Generate completions for each problem
num_samples_per_task = 1 # For quick demonstration, typically >100 for robust eval
samples = []
for task_id in problems:
    prompt = problems[task_id]["prompt"]
    for _ in range(num_samples_per_task):
        completion = generate_one_completion(prompt)
        samples.append(dict(task_id=task_id, completion=completion))

# 3. Save the generated samples to a JSON Lines file
samples_filepath = "samples.jsonl"
write_jsonl(samples_filepath, samples)
print(f"Generated samples saved to {samples_filepath}")

# 4. Evaluate functional correctness (typically run as a separate command-line step)
print("\nTo evaluate, run the following from your terminal (after installing human-eval):")
print(f"$ evaluate_functional_correctness {samples_filepath}")
print("\nWARNING: This command executes untrusted model-generated code. Ensure you are in a robust security sandbox.")

# Example of output from evaluate_functional_correctness (if run separately):
# {'pass@1': ..., 'pass@10': ..., 'pass@100': ...}
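For reference, each line of the resulting `samples.jsonl` is a standalone JSON object with `task_id` and `completion` keys, matching the dicts built in step 2. A minimal round-trip sketch (the `task_id` value is illustrative):

```python
import json

# One record per line: the task ID and the raw completion string.
sample = {"task_id": "HumanEval/0", "completion": "    return a + b\n"}
line = json.dumps(sample)

# Parsing the line recovers the original record unchanged.
parsed = json.loads(line)
```

Because each line is an independent JSON object, large sample files can be streamed and evaluated without loading everything into memory.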
