SWE-bench
The official SWE-bench package (current version 4.1.0) provides a benchmark for evaluating large language models (LLMs) on software engineering tasks: it automatically tests model-generated code fixes against real-world software bugs. The package is actively developed, with frequent updates and significant changes between major versions.
Warnings
- breaking SWE-bench v4.0.0 introduced significant breaking changes related to how Docker environments are specified and managed. If upgrading from earlier versions (e.g., v3.x), review the new Docker integration patterns.
- breaking SWE-bench v3.0.0 included a major refactor with breaking changes to how environments are specified and built for task evaluation. Code relying on older environment configuration schemas will likely fail.
- gotcha While `pip install swebench` installs the core library, running actual task evaluations (which build and test code environments) requires `conda` and `docker` to be pre-installed and properly configured on your system.
- gotcha The SWE-bench benchmark dataset itself is not included with the `pip` package. It must be separately downloaded using the `swebench download` CLI command before you can programmatically access tasks using `get_tasks`.
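Both gotchas above can be caught before an evaluation run with a quick preflight check. The sketch below uses only the standard library; the tool names come from the warning above, and the default `data/default_swebench_tasks.json` path mirrors the Quickstart below — adjust both to your setup.

```python
import os
import shutil


def missing_prereqs(tools=("conda", "docker")):
    """Return the subset of `tools` not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


def data_ready(data_path=os.path.join("data", "default_swebench_tasks.json")):
    """True if a downloaded SWE-bench task file exists at `data_path`."""
    return os.path.isfile(data_path)


if __name__ == "__main__":
    gaps = missing_prereqs()
    if gaps:
        print(f"Missing prerequisites: {', '.join(gaps)}")
    if not data_ready():
        print("Task data not found; run `swebench download` first.")
```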
Install
-
pip install swebench
Imports
- get_tasks
from swebench import get_tasks
- SWEBenchRunner
from swebench.harness.runner import SWEBenchRunner
- ModelEngine
from swebench.harness.engine_wrappers import ModelEngine
Quickstart
import os

from swebench import get_tasks

# --- Quickstart: Accessing SWE-bench data ---
# Note: SWE-bench data must be downloaded separately using the CLI:
#     swebench download
# This command typically creates a 'data' directory in your current working directory.
# Adjust data_path if your data is located elsewhere (e.g., a specific split like lite).
data_path = os.path.join(os.getcwd(), 'data', 'default_swebench_tasks.json')
# Use 'lite_swebench_tasks.json' instead for the smaller lite split.

tasks = []
try:
    # Attempt to load tasks from the specified path
    tasks = get_tasks(data_path=data_path)
    print(f"Successfully loaded {len(tasks)} tasks from {data_path}")
    if tasks:
        print("\nExample task structure (first task):")
        # Print a subset of a task's fields for brevity
        first_task = tasks[0]
        for key in ['repo', 'pull_request', 'instance_id', 'problem_statement', 'base_commit']:
            if key in first_task:
                # Coerce to str before slicing: not every field is guaranteed to be a string
                value = str(first_task[key])
                print(f"  {key}: {value[:100]}{'...' if len(value) > 100 else ''}")
except FileNotFoundError:
    print(f"Error: Data file not found at {data_path}.")
    print("Please ensure you have run `swebench download` in your terminal,")
    print("or specify the correct path to your downloaded SWE-bench JSON data.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
# --- Further steps (beyond this quickstart): ---
# For running a full evaluation, you would typically initialize a `SWEBenchRunner`
# and integrate a `ModelEngine` to test your LLM's code generation.
# This process heavily relies on pre-installed 'conda' and 'docker' for
# environment creation and isolated task execution.
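The evaluation wiring described above can be sketched as follows. This is only an outline under stated assumptions: `SWEBenchRunner` and `ModelEngine` are the names from the Imports section, but the constructor parameters (`model_name`, `tasks`, `engine`) and the `run()` method are assumptions about their interfaces, not documented signatures — check the package's own harness documentation before relying on them.

```python
def run_full_evaluation(data_path, model_name="my-model"):
    """Outline of a full SWE-bench evaluation run.

    NOTE: the constructor arguments and the `run()` call below are
    assumptions about the harness interface, not documented signatures.
    Requires `conda` and `docker` to be installed and configured.
    """
    # Imports are deferred so this sketch can be defined without the
    # full harness (and its conda/docker prerequisites) present.
    from swebench import get_tasks
    from swebench.harness.runner import SWEBenchRunner
    from swebench.harness.engine_wrappers import ModelEngine

    tasks = get_tasks(data_path=data_path)           # previously downloaded tasks
    engine = ModelEngine(model_name=model_name)      # assumed parameter name
    runner = SWEBenchRunner(tasks=tasks, engine=engine)  # assumed parameters
    return runner.run()                              # assumed entry point
```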