SWE-bench

4.1.0 · active · verified Thu Apr 09

The official SWE-bench package (current version 4.1.0) provides a benchmark for evaluating large language models (LLMs) on software engineering tasks: it automatically tests model-generated code fixes against real-world software bugs. The project is actively developed, with frequent releases that often involve significant changes between major versions.

Quickstart

This quickstart demonstrates how to programmatically load SWE-bench tasks after downloading the dataset using the `swebench download` CLI command. It prints basic information about the loaded tasks or guides the user if the data isn't found. Full evaluation with `SWEBenchRunner` and `ModelEngine` requires `conda` and `docker`.

import os
from swebench import get_tasks

# --- Quickstart: Accessing SWE-bench data ---
# Note: SWE-bench data must be downloaded separately using the CLI:
# `swebench download`
# This command typically creates a 'data' directory in your current working directory.
# Adjust data_path if your data is located elsewhere (e.g., specific split like lite).
data_path = os.path.join(os.getcwd(), 'data', 'default_swebench_tasks.json')
# You might also want to use 'lite_swebench_tasks.json' for the smaller lite split.

tasks = []
try:
    # Attempt to load tasks from the specified path
    tasks = get_tasks(data_path=data_path)
    print(f"Successfully loaded {len(tasks)} tasks from {data_path}")
    if tasks:
        print("\nExample task structure (first task):")
        # Print a subset of a task's keys for brevity
        first_task = tasks[0]
        for key in ['repo', 'pull_request', 'instance_id', 'problem_statement', 'base_commit']:
            if key in first_task:
                # Coerce to str before slicing, in case a field is not a string.
                value = str(first_task[key])
                print(f"  {key}: {value[:100]}{'...' if len(value) > 100 else ''}")
except FileNotFoundError:
    print(f"Error: Data file not found at {data_path}.")
    print("Please ensure you have run `swebench download` in your terminal.")
    print("Or specify the correct path to your downloaded SWE-bench JSON data.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")
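Once loaded, tasks are plain dictionaries, so standard Python is enough to slice the benchmark by repository. A minimal sketch, using illustrative stand-in records in place of a real download (the field names `repo` and `instance_id` follow the task structure printed above; the sample values are hypothetical):

```python
from collections import Counter

# Stand-in records mimicking the task structure shown above; real data
# would come from `get_tasks` after `swebench download`.
sample_tasks = [
    {"repo": "astropy/astropy", "instance_id": "astropy__astropy-0001"},
    {"repo": "django/django", "instance_id": "django__django-0001"},
    {"repo": "django/django", "instance_id": "django__django-0002"},
]

# Count tasks per repository.
by_repo = Counter(task["repo"] for task in sample_tasks)
print(by_repo["django/django"])  # 2

# Select only the tasks from one repository.
django_tasks = [t for t in sample_tasks if t["repo"] == "django/django"]
print(len(django_tasks))  # 2
```

The same pattern works for any task field, e.g. filtering on `instance_id` prefixes to pick a subset for a quick smoke test.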

# --- Further steps (beyond this quickstart): ---
# For running a full evaluation, you would typically initialize a `SWEBenchRunner`
# and integrate a `ModelEngine` to test your LLM's code generation.
# This process heavily relies on pre-installed 'conda' and 'docker' for
# environment creation and isolated task execution.
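Since full evaluation depends on `conda` and `docker` being installed, it can help to verify both are on the PATH before going further. A small stdlib-only sketch (independent of the swebench API):

```python
import shutil

# Full evaluation requires `conda` and `docker`; shutil.which returns
# the resolved executable path, or None if the tool is not on the PATH.
for tool in ("conda", "docker"):
    path = shutil.which(tool)
    status = f"found at {path}" if path else "MISSING"
    print(f"{tool}: {status}")
```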
