SWE-smith

0.0.9 · active · verified Fri Apr 10

SWE-smith is an open-source Python toolkit for generating large-scale software engineering training data. It turns any GitHub repository into a 'SWE-gym' from which users can create an unlimited number of task instances (e.g., file localization, program repair, SWE-bench-style tasks) for training software engineering (SWE) agents. The current version is 0.0.9. The project is actively developed, with frequent updates, and is a NeurIPS 2025 Datasets & Benchmarks Track spotlight. [2, 4, 6]

Quickstart

This quickstart loads the SWE-smith dataset with the `datasets` library and retrieves the `RepoProfile` for each task instance. These are the first steps in working with SWE-smith generated data and typically precede environment creation and agent training. Only the optional `rp.get_container(task)` step requires Docker to be running. [4]

# Example: Loading a SWE-smith dataset and getting a RepoProfile
# Requires 'datasets' to be installed (pip install datasets)
from datasets import load_dataset
from swesmith.profiles import registry

# NOTE: Docker is only needed for the commented-out rp.get_container(task) step below.
# The dataset is large; with streaming=True, records are fetched lazily instead of
# being downloaded up front. Authentication (e.g., a Hugging Face token) might be
# needed depending on dataset access.

# Open the SWE-smith training split as a stream; records are fetched lazily during iteration
try:
    ds = load_dataset("SWE-bench/SWE-smith", split="train", streaming=True)
    print("Dataset stream opened. Processing first few tasks...")
    
    count = 0
    for task in ds:
        if count >= 2:  # Process only the first 2 tasks for quickstart
            break
        print(f"\n--- Processing Task {count + 1} ---")
        print(f"Task ID: {task.get('instance_id', 'N/A')}")
        
        # Get the RepoProfile for the task
        rp = registry.get_from_inst(task)
        print(f"Repository Profile for task: {rp.repo_name}")
        
        # rp.get_container(task) returns a pointer to a Docker container with the
        # task initialized. It requires a running Docker daemon, so it is left
        # commented out to keep this quickstart to a simple printout.
        # container = rp.get_container(task)
        # print(f"Container ID for task: {container.id}")
        print("To get the Docker container, uncomment 'container = rp.get_container(task)'")
        count += 1
except Exception as e:
    print(f"An error occurred during quickstart: {e}")
    print("Please check your network connection and, if the container step is enabled, "
          "that Docker is running. If using a private dataset, ensure you are logged in "
          "(e.g., huggingface-cli login).")
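
Because the container step depends on a running Docker daemon, it can help to check for one before uncommenting `rp.get_container(task)`. The following is a minimal sketch using only the Python standard library. The helper name `docker_is_running` and its `timeout_s` parameter are hypothetical and not part of SWE-smith. It assumes the `docker` CLI is the way Docker is reached on the machine.

```python
import shutil
import subprocess


def docker_is_running(timeout_s: float = 10.0) -> bool:
    """Return True if the docker CLI is on PATH and `docker info` succeeds.

    Hypothetical helper for illustration; it is not part of SWE-smith.
    """
    # No docker executable on PATH means the container step cannot work.
    if shutil.which("docker") is None:
        return False
    try:
        # `docker info` exits non-zero when the daemon cannot be reached.
        result = subprocess.run(
            ["docker", "info"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
            check=False,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Treat launch failures and hangs as "not running".
        return False
    return result.returncode == 0


if __name__ == "__main__":
    if docker_is_running():
        print("Docker is reachable; rp.get_container(task) can be attempted.")
    else:
        print("Docker is not reachable; start it before calling rp.get_container(task).")
```

Running the check once at the top of a script gives a clear message up front instead of a failure partway through the task loop.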
