{"id":2797,"library":"swesmith","title":"SWE-smith","description":"SWE-smith is an open-source Python toolkit for generating large-scale software engineering training data. It lets users turn any GitHub repository into a 'SWE-gym' and create unlimited task instances (e.g., file localization, program repair, SWE-bench) for training Software Engineering (SWE) agents. The current version is 0.0.9, and the project is actively developed, with frequent updates and an upcoming NeurIPS 2025 Datasets & Benchmarks Track spotlight. [2, 4, 6]","status":"active","version":"0.0.9","language":"en","source_language":"en","source_url":"https://github.com/SWE-bench/SWE-smith","tags":["software engineering","AI agents","training data","bug generation","benchmarking","docker"],"install":[{"cmd":"pip install swesmith","lang":"bash","label":"PyPI"},{"cmd":"git clone git@github.com:SWE-bench/SWE-smith.git\ncd SWE-smith\npip install -e '.[all]'","lang":"bash","label":"From Source (with all dependencies)"}],"dependencies":[{"reason":"Required to create execution environments for repositories.","package":"docker","optional":false},{"reason":"Commonly used for loading SWE-smith datasets, e.g., 'SWE-bench/SWE-smith'.","package":"datasets","optional":true},{"reason":"Used for validation and evaluation in conjunction with SWE-smith.","package":"swebench","optional":true}],"imports":[{"note":"This is a common import for accessing repository profiles when working with SWE-smith datasets. [4]","symbol":"registry","correct":"from swesmith.profiles import registry"}],"quickstart":{"code":"# Example: Loading a SWE-smith dataset and getting a RepoProfile\n# Requires 'datasets' to be installed (pip install datasets)\nfrom datasets import load_dataset\nfrom swesmith.profiles import registry\n\n# NOTE: Docker is only needed for the commented-out rp.get_container(task) step.\n# streaming=True iterates over the dataset without downloading it in full first.\n# Authentication (e.g., Hugging Face token) might be needed depending on dataset access.\n\n# Stream the SWE-smith dataset and look at the first few tasks\ntry:\n    ds = load_dataset(\"SWE-bench/SWE-smith\", split=\"train\", streaming=True)\n    print(\"Dataset loaded successfully. Processing first few tasks...\")\n\n    count = 0\n    for task in ds:\n        if count >= 2:  # Process only the first 2 tasks for quickstart\n            break\n        print(f\"\\n--- Processing Task {count + 1} ---\")\n        print(f\"Task ID: {task.get('instance_id', 'N/A')}\")\n\n        # Get the RepoProfile for the task\n        rp = registry.get_from_inst(task)\n        print(f\"Repository Profile for task: {rp.repo_name}\")\n\n        # Getting a pointer to a Docker container with the task initialized\n        # requires Docker and will attempt to create/get a container, so it is\n        # left commented out in this quickstart.\n        # container = rp.get_container(task)\n        # print(f\"Container ID for task: {container.id}\")\n        print(\"To get the Docker container, uncomment 'container = rp.get_container(task)'\")\n        count += 1\nexcept Exception as e:\n    print(f\"An error occurred during quickstart: {e}\")\n    print(\"Please ensure 'datasets' and 'swesmith' are installed. \"\n          \"If using a private dataset, ensure you are logged in (e.g., huggingface-cli login).\")","lang":"python","description":"This quickstart demonstrates how to load a SWE-smith dataset using the `datasets` library and retrieve the `RepoProfile` for a given task instance. It outlines the initial steps for interacting with SWE-smith-generated data, typically leading to environment creation and agent training. Note that full execution, particularly `rp.get_container(task)`, requires Docker to be running. [4]"},"warnings":[{"fix":"Ensure Docker is installed and running, and preferably use a Linux-based OS like Ubuntu 22.04.4 LTS for development. [4]","message":"SWE-smith relies heavily on Docker for creating and managing execution environments. Without Docker, or on an unsupported OS (for example, directly on Windows or macOS), functionality may fail or behave unexpectedly. [4]","severity":"breaking","affected_versions":"All versions"},{"fix":"Refer to the official documentation and quickstart guides for the correct command-line usage patterns for different SWE-smith workflows. [1, 10]","message":"Core workflows such as bug generation, validation, and environment building are usually run as modules via `python -m swesmith.module.submodule`, rather than through direct class instantiation and method calls. [1, 10]","severity":"gotcha","affected_versions":"All versions"},{"fix":"Check the latest documentation and GitHub repository for specific language support details and available features for your target language. [6, 8]","message":"While the core library is Python-focused, SWE-smith is expanding to support other programming languages (Go, JavaScript, Rust, C, C++, C#, Java, PHP). Functionality and bug generation strategies may differ, or still be under active development, for non-Python languages. [6, 8]","severity":"gotcha","affected_versions":"All versions"}],"env_vars":null,"last_verified":"2026-04-10T00:00:00.000Z","next_check":"2026-07-09T00:00:00.000Z"}