Getting Started
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.
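To make the task concrete, the minimal sketch below loads one task instance from the Hugging Face test split and prints the issue text a model receives alongside the gold patch that resolved it. The field names (instance_id, problem_statement, patch) follow the published dataset card; see the Datasets guide for the authoritative schema.

from datasets import load_dataset

# Load the full SWE-bench test split (2,294 task instances)
swebench = load_dataset('princeton-nlp/SWE-bench', split='test')

instance = swebench[0]
print(instance['instance_id'])        # e.g. 'astropy__astropy-12907' (repo plus pull request number)
print(instance['problem_statement'])  # the GitHub issue text given to the model
print(instance['patch'])              # the gold patch that resolved the issue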
🔍 All of the Projects
Check out the other projects that are part of the SWE-bench ecosystem!
🏆 Leaderboard
You can find the full leaderboard at swebench.com!
📋 Overview
SWE-bench provides:
- ✅ Real-world GitHub issues - Evaluate LLMs on actual software engineering tasks
- ✅ Reproducible evaluation - Docker-based evaluation harness for consistent results
- ✅ Multiple datasets - SWE-bench, SWE-bench Lite, SWE-bench Verified, and SWE-bench Multimodal
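Each variant is published as a separate Hugging Face dataset, so switching between them only changes the dataset ID passed to load_dataset. The hub names below are assumptions based on the princeton-nlp organization's releases; the Datasets page has the authoritative list.

from datasets import load_dataset

# Assumed Hugging Face dataset IDs for each variant; confirm on the Datasets page
DATASETS = {
    'full':       'princeton-nlp/SWE-bench',             # 2,294 task instances
    'lite':       'princeton-nlp/SWE-bench_Lite',        # 300-instance subset for faster runs
    'verified':   'princeton-nlp/SWE-bench_Verified',    # 500 human-validated problems
    'multimodal': 'princeton-nlp/SWE-bench_Multimodal',  # issues involving visual assets
}

lite = load_dataset(DATASETS['lite'], split='test')
print(len(lite))  # 300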
📰 Latest News
- [Jan. 13, 2025]: SWE-bench Multimodal integration with private test split evaluation
- [Jan. 11, 2025]: Cloud-based evaluations via Modal
- [Aug. 13, 2024]: SWE-bench Verified release with 500 problems confirmed solvable by human engineers
- [Jun. 27, 2024]: Fully containerized evaluation harness using Docker
- [Apr. 2, 2024]: SWE-agent release with state-of-the-art results
- [Jan. 16, 2024]: SWE-bench accepted to ICLR 2024 as an oral presentation
🚀 Quick Start
# Access SWE-bench via Hugging Face
from datasets import load_dataset
swebench = load_dataset('princeton-nlp/SWE-bench', split='test')

# Install SWE-bench from source (the evaluation harness itself runs in Docker)
git clone git@github.com:princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
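With the package installed, model predictions are scored by the containerized harness. The sketch below is one assumed way to invoke the harness CLI from Python, using the 'gold' predictions shortcut as a smoke test of a local Docker setup; the flag names reflect my reading of the current CLI, so consult the Evaluation guide or --help for the definitive options.

import subprocess

# Smoke-test the Docker-based harness on SWE-bench Lite using the gold patches
# as predictions. Flag names are assumptions; run the module with --help to confirm.
subprocess.run(
    [
        'python', '-m', 'swebench.harness.run_evaluation',
        '--dataset_name', 'princeton-nlp/SWE-bench_Lite',
        '--predictions_path', 'gold',   # evaluate the gold patches themselves
        '--max_workers', '4',           # number of parallel Docker workers
        '--run_id', 'gold-smoke-test',  # label used for the results output
    ],
    check=True,  # raise CalledProcessError if the harness exits non-zero
)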
📚 Documentation Structure
- Installation - Setup instructions for local and cloud environments
- Guides
  - Quickstart - Get started with SWE-bench
  - Evaluation - How to evaluate models on SWE-bench
  - Docker Setup - Configure Docker for SWE-bench
  - Datasets - Available datasets and how to use them
  - Create RAG Datasets - Build your own retrieval datasets
- Reference
  - Harness API - Documentation for the evaluation harness
  - Inference API - Documentation for model inference
  - Versioning - Documentation for versioning
- FAQ - Frequently asked questions
⬇️ Available Resources
💫 Contributing
We welcome contributions from the NLP, Machine Learning, and Software Engineering communities! Please check our contributing guidelines for details.
✍️ Citation
@inproceedings{
jimenez2024swebench,
title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=VTF8yNQM66}
}
@inproceedings{
yang2024swebenchmultimodal,
title={{SWE}-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?},
author={John Yang and Carlos E. Jimenez and Alex L. Zhang and Kilian Lieret and Joyce Yang and Xindi Wu and Ori Press and Niklas Muennighoff and Gabriel Synnaeve and Karthik R. Narasimhan and Diyi Yang and Sida I. Wang and Ofir Press},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=riTiq3i21b}
}