Getting Started
SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.
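To make the task concrete, the minimal sketch below loads one task instance from the Hugging Face test split and prints the issue text a model receives alongside the gold patch that resolved it. The field names (instance_id, problem_statement, patch) follow the published dataset card; see the Datasets guide for the authoritative schema.

from datasets import load_dataset

# Load the full SWE-bench test split (2,294 task instances)
swebench = load_dataset('princeton-nlp/SWE-bench', split='test')

instance = swebench[0]
print(instance['instance_id'])        # e.g. 'astropy__astropy-12907' (repo plus pull request number)
print(instance['problem_statement'])  # the GitHub issue text given to the model
print(instance['patch'])              # the gold patch that resolved the issue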
🔍 All of the Projects
Check out the other projects that are part of the SWE-bench ecosystem!
🏆 Leaderboard
You can find the full leaderboard at swebench.com!
📋 Overview
SWE-bench provides:
- ✅ Real-world GitHub issues - Evaluate LLMs on actual software engineering tasks
- ✅ Reproducible evaluation - Docker-based evaluation harness for consistent results
- ✅ Multiple datasets - SWE-bench, SWE-bench Lite, SWE-bench Verified, and SWE-bench Multimodal
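Each variant is published as a separate Hugging Face dataset, so switching between them only changes the dataset ID passed to load_dataset. The hub names below are assumptions based on the princeton-nlp organization's releases; the Datasets page has the authoritative list.

from datasets import load_dataset

# Assumed Hugging Face dataset IDs for each variant; confirm on the Datasets page
DATASETS = {
    'full':       'princeton-nlp/SWE-bench',             # 2,294 task instances
    'lite':       'princeton-nlp/SWE-bench_Lite',        # 300-instance subset for faster runs
    'verified':   'princeton-nlp/SWE-bench_Verified',    # 500 human-validated problems
    'multimodal': 'princeton-nlp/SWE-bench_Multimodal',  # issues involving visual assets
}

lite = load_dataset(DATASETS['lite'], split='test')
print(len(lite))  # 300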
📰 Latest News
- [Jan. 13, 2025]: SWE-bench Multimodal integration with private test split evaluation
- [Jan. 11, 2025]: Cloud-based evaluations via Modal
- [Aug. 13, 2024]: SWE-bench Verified release with 500 problems confirmed solvable by human engineers
- [Jun. 27, 2024]: Fully containerized evaluation harness using Docker
- [Apr. 2, 2024]: SWE-agent release with state-of-the-art results
- [Jan. 16, 2024]: SWE-bench accepted to ICLR 2024 as an oral presentation
🚀 Quick Start
# Access SWE-bench via Hugging Face
from datasets import load_dataset
swebench = load_dataset('princeton-nlp/SWE-bench', split='test')

# Install SWE-bench from source (the evaluation harness itself runs in Docker)
git clone git@github.com:princeton-nlp/SWE-bench.git
cd SWE-bench
pip install -e .
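With the package installed, model predictions are scored by the containerized harness. The sketch below is one assumed way to invoke the harness CLI from Python, using the 'gold' predictions shortcut as a smoke test of a local Docker setup; the flag names reflect my reading of the current CLI, so consult the Evaluation guide or --help for the definitive options.

import subprocess

# Smoke-test the Docker-based harness on SWE-bench Lite using the gold patches
# as predictions. Flag names are assumptions; run the module with --help to confirm.
subprocess.run(
    [
        'python', '-m', 'swebench.harness.run_evaluation',
        '--dataset_name', 'princeton-nlp/SWE-bench_Lite',
        '--predictions_path', 'gold',   # evaluate the gold patches themselves
        '--max_workers', '4',           # number of parallel Docker workers
        '--run_id', 'gold-smoke-test',  # label used for the results output
    ],
    check=True,  # raise CalledProcessError if the harness exits non-zero
)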
📚 Documentation Structure
- Installation - Setup instructions for local and cloud environments
- Guides
  - Quickstart - Get started with SWE-bench
  - Evaluation - How to evaluate models on SWE-bench
  - Docker Setup - Configure Docker for SWE-bench
  - Datasets - Available datasets and how to use them
  - Create RAG Datasets - Build your own retrieval datasets
- Reference
  - Harness API - Documentation for the evaluation harness
  - Inference API - Documentation for model inference
  - Versioning - Documentation for versioning
- FAQ - Frequently asked questions
⬇️ Available Resources
💫 Contributing
We welcome contributions from the NLP, Machine Learning, and Software Engineering communities! Please check our contributing guidelines for details.
✍️ Citation
@inproceedings{
jimenez2024swebench,
title={{SWE}-bench: Can Language Models Resolve Real-world Github Issues?},
author={Carlos E Jimenez and John Yang and Alexander Wettig and Shunyu Yao and Kexin Pei and Ofir Press and Karthik R Narasimhan},
booktitle={The Twelfth International Conference on Learning Representations},
year={2024},
url={https://openreview.net/forum?id=VTF8yNQM66}
}
@inproceedings{
yang2024swebenchmultimodal,
title={{SWE}-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?},
author={John Yang and Carlos E. Jimenez and Alex L. Zhang and Kilian Lieret and Joyce Yang and Xindi Wu and Ori Press and Niklas Muennighoff and Gabriel Synnaeve and Karthik R. Narasimhan and Diyi Yang and Sida I. Wang and Ofir Press},
booktitle={The Thirteenth International Conference on Learning Representations},
year={2025},
url={https://openreview.net/forum?id=riTiq3i21b}
}