# SWE-bench Datasets
SWE-bench offers multiple datasets for evaluating language models on software engineering tasks. This guide explains the different datasets and how to use them.
## Available Datasets
SWE-bench provides several dataset variants:
| Dataset | Description | Size | Use Case |
|---|---|---|---|
| SWE-bench | Full benchmark with diverse repositories | 2,294 instances | Comprehensive evaluation |
| SWE-bench Lite | Smaller subset for quick evaluations | 300 instances | Faster iteration, development |
| SWE-bench Verified | Expert-verified solvable problems | 500 instances | High-quality evaluation |
| SWE-bench Multimodal | Includes screenshots and UI elements | 102 dev instances (510 test) | Testing multimodal capabilities |
## Accessing Datasets
All datasets are available on Hugging Face:
```python
from datasets import load_dataset

# Load the full benchmark
sbf = load_dataset('princeton-nlp/SWE-bench')

# Load the Lite variant
sbl = load_dataset('princeton-nlp/SWE-bench_Lite')

# Load the Verified variant
sbv = load_dataset('princeton-nlp/SWE-bench_Verified', split='test')

# Load the Multimodal variant
sbm_dev = load_dataset('princeton-nlp/SWE-bench_Multimodal', split='dev')
sbm_test = load_dataset('princeton-nlp/SWE-bench_Multimodal', split='test')
```
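Each split behaves like a sequence of instance dictionaries. As a quick sanity check after loading, you can tally instances per repository; the sketch below uses hand-made sample instances (hypothetical IDs), since iterating a real split requires downloading it first:

```python
from collections import Counter

# Hypothetical sample instances; a loaded split yields dictionaries
# with the same keys (plus the rest of the schema shown below).
instances = [
    {"instance_id": "django__django-11001", "repo": "django/django"},
    {"instance_id": "django__django-11019", "repo": "django/django"},
    {"instance_id": "sympy__sympy-13031", "repo": "sympy/sympy"},
]

# Count how many task instances come from each repository.
repo_counts = Counter(inst["repo"] for inst in instances)
print(repo_counts.most_common())  # [('django/django', 2), ('sympy/sympy', 1)]
```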
## Dataset Structure
Each instance in the datasets has the following structure:
```python
{
    "instance_id": "owner__repo-pr_number",
    "repo": "owner/repo",
    "issue_id": issue_number,
    "base_commit": "commit_hash",
    "problem_statement": "Issue description...",
    "version": "Repository package version",
    "issue_url": "GitHub issue URL",
    "pr_url": "GitHub pull request URL",
    "patch": "Gold solution patch (don't look at this if you're trying to solve the problem)",
    "test_patch": "Test patch",
    "created_at": "Date of creation",
    "FAIL_TO_PASS": "Tests that fail before and pass after the gold patch is applied",
    "PASS_TO_PASS": "Tests that pass both before and after the gold patch is applied"
}
```
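In the Hugging Face release, the `FAIL_TO_PASS` and `PASS_TO_PASS` fields are serialized as JSON-encoded strings rather than Python lists. A minimal helper to decode them (the sample instance and its test names are made up):

```python
import json

# Hypothetical instance; the test-list fields mirror how the
# Hugging Face release serializes them (JSON-encoded strings).
instance = {
    "instance_id": "sympy__sympy-13031",
    "FAIL_TO_PASS": '["test_hstack", "test_vstack"]',
    "PASS_TO_PASS": '["test_shape"]',
}

def target_tests(inst):
    """Decode the serialized test lists into Python lists."""
    return json.loads(inst["FAIL_TO_PASS"]), json.loads(inst["PASS_TO_PASS"])

fail_to_pass, pass_to_pass = target_tests(instance)
print(fail_to_pass)  # ['test_hstack', 'test_vstack']
```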
SWE-bench Verified also includes:
```python
{
    # ... standard fields above ...
    "difficulty": "Difficulty level"
}
```
The multimodal dataset also includes:
```python
{
    # ... standard fields above ...
    "image_assets": {
        "problem_statement": ["url1", "url2", ...],
        "patch": ["url1", "url2", ...],
        "test_patch": ["url1", "url2", ...]
    }
}
```
Note that for the `test` split of the multimodal dataset, the `patch`, `test_patch`, and `image_assets` fields are empty.
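To fetch the screenshots for a multimodal instance, you typically want all URLs across the three per-field lists. A sketch over a hypothetical instance (placeholder URLs; it assumes `image_assets` has already been decoded into a dict):

```python
# Hypothetical multimodal instance (URLs are placeholders).
instance = {
    "instance_id": "chartjs__Chart.js-0000",
    "image_assets": {
        "problem_statement": ["https://example.com/screenshot1.png"],
        "patch": [],
        "test_patch": ["https://example.com/expected_render.png"],
    },
}

def all_image_urls(inst):
    """Flatten the per-field URL lists into one deduplicated, ordered list."""
    seen, urls = set(), []
    for field_urls in inst["image_assets"].values():
        for url in field_urls:
            if url not in seen:
                seen.add(url)
                urls.append(url)
    return urls

print(all_image_urls(instance))
```

On the `test` split, where `image_assets` is empty, the helper simply returns an empty list.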
## Paper's Retrieval Datasets
To load the "oracle" and BM25 retrieval datasets used in the SWE-bench paper:
```python
from datasets import load_dataset

# Load the oracle retrieval dataset
oracle_retrieval = load_dataset('princeton-nlp/SWE-bench_oracle', split='test')

# Load the BM25 retrieval datasets
sbf_bm25_13k = load_dataset('princeton-nlp/SWE-bench_bm25_13K', split='test')
sbf_bm25_27k = load_dataset('princeton-nlp/SWE-bench_bm25_27K', split='test')
sbf_bm25_40k = load_dataset('princeton-nlp/SWE-bench_bm25_40K', split='test')
```
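These variants bundle retrieved code context with each issue, with the suffix (13K/27K/40K) indicating the approximate token budget of the retrieved context. A rough way to sanity-check that instances fit a model's context window, sketched over hypothetical instances (the `text` field name is an assumption about the release format, and the 4-characters-per-token ratio is only a rule of thumb):

```python
# Hypothetical retrieval-augmented instances; the 'text' field name
# is an assumption about the release format.
instances = [
    {"instance_id": "a", "text": "issue body\n<code>file1.py ...</code>"},
    {"instance_id": "b", "text": "issue body\n<code>file2.py ...</code>" * 3},
]

def approx_token_count(inst):
    """Estimate token count at ~4 characters per token (rule of thumb)."""
    return len(inst["text"]) // 4

for inst in instances:
    print(inst["instance_id"], approx_token_count(inst))
```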