Frequently Asked Questions
General Questions
What is SWE-bench?
SWE-bench is a benchmark for evaluating large language models on real-world software engineering tasks. It consists of GitHub issues and their corresponding fixes, allowing LLMs to be evaluated on their ability to generate patches that resolve these issues.
Which datasets are available?
SWE-bench offers four main datasets:
- SWE-bench: The full benchmark with 2,294 instances
- SWE-bench Lite: A smaller subset with 300 instances for faster evaluation
- SWE-bench Verified: 500 instances verified by engineers as solvable
- SWE-bench Multimodal: 100 development instances with screenshots and UI elements (test-split evaluation runs through the SWE-bench API)
How does the evaluation work?
The evaluation process:
1. Sets up a Docker environment for the repository
2. Applies the model's generated patch
3. Runs the repository's test suite
4. Determines whether the patch successfully resolves the issue
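Conceptually, each instance is checked in its own container. The sketch below only illustrates that flow; the image tag, patch path, and test command are hypothetical placeholders, not the harness's actual internals:
import subprocess

def evaluate_instance(image_tag: str, patch_path: str, test_cmd: str) -> bool:
    """Illustrative per-instance flow: apply a patch inside a container and run the tests."""
    # patch_path should be an absolute host path; the instance image is assumed to
    # start in the repository's working directory.
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{patch_path}:/tmp/patch.diff",
            image_tag,
            "bash", "-c", f"git apply /tmp/patch.diff && {test_cmd}",
        ],
        capture_output=True,
    )
    # A zero exit code stands in for "the required tests passed".
    return result.returncode == 0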
What metrics are reported?
Key metrics include:
- Total instances: Number of instances in the dataset
- Instances submitted: Number of instances the model attempted
- Instances completed: Number of instances that completed the evaluation process
- Instances resolved: Number of instances where the model's patch fixed the issue
- Instances unresolved: Number of instances where the evaluation completed but the issue wasn't fixed
- Instances with empty patches: Number of instances where the model returned an empty patch
- Instances with errors: Number of instances where the evaluation encountered errors
- Resolution rate: The percentage of submitted instances that were successfully resolved
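As a quick sketch of how the headline number relates to these counts (the field names below are illustrative, not the exact keys of the harness's report file):
def resolution_rate(report: dict) -> float:
    """Percentage of submitted instances that were resolved."""
    submitted = report["instances_submitted"]
    resolved = report["instances_resolved"]
    return 100.0 * resolved / submitted if submitted else 0.0

# Toy numbers, for illustration only
print(resolution_rate({"instances_submitted": 300, "instances_resolved": 120}))  # 40.0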
Installation and Setup
How can I save disk space when using Docker?
Use these commands to clean up Docker resources:
# Remove unused containers
docker container prune
# Remove unused images
docker image prune
# Remove all unused Docker objects
docker system prune
You can also set --cache_level=env and --clean=True when running swebench.harness.run_evaluation so that instance images are removed after they are used. Each run will take longer, but it will use less disk space.
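For example (the predictions file and run ID are placeholders; the cache and clean flags are the ones described above):
python -m swebench.harness.run_evaluation \
    --predictions_path predictions.jsonl \
    --run_id low_disk_run \
    --cache_level env \
    --clean True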
Using the Benchmark
How do I run evaluations on my own model?
Generate predictions in the required format and use the evaluation harness:
from swebench.harness.run_evaluation import run_evaluation
predictions = [
    {
        "instance_id": "repo_owner_name__repo_name-issue_number",
        "model_name_or_path": "my-model-name",
        "model_patch": "<unified diff patch here>"
    }
]

results = run_evaluation(
    predictions=predictions,
    dataset_name="princeton-nlp/SWE-bench_Lite",
)
In practice, it is easier to save the predictions to a JSONL file and run the evaluation harness on that file from the command line (a run ID is required to label the results):
python -m swebench.harness.run_evaluation --predictions_path predictions.jsonl --run_id my_run
Can I run evaluations without using Docker?
No. Docker is required for consistent evaluation environments. This ensures that the evaluation is reproducible across different systems.
How can I speed up evaluations?
- Use SWE-bench Lite instead of the full benchmark
- Increase parallelism with the max_workers parameter (see the example below)
- Use Modal for cloud-based evaluations
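For example, to evaluate on SWE-bench Lite with more parallel workers (the run ID is a placeholder; tune the worker count to your machine):
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path predictions.jsonl \
    --run_id fast_run \
    --max_workers 12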
What format should my model's output be in?
Your model should produce a patch that can be applied cleanly to the repository's original code at the instance's base commit. We recommend the unified diff format generated by git diff.
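For example, a minimal patch in that format might look like this (the file and function names are hypothetical):
diff --git a/src/calculator.py b/src/calculator.py
--- a/src/calculator.py
+++ b/src/calculator.py
@@ -12,2 +12,2 @@
 def subtract(a, b):
-    return a + b
+    return a - b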
Troubleshooting
My evaluation is stuck or taking too long
Try:
- Reducing the number of parallel workers
- Checking Docker resource limits
- Making sure you have enough disk space
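For example, you can check Docker's disk usage and live container resource consumption:
# Show how much space images, containers, and volumes are using
docker system df
# Show live CPU and memory usage per running container
docker stats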
Docker build is failing with network errors
Make sure your Docker network is properly configured and you have internet access. You can try:
docker network ls
docker network inspect bridge
Advanced Usage
How do I use cloud-based evaluation?
Use Modal for cloud-based evaluations:
from swebench.harness.modal_eval.run_modal import run_modal_evaluation
results = run_modal_evaluation(
    predictions=predictions,
    dataset_name="princeton-nlp/SWE-bench_Lite",
    parallelism=10
)
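Depending on the harness version you have installed, the command-line entry point may also be able to dispatch runs to Modal. The flag below is an assumption based on current SWE-bench documentation, so check python -m swebench.harness.run_evaluation --help first:
python -m swebench.harness.run_evaluation \
    --predictions_path predictions.jsonl \
    --run_id modal_run \
    --modal true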
How do I contribute to SWE-bench?
You can contribute by:
- Reporting issues or bugs
- Debugging installation and testing issues as you run into them
- Suggesting improvements to the benchmark or documentation
I want to build a custom dataset. How do I do that?
Email us at support@swebench.com and we can talk about helping you build a custom dataset.