Frequently Asked Questions
General Questions
What is SWE-bench?
SWE-bench is a benchmark for evaluating large language models on real-world software engineering tasks. It consists of GitHub issues and their corresponding fixes, allowing LLMs to be evaluated on their ability to generate patches that resolve these issues.
Which datasets are available?
SWE-bench offers four main datasets:
- SWE-bench: The full benchmark with 2,294 instances
- SWE-bench Lite: A smaller subset with 300 instances for faster evaluation
- SWE-bench Verified: 500 instances verified by engineers as solvable
- SWE-bench Multimodal: 100 development instances with screenshots and UI elements (test-split evaluation runs through the SWE-bench API)
How does the evaluation work?
The evaluation process:
1. Sets up a Docker environment for the repository
2. Applies the model's generated patch
3. Runs the repository's test suite
4. Determines whether the patch successfully resolves the issue
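Conceptually, each instance is checked in its own container. The sketch below only illustrates that flow; the image tag, patch path, and test command are hypothetical placeholders, not the harness's actual internals:
import subprocess

def evaluate_instance(image_tag: str, patch_path: str, test_cmd: str) -> bool:
    """Illustrative per-instance flow: apply a patch inside a container and run the tests."""
    # patch_path should be an absolute host path; the instance image is assumed to
    # start in the repository's working directory.
    result = subprocess.run(
        [
            "docker", "run", "--rm",
            "-v", f"{patch_path}:/tmp/patch.diff",
            image_tag,
            "bash", "-c", f"git apply /tmp/patch.diff && {test_cmd}",
        ],
        capture_output=True,
    )
    # A zero exit code stands in for "the required tests passed".
    return result.returncode == 0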
What metrics are reported?
Key metrics include:
- Total instances: Number of instances in the dataset
- Instances submitted: Number of instances the model attempted
- Instances completed: Number of instances that completed the evaluation process
- Instances resolved: Number of instances where the model's patch fixed the issue
- Instances unresolved: Number of instances where the evaluation completed but the issue wasn't fixed
- Instances with empty patches: Number of instances where the model returned an empty patch
- Instances with errors: Number of instances where the evaluation encountered errors
- Resolution rate: The percentage of submitted instances that were successfully resolved
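As a quick sketch of how the headline number relates to these counts (the field names below are illustrative, not the exact keys of the harness's report file):
def resolution_rate(report: dict) -> float:
    """Percentage of submitted instances that were resolved."""
    submitted = report["instances_submitted"]
    resolved = report["instances_resolved"]
    return 100.0 * resolved / submitted if submitted else 0.0

# Toy numbers, for illustration only
print(resolution_rate({"instances_submitted": 300, "instances_resolved": 120}))  # 40.0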
Installation and Setup
How can I save disk space when using Docker?
Use these commands to clean up Docker resources:
# Remove unused containers
docker container prune
# Remove unused images
docker image prune
# Remove all unused Docker objects
docker system prune
You can also set --cache_level=env and --clean=True when running swebench.harness.run_evaluation so that instance images are removed after they are used. Each run will take longer, but it will use less disk space.
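For example (the predictions file and run ID are placeholders; the cache and clean flags are the ones described above):
python -m swebench.harness.run_evaluation \
    --predictions_path predictions.jsonl \
    --run_id low_disk_run \
    --cache_level env \
    --clean True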
Using the Benchmark
How do I run evaluations on my own model?
Generate predictions in the required format and use the evaluation harness:
from swebench.harness.run_evaluation import run_evaluation
predictions = [
    {
        "instance_id": "repo_owner_name__repo_name-issue_number",
        "model_name_or_path": "my-model-name",
        "model_patch": "<unified diff patch here>"
    }
]

results = run_evaluation(
    predictions=predictions,
    dataset_name="princeton-nlp/SWE-bench_Lite",
)
In practice, it is easier to save the predictions to a JSONL file and run the evaluation harness on that file from the command line (a run ID is required to label the results):
python -m swebench.harness.run_evaluation --predictions_path predictions.jsonl --run_id my_run
Can I run evaluations without using Docker?
No. Docker is required for consistent evaluation environments. This ensures that the evaluation is reproducible across different systems.
How can I speed up evaluations?
- Use SWE-bench Lite instead of the full benchmark
- Increase parallelism with the max_workers parameter (see the example below)
- Use Modal for cloud-based evaluations
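For example, to evaluate on SWE-bench Lite with more parallel workers (the run ID is a placeholder; tune the worker count to your machine):
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_Lite \
    --predictions_path predictions.jsonl \
    --run_id fast_run \
    --max_workers 12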
What format should my model's output be in?
Your model should produce a patch that can be applied cleanly to the repository's original code at the instance's base commit. We recommend the unified diff format generated by git diff.
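For example, a minimal patch in that format might look like this (the file and function names are hypothetical):
diff --git a/src/calculator.py b/src/calculator.py
--- a/src/calculator.py
+++ b/src/calculator.py
@@ -12,2 +12,2 @@
 def subtract(a, b):
-    return a + b
+    return a - b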
Troubleshooting
My evaluation is stuck or taking too long
Try:
- Reducing the number of parallel workers
- Checking Docker resource limits
- Making sure you have enough disk space
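For example, you can check Docker's disk usage and live container resource consumption:
# Show how much space images, containers, and volumes are using
docker system df
# Show live CPU and memory usage per running container
docker stats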
Docker build is failing with network errors
Make sure your Docker network is properly configured and you have internet access. You can try:
docker network ls
docker network inspect bridge
Advanced Usage
How do I use cloud-based evaluation?
Use Modal for cloud-based evaluations:
from swebench.harness.modal_eval.run_modal import run_modal_evaluation
results = run_modal_evaluation(
    predictions=predictions,
    dataset_name="princeton-nlp/SWE-bench_Lite",
    parallelism=10
)
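Depending on the harness version you have installed, the command-line entry point may also be able to dispatch runs to Modal. The flag below is an assumption based on current SWE-bench documentation, so check python -m swebench.harness.run_evaluation --help first:
python -m swebench.harness.run_evaluation \
    --predictions_path predictions.jsonl \
    --run_id modal_run \
    --modal true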
How do I contribute to SWE-bench?
You can contribute by:
- Reporting issues or bugs
- Debugging installation and testing issues as you run into them
- Suggesting improvements to the benchmark or documentation
I want to build a custom dataset. How do I do that?
Email us at support@swebench.com and we can talk about helping you build a custom dataset.