Docker Setup Guide
Introduction
SWE-bench uses a Docker-based evaluation harness to ensure consistent, reproducible results across different platforms. This containerized approach eliminates environment discrepancies and provides isolated environments for each evaluation task.
Prerequisites
Before setting up Docker for SWE-bench, ensure you have:
- Docker installed on your system (Docker installation guide)
- For Linux users, follow the post-installation steps
- Sufficient disk space (at least 120GB free)
- Adequate system resources (16GB+ RAM recommended)
Docker Installation
macOS
- Download and install Docker Desktop for Mac from the official website
- Increase resource allocation in Docker Desktop settings: - Open Docker Desktop preferences - Go to Resources > Advanced - Allocate at least 8 CPUs and 16GB RAM - Set disk image size to at least 120GB
Linux
- Install Docker using your distribution's package manager or follow the official guide
- Add your user to the docker group to run Docker without sudo:
sudo groupadd docker sudo usermod -aG docker $USER newgrp docker # Apply changes without logging out
Windows
- Install Docker Desktop for Windows from the official website
- Ensure WSL 2 is installed and configured
- Increase resource allocation in Docker Desktop settings: - Open Docker Desktop settings - Go to Resources > Advanced - Allocate at least 8 CPUs and 16GB RAM - Set disk image size to at least 120GB
Testing Your Docker Installation
Verify your Docker installation with these commands:
# Check Docker version
docker --version
# Run a simple test container
docker run hello-world
# Check available disk space
docker system df
Docker Resource Management
Understanding SWE-bench's Docker Usage
The SWE-bench evaluation harness builds Docker images in three layers:
- Base image: Common dependencies for all evaluations
- Environment images: Python environments for different configurations (~60 images)
- Instance images: Specific dependencies for each evaluation task
These images require significant disk space, so it's important to understand how to manage them.
Resource Management Commands
Useful commands for managing Docker resources:
# View Docker disk usage
docker system df
# Remove unused containers
docker container prune
# Remove unused images
docker image prune
# Remove all unused Docker objects (containers, images, networks, volumes)
docker system prune
# Remove all stopped containers
docker container prune
# Remove all dangling images
docker image prune
Cache Level Configuration
SWE-bench provides different caching options to balance speed vs. storage:
Cache Level | Description | Storage Impact | Performance |
---|---|---|---|
none |
No image caching | Minimal (~120GB during run) | Slowest |
base |
Cache only base image | Minimal (~120GB during run) | Slow |
env (default) |
Cache base and environment images | Moderate (~100GB) | Moderate |
instance |
Cache all images | High (~2,000GB) | Fastest |
Set the cache level when running the evaluation:
python -m swebench.harness.run_evaluation \
--predictions_path <path_to_predictions> \
--cache_level env \
--clean True
For most users, the default env
setting provides a good balance between evaluation speed and disk usage.
Performance Optimization
Setting the Right Number of Workers
The optimal number of workers depends on your system resources:
- Use fewer than
min(0.75 * os.cpu_count(), 24)
workers - For an 8-core machine, 6 workers is typically appropriate
- For a 16-core machine, 12 workers is typically appropriate
python -m swebench.harness.run_evaluation \
--predictions_path <path_to_predictions> \
--max_workers 8
Increasing worker count beyond your system's capabilities can actually slow down evaluation due to resource contention.
Troubleshooting Docker Issues
Common Problems and Solutions
-
Insufficient disk space: - Free up disk space or increase Docker Desktop's disk image size - Use
--cache_level=env
or--cache_level=base
to reduce storage needs -
Docker build failures: - Check network connectivity - Inspect build logs in
logs/build_images
-
Permission issues: - Ensure your user is in the docker group (Linux) - Run with elevated privileges if necessary
-
Slow evaluation times: - Reduce the number of parallel workers - Check CPU and memory usage during evaluation - Consider using a more powerful machine
-
Network-related issues: - Check Docker network settings:
docker network ls docker network inspect bridge
Cleaning Up After Evaluation
To reclaim disk space after running evaluations:
# Remove all unused Docker resources
docker system prune -a
# Or for more control, remove specific resources
docker container prune # Remove all stopped containers
docker image prune # Remove unused images
You can also set --clean=True
when running the evaluation to automatically clean up instance-specific resources.