Inference API Reference
This section documents the various tools available for running inference with SWE-bench datasets.
Overview
The inference module provides tools to generate model completions for SWE-bench tasks using:
- API-based models (OpenAI, Anthropic)
- Local models (SWE-Llama)
- Live inference on open GitHub issues
In particular, we provide the following important scripts and sub-packages:
- make_datasets: Contains scripts to generate new datasets for SWE-bench inference with your own prompts and issues
- run_api.py: Generates completions using API models (OpenAI, Anthropic) for a given dataset
- run_llama.py: Runs inference using Llama models (e.g., SWE-Llama)
- run_live.py: Generates model completions for new issues on GitHub in real time
Installation
Depending on your inference needs, you can install different dependency sets:
# For dataset generation and API-based inference
pip install -e ".[make_datasets]"
# For local model inference (requires GPU with CUDA)
pip install -e ".[inference]"
Available Tools
Dataset Generation (make_datasets)
This package contains scripts to generate new datasets for SWE-bench inference with custom prompts and issues. The datasets follow the format required for SWE-bench evaluation.
For detailed usage instructions, see the Make Datasets Guide.
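As a rough sketch only (the script name and flags below are assumptions based on the make_datasets package; the Make Datasets Guide is authoritative), a prompt dataset with oracle file context might be generated like this:
# Hypothetical invocation -- verify the script name and flags against the Make Datasets Guide
python -m swebench.inference.make_datasets.create_text_dataset \
    --dataset_name_or_path princeton-nlp/SWE-bench \
    --output_dir ./base_datasets \
    --prompt_style style-3 \
    --file_source oracle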
Running API Inference (run_api.py)
This script runs inference on a dataset using either the OpenAI or Anthropic API. It sorts instances by length and continually writes outputs to a specified file, so the script can be stopped and restarted without losing progress.
# Example with Anthropic Claude
export ANTHROPIC_API_KEY=<your key>
python -m swebench.inference.run_api \
--dataset_name_or_path princeton-nlp/SWE-bench_oracle \
--model_name_or_path claude-2 \
--output_dir ./outputs
Parameters
- --dataset_name_or_path: HuggingFace dataset name or local path
- --model_name_or_path: Model name (e.g., "gpt-4", "claude-2")
- --output_dir: Directory to save model outputs
- --split: Dataset split to use (default: "test")
- --shard_id, --num_shards: Process only a portion of the data
- --model_args: Comma-separated key=value pairs (e.g., "temperature=0.2,top_p=0.95")
- --max_cost: Maximum spending limit for API calls
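For example, the flags above can be combined to process a single shard with custom sampling settings and a spending cap (the model name, shard counts, and cost limit are illustrative):
# Process shard 0 of 4 with custom sampling parameters and a cost ceiling
python -m swebench.inference.run_api \
    --dataset_name_or_path princeton-nlp/SWE-bench_oracle \
    --model_name_or_path gpt-4 \
    --output_dir ./outputs \
    --shard_id 0 \
    --num_shards 4 \
    --model_args "temperature=0.2,top_p=0.95" \
    --max_cost 50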
Running Local Inference (run_llama.py)
This script is similar to run_api.py, but is designed to run inference using Llama models locally. You can use it with SWE-Llama or other compatible models.
python -m swebench.inference.run_llama \
--dataset_path princeton-nlp/SWE-bench_oracle \
--model_name_or_path princeton-nlp/SWE-Llama-13b \
--output_dir ./outputs \
--temperature 0
Parameters
- --dataset_path: HuggingFace dataset name or local path
- --model_name_or_path: Local or HuggingFace model path
- --output_dir: Directory to save model outputs
- --split: Dataset split to use (default: "test")
- --shard_id, --num_shards: Process only a portion of the data
- --temperature: Sampling temperature (default: 0)
- --top_p: Top-p sampling parameter (default: 1)
- --peft_path: Path to PEFT adapter
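As an illustration of the flags above, the following processes one of two shards with a PEFT adapter layered on the base model (the adapter path is a placeholder):
# Run shard 0 of 2 with a PEFT (e.g., LoRA) adapter applied to the base model
python -m swebench.inference.run_llama \
    --dataset_path princeton-nlp/SWE-bench_oracle \
    --model_name_or_path princeton-nlp/SWE-Llama-13b \
    --peft_path ./path/to/peft_adapter \
    --output_dir ./outputs \
    --shard_id 0 \
    --num_shards 2 \
    --temperature 0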
Live Inference on GitHub Issues (run_live.py)
This tool allows you to apply SWE-bench models to real, open GitHub issues. It can be used to test models on new, unseen issues without the need for manual dataset creation.
export OPENAI_API_KEY=<your key>
python -m swebench.inference.run_live \
--model_name gpt-3.5-turbo-1106 \
--issue_url https://github.com/huggingface/transformers/issues/26706
Prerequisites
For live inference, you'll need to install additional dependencies:
- Pyserini: for BM25 retrieval
- Faiss: for vector search
Follow the installation instructions on their respective GitHub repositories:
- Pyserini: Installation Guide
- Faiss: Installation Guide
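As a starting point (these commands are assumptions; the official installation guides cover supported versions and Pyserini's Java requirements), both packages are available on PyPI:
# Illustrative installs only -- follow the official guides for version pinning
pip install pyserini
pip install faiss-cpu   # or faiss-gpu on machines with a compatible CUDA setup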
Output Format
All inference scripts produce outputs in a format compatible with the SWE-bench evaluation harness. The output contains the model's generated patch for each issue, which can then be evaluated using the evaluation harness.
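For example, the outputs are line-delimited JSON, so you can spot-check a record before evaluating. The file name, record keys, and harness flags below are assumptions based on typical SWE-bench setups; confirm them against your installed version:
# Inspect the first prediction record; expected keys typically include
# instance_id, model_name_or_path, and model_patch (the file name is a placeholder --
# the inference scripts name outputs after the model, dataset, and split)
head -n 1 ./outputs/predictions.jsonl | python -m json.tool

# Hypothetical evaluation call; flag names may vary between harness releases
python -m swebench.harness.run_evaluation \
    --dataset_name princeton-nlp/SWE-bench_oracle \
    --predictions_path ./outputs/predictions.jsonl \
    --run_id my_run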
Tips and Best Practices
- When running inference on large datasets, use sharding to split the workload (see the sketch after this list)
- For API models, monitor costs carefully and set appropriate --max_cost limits
- For local models, ensure you have sufficient GPU memory for the model size
- Save intermediate outputs frequently to avoid losing progress
- When running live inference, ensure your retrieval corpus is appropriate for the repository of the issue
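As a concrete sketch of the sharding tip above (the shard count, per-GPU assignment, and dataset are illustrative, and this assumes four visible GPUs):
# Launch four run_llama shards in parallel, one per GPU, then wait for all of them
for i in 0 1 2 3; do
    CUDA_VISIBLE_DEVICES=$i python -m swebench.inference.run_llama \
        --dataset_path princeton-nlp/SWE-bench_oracle \
        --model_name_or_path princeton-nlp/SWE-Llama-13b \
        --output_dir ./outputs \
        --shard_id $i \
        --num_shards 4 \
        --temperature 0 &
done
wait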