Inference API Reference

This section documents the various tools available for running inference with SWE-bench datasets.

Overview

The inference module provides tools to generate model completions for SWE-bench tasks using:

  • API-based models (OpenAI, Anthropic)
  • Local models (SWE-Llama)
  • Live inference on open GitHub issues

In particular, we provide the following important scripts and sub-packages:

  • make_datasets: Contains scripts to generate new datasets for SWE-bench inference with your own prompts and issues
  • run_api.py: Generates completions using API models (OpenAI, Anthropic) for a given dataset
  • run_llama.py: Runs inference using Llama models (e.g., SWE-Llama)
  • run_live.py: Generates model completions for new issues on GitHub in real time

Installation

Depending on your inference needs, you can install different dependency sets:

# For dataset generation and API-based inference
pip install -e ".[make_datasets]"

# For local model inference (requires GPU with CUDA)
pip install -e ".[inference]"

Available Tools

Dataset Generation (make_datasets)

This package contains scripts to generate new datasets for SWE-bench inference with custom prompts and issues. The datasets follow the format required for SWE-bench evaluation.

For detailed usage instructions, see the Make Datasets Guide.
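
As a rough sketch, a text dataset can be built with the create_text_dataset entry point; the flag values below are illustrative and should be checked against the Make Datasets Guide before use:

# Sketch: build an "oracle" text dataset from SWE-bench with a built-in prompt style.
# Flag values are illustrative; see the Make Datasets Guide for the full option list.
python -m swebench.inference.make_datasets.create_text_dataset \
    --dataset_name_or_path princeton-nlp/SWE-bench \
    --output_dir ./base_datasets \
    --prompt_style style-3 \
    --file_source oracle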

Running API Inference (run_api.py)

This script runs inference on a dataset using either the OpenAI or Anthropic API. It sorts instances by length and continually writes outputs to a specified file, so the script can be stopped and restarted without losing progress.

# Example with Anthropic Claude
export ANTHROPIC_API_KEY=<your key>
python -m swebench.inference.run_api \
    --dataset_name_or_path princeton-nlp/SWE-bench_oracle \
    --model_name_or_path claude-2 \
    --output_dir ./outputs

Parameters

  • --dataset_name_or_path: HuggingFace dataset name or local path
  • --model_name_or_path: Model name (e.g., "gpt-4", "claude-2")
  • --output_dir: Directory to save model outputs
  • --split: Dataset split to use (default: "test")
  • --shard_id, --num_shards: Process only a portion of the dataset
  • --model_args: Comma-separated key=value pairs (e.g., "temperature=0.2,top_p=0.95")
  • --max_cost: Maximum spending limit for API calls (see the example below)
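
For example, the sampling and cost flags can be combined as follows (a sketch; the spending limit is assumed to be in US dollars, and values should be adapted to your budget):

# Sketch: run GPT-4 with custom sampling parameters and a spending cap.
export OPENAI_API_KEY=<your key>
python -m swebench.inference.run_api \
    --dataset_name_or_path princeton-nlp/SWE-bench_oracle \
    --model_name_or_path gpt-4 \
    --output_dir ./outputs \
    --model_args "temperature=0.2,top_p=0.95" \
    --max_cost 50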

Running Local Inference (run_llama.py)

This script is similar to run_api.py but is designed to run inference locally with Llama-family models. You can use it with SWE-Llama or other compatible models.

python -m swebench.inference.run_llama \
    --dataset_path princeton-nlp/SWE-bench_oracle \
    --model_name_or_path princeton-nlp/SWE-Llama-13b \
    --output_dir ./outputs \
    --temperature 0

Parameters

  • --dataset_path: HuggingFace dataset name or local path
  • --model_name_or_path: Local or HuggingFace model path
  • --output_dir: Directory to save model outputs
  • --split: Dataset split to use (default: "test")
  • --shard_id, --num_shards: Process only a portion of the dataset
  • --temperature: Sampling temperature (default: 0)
  • --top_p: Top-p sampling parameter (default: 1)
  • --peft_path: Path to a PEFT adapter (see the example below)
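
If you trained a PEFT adapter rather than a full checkpoint, --peft_path lets you load it on top of a base model. The base model below is illustrative and the adapter path is a placeholder:

# Sketch: base CodeLlama checkpoint plus a PEFT adapter, processing one of two shards.
python -m swebench.inference.run_llama \
    --dataset_path princeton-nlp/SWE-bench_oracle \
    --model_name_or_path codellama/CodeLlama-13b-hf \
    --peft_path <path or hub id of your adapter> \
    --output_dir ./outputs \
    --shard_id 0 \
    --num_shards 2 \
    --temperature 0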

Live Inference on GitHub Issues (run_live.py)

This tool allows you to apply SWE-bench models to real, open GitHub issues. It can be used to test models on new, unseen issues without the need for manual dataset creation.

export OPENAI_API_KEY=<your key>
python -m swebench.inference.run_live \
    --model_name gpt-3.5-turbo-1106 \
    --issue_url https://github.com/huggingface/transformers/issues/26706

Prerequisites

For live inference, you'll need to install additional dependencies:

  • Pyserini: for BM25 retrieval
  • Faiss: for vector search

Follow the installation instructions on their respective GitHub repositories:

  • Pyserini: Installation Guide
  • Faiss: Installation Guide
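
One possible setup is sketched below; it assumes a Java 11+ JDK is available for Pyserini and uses the CPU build of Faiss. Check the official installation guides for your platform before relying on it:

# Sketch: install retrieval dependencies for live inference.
# Pyserini needs a Java 11+ JDK on the PATH; faiss-cpu can be swapped for a GPU build
# (Faiss GPU builds are typically installed via conda).
pip install pyserini faiss-cpu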

Output Format

All inference scripts produce outputs in a format compatible with the SWE-bench evaluation harness. The output contains the model's generated patch for each issue, which can then be evaluated using the evaluation harness.
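
The outputs are written as JSON Lines, one record per instance. The exact file name depends on the model, dataset, and split, and each record is expected to include at least the instance identifier, the model name, and the generated patch; verify the field names against your own output before passing it to the harness:

# Peek at the first prediction; each line is a JSON object with fields such as
# instance_id, model_name_or_path, and model_patch (the generated diff).
head -n 1 ./outputs/<output file>.jsonl | python -m json.tool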

Tips and Best Practices

  • When running inference on large datasets, use sharding to split the workload (see the sketch after this list)
  • For API models, monitor costs carefully and set appropriate --max_cost limits
  • For local models, ensure you have sufficient GPU memory for the model size
  • Save intermediate outputs frequently to avoid losing progress
  • When running live inference, ensure your retrieval corpus is appropriate for the repository of the issue
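
As a sketch of the sharding tip above, several shards can be launched as parallel background processes. This assumes each shard writes its own output file; verify the output naming in run_api.py before running shards concurrently against the same directory:

# Sketch: split an API run into four shards and process them in parallel,
# e.g. one shard per background process or per machine.
for i in 0 1 2 3; do
    python -m swebench.inference.run_api \
        --dataset_name_or_path princeton-nlp/SWE-bench_oracle \
        --model_name_or_path claude-2 \
        --output_dir ./outputs \
        --shard_id "$i" \
        --num_shards 4 &
done
wait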