All official submissions to the SWE-bench leaderboard are maintained at SWE-bench/experiments.

Submit to SWE-bench Leaderboard

If you are interested in submitting your model to the SWE-bench Leaderboard, please do the following:

  1. Fork the SWE-bench/experiments repository.
  2. Clone the repository. Due to this repository's large diff history, consider using `git clone --depth 1` if cloning takes too long.
  3. Under the split that you evaluate on (evaluation/lite/ or evaluation/test/), create a new folder named with the submission date and the model name (e.g. 20240415_sweagent_gpt4).
  4. Within the folder, please include the following files:
    • all_preds.jsonl: Model predictions
    • logs/: SWE-bench evaluation artifacts dump
      • The evaluation artifacts consist of one folder per task instance: 300 for Lite, 2294 for Test. Each folder (e.g. astropy__astropy-1234) contains:
        • eval.sh: The evaluation script
        • patch.diff: The model's generated prediction
        • report.json: Summary of evaluation outcomes for this instance
        • run_instance.log: A log of SWE-bench evaluation steps
        • test_output.txt: The output of running eval.sh on patch.diff
      • NOTE: You shouldn't have to create any of these files. They should automatically be generated by SWE-bench evaluation.
    • metadata.yaml: Metadata controlling how your result is shown on the website. Please include the following fields:
      • name: The name of your leaderboard entry
      • oss: true if your system is open-source
      • site: URL/link to more information about your system
      • verified: false (See below for results verification)
    • trajs/: Reasoning trace reflecting how your system solved the problem
      • Submit one reasoning trace per task instance. The reasoning trace should show all of the steps your system took while solving the task. If your system outputs thoughts or comments during operation, they should be included as well.
      • The reasoning trace can be represented with any text based file format (e.g. md, json, yaml)
      • Ensure the task instance ID is in the name of the corresponding reasoning trace file.
      • For an example, see SWE-agent + GPT 4 Turbo
    • README.md: Include anything you'd like to share about your model here!
  5. Run python -m analysis.get_results evaluation/<split>/<date + model>
  6. Create a pull request to the SWE-bench/experiments repository with the new folder.
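
Before opening the pull request, it can help to confirm that the folder contains everything listed in step 4. The following is a minimal sketch, not part of the SWE-bench tooling: the function name `missing_entries` and the constant `REQUIRED_ENTRIES` are hypothetical, and the check only looks at the top-level entries named above.

```python
from pathlib import Path

# Top-level entries listed in step 4; names ending in "/" are directories.
# This list mirrors the README text and is not read from any official schema.
REQUIRED_ENTRIES = ["all_preds.jsonl", "logs/", "metadata.yaml", "trajs/", "README.md"]


def missing_entries(submission_dir):
    """Return the required entries that are absent from submission_dir."""
    root = Path(submission_dir)
    missing = []
    for entry in REQUIRED_ENTRIES:
        path = root / entry.rstrip("/")
        # Directories and files are checked separately so that, for example,
        # a file accidentally named "logs" is still reported as missing.
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

Calling `missing_entries("evaluation/lite/20240415_sweagent_gpt4")` returns an empty list when all five entries are present, and otherwise the names that still need to be added.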

You can refer to this tutorial for a quick overview of how to evaluate your model on SWE-bench.

Submission Guidelines

Please note that a submission to the SWE-bench [Lite] leaderboard is eligible only if it satisfies these criteria:

  1. The use of the hints_text field is not allowed. See our explanation here.
  2. The result should be pass@1. There should be one execution log per task instance, for every task instance in the split (300 for Lite, 2294 for Test).
  3. The result should not be in the "Oracle" retrieval setting. The agent cannot be told the correct files to edit, where "correct" refers to the files modified by the reference solution patch.
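
The second criterion can be sanity-checked locally by counting the per-instance folders under logs/ and looking for the files the evaluation normally writes. This is a rough sketch under stated assumptions: `check_logs`, `SPLIT_SIZES`, and `EXPECTED_FILES` are hypothetical names, the split sizes and file names are taken from the folder description earlier in this document, and a missing file is a prompt to look closer rather than proof of an invalid submission.

```python
from pathlib import Path

# Split sizes and per-instance file names as described in this README.
SPLIT_SIZES = {"lite": 300, "test": 2294}
EXPECTED_FILES = {
    "eval.sh",
    "patch.diff",
    "report.json",
    "run_instance.log",
    "test_output.txt",
}


def check_logs(logs_dir, split):
    """Return (count_ok, incomplete) for the instance folders in logs_dir.

    count_ok is True when the number of instance folders equals the split size.
    incomplete maps a folder name to the sorted list of files it lacks.
    """
    instance_dirs = [p for p in Path(logs_dir).iterdir() if p.is_dir()]
    incomplete = {}
    for instance_dir in instance_dirs:
        present = {f.name for f in instance_dir.iterdir() if f.is_file()}
        missing = EXPECTED_FILES - present
        if missing:
            incomplete[instance_dir.name] = sorted(missing)
    return len(instance_dirs) == SPLIT_SIZES[split], incomplete
```

For example, `check_logs("evaluation/lite/20240415_sweagent_gpt4/logs", "lite")` reports whether exactly 300 instance folders exist and which of them, if any, lack one of the listed files.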

Verify Your Results

The Verified check ✓ indicates that we (the SWE-bench team) received access to the model and were able to reproduce the patch generations.

If you are interested in receiving the "verified" checkmark ✓ on your submission, please do the following:

  1. Create an issue
  2. In the issue, provide us instructions on how to run your model on SWE-bench.
  3. We will run your model on a random subset of SWE-bench and verify the results.

Reasoning Traces

(07/29/2024) We have updated the SWE-bench leaderboard submission criteria to require the inclusion of reasoning traces. The goal of this requirement is to provide the community with more insight into how cutting edge methods work without requiring a code release (although the latter is still highly encouraged!).

What is a reasoning trace?

A reasoning trace is a text-based file that describes the steps your system took to solve a task instance. It should provide a detailed account of the reasoning process that your system used to arrive at its solution.

We purposely do not prescribe a strict format for reasoning traces.

We do have some guidelines. The reasoning trace should be...

  • Human-readable.
  • Reflective of the intermediate steps your system took that led to the final solution.
  • Generated during the inference process, not post-hoc.

We do not require reasoning traces to...

  • Be in a specific file format (e.g. json, yaml, md)
  • Conform to a specific problem-solving style (e.g. agentic, procedural, etc.)

A simple solution to this? When running inference, simply log the intermediate output generated by your system. For an example, see SWE-agent + GPT-4 Turbo Trajectories. In short, our requirements for what a reasoning trace should look like are deliberately non-specific. We trust you to provide a detailed account of how your system solved the task instance.
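
One way to log intermediate output as it is produced is sketched below. It is only an illustration: the `TraceLogger` class, its `log` method, and the Markdown layout are hypothetical choices, not a format required by the leaderboard. The one property it preserves is that each step is written to disk when it happens rather than reconstructed afterwards.

```python
from pathlib import Path


class TraceLogger:
    """Append each intermediate step to <trajs_dir>/<instance_id>.md."""

    def __init__(self, trajs_dir, instance_id):
        # The instance ID is used as the file name so the trace can be
        # matched to its task instance later.
        self.path = Path(trajs_dir) / f"{instance_id}.md"
        self.path.parent.mkdir(parents=True, exist_ok=True)
        self.step = 0

    def log(self, kind, content):
        """Record one step, e.g. kind="thought", "action", or "observation"."""
        self.step += 1
        with self.path.open("a", encoding="utf-8") as f:
            f.write(f"## Step {self.step}: {kind}\n\n{content}\n\n")
```

A system would create one logger per task instance, for example `TraceLogger("trajs", "astropy__astropy-1234")`, and call `log(...)` each time it produces a thought, issues a command, or receives an observation.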

Why are we requiring it?

We believe that reasoning traces can provide valuable insights into how cutting edge methods work without requiring a code release.

As of this post (7/29/2024), we have received many submissions that have pushed the state of the art on SWE-bench, which is exciting to see!

However, we have also found that the top-performing submissions to SWE-bench typically have neither open-sourced their code nor been verified. We recognize that some leaderboard participants (1) would like to add an entry to SWE-bench but (2) do not want to release their code or proprietary system, which is completely understandable. On the other hand, given that open-source systems submitted to SWE-bench have propelled the development of closed-source participants, we would like to continue promoting development on SWE-bench as a community-level collaborative process.

Therefore, we believe that providing reasoning traces serves as a valuable compromise between these two groups.

What should I submit?

  1. Create a trajs/ folder in your submission directory.
  2. Within this folder, upload a reasoning trace per task instance that your system generated a prediction for.
  3. Make sure the naming convention of the reasoning trace file reflects the SWE-bench task instance it corresponds to. (e.g. astropy__astropy-1234.md)

We will review the reasoning traces you submit. We plan to only accept submissions with reasoning traces for the SWE-bench leaderboard.
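
The three steps above can be checked with a short script before submitting. This is a sketch under assumptions: it supposes that each line of all_preds.jsonl is a JSON object with an `instance_id` key, and the function name `instances_without_trace` is hypothetical. A trace counts as matching when its file name contains the instance ID and the ID is not immediately followed by another digit, so that astropy__astropy-1234.md is not mistaken for a trace of a shorter ID.

```python
import json
import re
from pathlib import Path


def instances_without_trace(preds_path, trajs_dir):
    """Return instance IDs in preds_path that have no matching file in trajs_dir."""
    instance_ids = []
    with open(preds_path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                # Assumption: every prediction record carries an "instance_id".
                instance_ids.append(json.loads(line)["instance_id"])

    trace_names = [p.name for p in Path(trajs_dir).iterdir() if p.is_file()]
    missing = []
    for instance_id in instance_ids:
        # The lookahead rejects a longer numeric ID that shares this prefix.
        pattern = re.compile(re.escape(instance_id) + r"(?!\d)")
        if not any(pattern.search(name) for name in trace_names):
            missing.append(instance_id)
    return missing
```

An empty result means every prediction has a corresponding reasoning trace. The file format of the traces is not inspected, in keeping with the guidelines above.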