SWE-bench Multimodal

Do AI Systems Generalize to Visual Software Domains?

John Yang*, Carlos E. Jimenez*,
Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff,
Gabriel Synnaeve, Karthik Narasimhan, Diyi Yang, Sida I. Wang, Ofir Press

About

SWE-bench Multimodal is a dataset for evaluating AI systems on visual software engineering tasks. It contains 619 task instances from 17 popular JavaScript repositories, each featuring images that are crucial to solving the problem. The tasks cover a range of challenges, including UI glitches, map rendering problems, and data visualization bugs. SWE-bench Multimodal tests whether AI systems can handle the diverse, multimodal nature of modern software development.

Citation

@misc{yang2024swebenchmultimodalaisystems,
      title={SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?}, 
      author={John Yang and Carlos E. Jimenez and Alex L. Zhang and Kilian Lieret and Joyce Yang and Xindi Wu and Ori Press and Niklas Muennighoff and Gabriel Synnaeve and Karthik R. Narasimhan and Diyi Yang and Sida I. Wang and Ofir Press},
      year={2024},
      eprint={2410.03859},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.03859},
}

Disclaimer: SWE-bench Multimodal is for research purposes only. Models trained and evaluated on SWE-bench Multimodal can produce unexpected results. We are not responsible for any damages caused by the use of SWE-bench Multimodal, including, but not limited to, any loss of profit, data, or use of data.

Usage: If you would like to use this website template for your own leaderboard, please send Carlos & John an email requesting permission. If granted, please make sure to acknowledge the SWE-bench team and link to this leaderboard on the home page of the website.

Correspondence to: carlosej@princeton.edu, johnby@stanford.edu

Leaderboard

Model | % Resolved | Date | Logs | Trajs | Site
🥇 🤠 ✅ SWE-agent Multimodal + GPT-4o (2024-08-06) | 12.20 | 2024-10-06 | - | - | 🔗
🥈 🤠 ✅ SWE-agent + Claude 3.5 Sonnet | 12.20 | 2024-10-06 | - | - | 🔗
🥉 🤠 ✅ SWE-agent + GPT-4o (2024-08-06) | 12.00 | 2024-10-06 | - | - | 🔗
🤠 ✅ SWE-agent JavaScript + Claude 3.5 Sonnet | 12.00 | 2024-10-06 | - | - | 🔗
🤠 ✅ SWE-agent Multimodal + Claude 3.5 Sonnet | 11.40 | 2024-10-06 | - | - | 🔗
🤠 ✅ SWE-agent JavaScript + GPT-4o (2024-08-06) | 9.20 | 2024-10-06 | - | - | 🔗
🤠 ✅ Agentless + Claude 3.5 Sonnet | 6.20 | 2024-10-06 | - | - | 🔗
🤠 ✅ RAG + GPT-4o (2024-08-06) | 6.00 | 2024-10-06 | - | - | 🔗
🤠 ✅ RAG + Claude 3.5 Sonnet | 5.00 | 2024-10-06 | - | - | 🔗
🤠 ✅ Agentless + GPT-4o (2024-08-06) | 3.10 | 2024-10-06 | - | - | 🔗
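
As a reading aid for the % Resolved column, here is a minimal sketch of how such a percentage could be computed, assuming it is the number of resolved task instances divided by the number of instances evaluated. The function name `percent_resolved` and the counts used below are hypothetical and chosen only for illustration; they are not taken from the leaderboard or from the SWE-bench evaluation code.

```python
def percent_resolved(resolved: int, total: int) -> float:
    """Share of task instances resolved, as a percentage rounded to two decimals.

    Assumption: the metric is resolved / total * 100. The page itself does not
    spell out the denominator, so treat this as an illustrative definition.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= resolved <= total:
        raise ValueError("resolved must be between 0 and total")
    return round(100.0 * resolved / total, 2)


# Hypothetical counts, not leaderboard data.
print(f"{percent_resolved(25, 200):.2f}")
```

Rounding to two decimals matches the precision shown in the leaderboard, so a value computed this way can be compared directly against the listed scores.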