SWE-bench Multimodal

Do AI Systems Generalize to Visual Software Domains?

John Yang*, Carlos E. Jimenez*, Alex L. Zhang, Kilian Lieret, Joyce Yang, Xindi Wu, Ori Press, Niklas Muennighoff, Gabriel Synnaeve, Karthik Narasimhan, Diyi Yang, Sida I. Wang, Ofir Press
*Equal contribution


Overview

SWE-bench Multimodal augments the original SWE-bench benchmark with 517 issues that contain visual elements such as:

Screenshots of bugs or interface issues
Design mockups or wireframes
Diagrams explaining desired functionality
Error messages with visual context

This extension evaluates models' ability to interpret and act on information presented in both textual and visual formats.

Correspondence

For questions about SWE-bench Multimodal, please contact:

{johnby@stanford.edu, carlosej@princeton.edu}

Citation

If you use SWE-bench Multimodal in your research, please cite our paper:

@inproceedings{yang2025swebench,
    title={SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?},
    author={John Yang and Carlos E Jimenez and Alex L Zhang and Kilian Lieret and Joyce Yang and Xindi Wu and Ori Press and Niklas Muennighoff and Gabriel Synnaeve and Karthik R Narasimhan and Diyi Yang and Sida Wang and Ofir Press},
    booktitle={The Thirteenth International Conference on Learning Representations},
    year={2025},
    url={https://openreview.net/forum?id=riTiq3i21b}
}

Yang, J., Jimenez, C. E., Zhang, A. L., Lieret, K., Yang, J., Wu, X., Press, O., Muennighoff, N., Synnaeve, G., Narasimhan, K. R., Yang, D., Wang, S. I., & Press, O. (2024). SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains? arXiv preprint arXiv:2410.03859.

Yang, John, et al. "SWE-bench Multimodal: Do AI Systems Generalize to Visual Software Domains?" arXiv preprint arXiv:2410.03859 (2024).