SWE-bench was introduced in a 2023 paper from Princeton and the University of Chicago. It addresses a gap in earlier coding benchmarks, which tested code generation on synthetic problems or simple interview-style questions. SWE-bench instead uses real-world software engineering tasks drawn from GitHub issues in open-source Python repositories: given an issue description and the repository state at the time the issue was filed, the agent must produce a patch that resolves the issue. A patch counts as a resolution when the tests tied to the original fix pass and the project's previously passing tests still pass.
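The pass/fail criterion described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's actual harness: the `TaskResult` class and both function names are hypothetical, and the two test groups mirror the "fail-to-pass" and "pass-to-pass" sets that SWE-bench associates with each task.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Test outcomes for one task after applying the agent's patch.

    Hypothetical structure for illustration; each dict maps a test name
    to whether that test passed.
    """
    fail_to_pass: dict[str, bool]  # tests that failed before the fix and should now pass
    pass_to_pass: dict[str, bool]  # tests that passed before the fix and must keep passing


def is_resolved(result: TaskResult) -> bool:
    """A task is resolved only if every test in both groups passes."""
    # Require at least one fail-to-pass test so an empty set is not a trivial success.
    if not result.fail_to_pass:
        return False
    return all(result.fail_to_pass.values()) and all(result.pass_to_pass.values())


def resolve_rate(results: list[TaskResult]) -> float:
    """Fraction of tasks resolved; 0.0 for an empty list."""
    if not results:
        return 0.0
    return sum(is_resolved(r) for r in results) / len(results)


if __name__ == "__main__":
    fixed = TaskResult({"test_bug": True}, {"test_existing": True})
    regressed = TaskResult({"test_bug": True}, {"test_existing": False})
    print(resolve_rate([fixed, regressed]))
```

Note that a patch which fixes the reported bug but breaks an existing test is scored as unresolved, which is why the second task in the usage example does not count.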
The benchmark became influential because it is harder to overfit to than synthetic benchmarks: the issues are varied, the codebases are large and complex, and the tests were written by the original project authors rather than the benchmark creators. Performance on SWE-bench is widely used as a proxy for the real-world capability of coding agents.
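A curated variant, SWE-bench Verified, released in 2024, is a human-validated subset in which annotators confirmed that each issue is well specified and that its tests fairly judge a correct fix. It is therefore a cleaner measure rather than a harder one. Top systems as of late 2024 resolve roughly 40-50% of SWE-bench Verified issues, whereas the original 2023 paper reported resolution rates in the low single digits on the full benchmark, illustrating the rapid pace of improvement in the field.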