SWE-bench

Name: Fleet
Author: Fleet

SWE-bench was introduced in a 2023 paper from Princeton and the University of Chicago. It addresses a gap in earlier coding benchmarks, which tested code generation on synthetic problems or simple interview-style questions. SWE-bench uses real-world software engineering tasks: given an issue description and the repository state at the time the issue was filed, the agent must produce a patch that resolves the issue and passes the project's test suite.
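The pass criterion can be sketched in a few lines. In the published dataset, each task lists the tests a correct fix should turn from failing to passing (`FAIL_TO_PASS`) and the tests that should keep passing (`PASS_TO_PASS`). The `is_resolved` helper, the results dictionary, and the test names below are illustrative assumptions, not the official evaluation harness.

```python
from typing import Dict, List


def is_resolved(
    fail_to_pass: List[str],
    pass_to_pass: List[str],
    results: Dict[str, bool],
) -> bool:
    """Return True if every listed test passed after the patch was applied.

    `results` maps a test name to True (passed) or False (failed).
    A test missing from `results` is treated as a failure.
    """
    required = fail_to_pass + pass_to_pass
    return all(results.get(name, False) for name in required)


# Hypothetical test names, for illustration only.
fail_to_pass = ["test_issue_is_fixed"]
pass_to_pass = ["test_existing_a", "test_existing_b"]

good_patch = {
    "test_issue_is_fixed": True,
    "test_existing_a": True,
    "test_existing_b": True,
}
regressing_patch = {
    "test_issue_is_fixed": True,
    "test_existing_a": False,  # the fix broke previously passing behavior
    "test_existing_b": True,
}

print(is_resolved(fail_to_pass, pass_to_pass, good_patch))        # True
print(is_resolved(fail_to_pass, pass_to_pass, regressing_patch))  # False
```

A patch that fixes the issue but breaks an existing test does not count as resolved, which is why the second call returns False.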

The benchmark became influential because it is harder to overfit to than synthetic benchmarks: the issues are varied, the codebases are complex, and the test suites are written by the original project authors rather than the benchmark creators. Performance on SWE-bench is widely used as a proxy for real-world coding agent capability.

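A curated variant, SWE-bench Verified, is a human-validated subset of 500 issues, released in August 2024, in which the issue is well specified and the ground-truth resolution is unambiguous. It is designed to be a more reliable measure than the full benchmark, not a harder one. Top models as of late 2024 resolve roughly 40-50% of SWE-bench Verified issues. When the original paper was published in late 2023, the best model it evaluated resolved under 2% of the full benchmark, which illustrates the rapid pace of improvement in the field.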

How this relates to Fleet

Fleet is not a coding model and does not claim a SWE-bench score. SWE-bench is relevant to Fleet users when selecting which underlying model to use for each agent role. A model's SWE-bench score is one of the most widely cited public signals of its real-world coding capability, so it can inform the Fleet per-agent model configuration.

Frequently asked questions

Is a high SWE-bench score a reliable predictor of agent performance in my codebase?

It is a useful signal, not a guarantee. SWE-bench and SWE-bench Verified draw their tasks from open-source Python projects; performance on a TypeScript monorepo or a Go microservices codebase may differ. The benchmark also measures single-agent performance on isolated tasks, not multi-agent coordination or performance under sustained load. Use it as a starting point for model selection, then measure on your actual workload.
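One way to measure on your own workload is to replay already-closed issues from your repository against each candidate model and compare resolve rates. The sketch below assumes you have already recorded, for each model and task, whether the model's patch passed your tests; the `resolve_rates` helper, the model names, and the issue IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def resolve_rates(outcomes: Iterable[Tuple[str, str, bool]]) -> Dict[str, float]:
    """Compute the fraction of attempted tasks each model resolved.

    Each outcome is a (model_name, task_id, resolved) tuple.
    """
    attempted = defaultdict(int)
    resolved = defaultdict(int)
    for model, _task_id, ok in outcomes:
        attempted[model] += 1
        if ok:
            resolved[model] += 1
    return {model: resolved[model] / attempted[model] for model in attempted}


# Hypothetical results from replaying four closed issues from your own repository.
outcomes = [
    ("model-a", "issue-101", True),
    ("model-a", "issue-102", False),
    ("model-a", "issue-103", True),
    ("model-a", "issue-104", True),
    ("model-b", "issue-101", True),
    ("model-b", "issue-102", False),
    ("model-b", "issue-103", False),
    ("model-b", "issue-104", False),
]

rates = resolve_rates(outcomes)
for model, rate in sorted(rates.items()):
    print(f"{model}: {rate:.0%}")
# model-a: 75%
# model-b: 25%
```

Four tasks are far too few to draw conclusions from; the example only shows the bookkeeping. A larger, representative sample of your own issues gives a more trustworthy comparison.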

Who maintains SWE-bench?

SWE-bench is maintained as an open research benchmark. The original paper authors have continued publishing updates, including SWE-bench Multimodal; SWE-bench Verified was released by OpenAI in collaboration with them. The leaderboard is publicly available and updated as new agent results are submitted.
