SWE-bench

SWE-bench is a benchmark for evaluating AI coding agents, consisting of real GitHub issues from open-source Python projects paired with the actual code changes needed to resolve them.

SWE-bench was introduced in a 2023 paper by researchers at Princeton University and the University of Chicago. It addresses a gap in earlier coding benchmarks, which tested code generation on synthetic problems or simple interview-style questions. SWE-bench instead uses real-world software engineering tasks: given an issue description and the repository state at the time the issue was filed, the agent must produce a patch that resolves the issue. A patch counts as a resolution only if the tests tied to the original fix now pass and the tests that passed before the change still pass.
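
The pass/fail criterion can be sketched in a few lines of Python. This is a minimal illustration, not the official evaluation harness: the `TaskInstance` class, the `is_resolved` function, and the test names are hypothetical, loosely modelled on the fail-to-pass and pass-to-pass test lists that SWE-bench records for each task.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInstance:
    """One benchmark task: an issue plus the tests used to grade a patch (hypothetical shape)."""
    instance_id: str
    fail_to_pass: frozenset[str]  # tests that fail before the fix and must pass after it
    pass_to_pass: frozenset[str]  # tests that pass before the fix and must keep passing


def is_resolved(task: TaskInstance, passed_tests: set[str]) -> bool:
    """Return True only if every required test passed on the patched repository."""
    required = task.fail_to_pass | task.pass_to_pass
    return required <= passed_tests  # subset check: nothing required is missing


# Made-up task and test names, for illustration only.
task = TaskInstance(
    instance_id="example__project-1",
    fail_to_pass=frozenset({"test_issue_is_fixed"}),
    pass_to_pass=frozenset({"test_existing_behaviour"}),
)

# The patch fixes the issue but breaks an existing test: not resolved.
fixed_but_regressed = is_resolved(task, {"test_issue_is_fixed"})

# The patch fixes the issue and keeps existing behaviour intact: resolved.
fixed_cleanly = is_resolved(task, {"test_issue_is_fixed", "test_existing_behaviour"})
```

The all-or-nothing check is the point of the sketch: a patch that fixes the reported issue but introduces a regression is scored the same as a patch that does nothing.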

The benchmark became influential because it is harder to overfit to than synthetic benchmarks: the issues are varied, the codebases are complex, and the tests are written by the original project authors rather than the benchmark creators. Performance on SWE-bench is widely used as a proxy for real-world coding agent capability.

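A curated variant, SWE-bench Verified, is a human-validated subset of 500 issues, filtered so that each issue is clearly specified and its tests fairly judge a correct fix. It is a more reliable measure than the full benchmark, not a harder one. Top models as of late 2024 resolve 40-50% of SWE-bench Verified issues, compared with low single-digit percentages on the original benchmark when it was introduced in 2023, illustrating the rapid pace of improvement in the field.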

How this relates to Fleet

Fleet is not a coding model and does not claim a SWE-bench score. SWE-bench matters to Fleet users when they choose which underlying model to use for each agent role. A model's SWE-bench score is one of the most widely cited public signals of real-world coding capability, so it is a reasonable input to Fleet's per-agent model configuration.

Frequently asked questions

Is a high SWE-bench score a reliable predictor of agent performance in my codebase?

It is a useful signal, not a guarantee. SWE-bench uses Python open-source projects; performance on a TypeScript monorepo or a Go microservices codebase may differ. The benchmark also measures single-agent performance on isolated tasks, not multi-agent coordination or performance under sustained load. Use it as a starting point for model selection, then measure on your actual workload.
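
Measuring on your own workload can be as simple as computing a resolve rate per model over a set of internal tasks. The sketch below assumes you already have a way to run each model on a task and judge the outcome. The `resolve_rates` function, the model names, and the task IDs are all hypothetical, and the results are made up for illustration.

```python
from collections import defaultdict


def resolve_rates(results: list[tuple[str, str, bool]]) -> dict[str, float]:
    """Map each model name to the fraction of attempted tasks it resolved.

    Each result is a (model_name, task_id, resolved) tuple.
    """
    attempted: dict[str, int] = defaultdict(int)
    resolved: dict[str, int] = defaultdict(int)
    for model, _task_id, ok in results:
        attempted[model] += 1
        if ok:
            resolved[model] += 1
    return {model: resolved[model] / attempted[model] for model in attempted}


# Made-up outcomes from an internal task set; not real benchmark data.
results = [
    ("model-a", "ts-monorepo-1", True),
    ("model-a", "ts-monorepo-2", False),
    ("model-a", "go-service-1", True),
    ("model-a", "go-service-2", True),
    ("model-b", "ts-monorepo-1", True),
    ("model-b", "ts-monorepo-2", False),
    ("model-b", "go-service-1", False),
    ("model-b", "go-service-2", False),
]

rates = resolve_rates(results)
best_model = max(rates, key=rates.get)  # model with the highest resolve rate
```

A handful of tasks like this is too small to be conclusive. In practice you would want enough tasks per language and per agent role that differences between models are not down to chance.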

Who maintains SWE-bench?

SWE-bench is maintained as an open research benchmark. The original paper's authors have continued to publish updates, including SWE-bench Multimodal, and SWE-bench Verified was produced in collaboration with OpenAI. The leaderboard is publicly available and is updated as new agent results are submitted.
