Once you have more than one AI coding agent running, a gut feeling about which ones are doing good work stops being enough. One agent's PRs keep getting sent back for rework; another burns hours on tasks a cheaper model could handle; a third never coordinates and steps on its teammates. Without a consistent measure, you cannot tell which agent to retune, which to downgrade, and which to trust with more autonomy.
Fleet answers this with a stateless 6-dimension evaluation: it scores each agent on task output, reliability, output quality, efficiency, collaboration, and cost. Critically, this evaluation is separate from Fleet's risk model: the 6 dimensions measure quality, while the risk model works over operational signals and drives auto-quarantine. This guide explains what each evaluation dimension measures, how to read the scores, and what to do when one comes back low.
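To make the shape of the evaluation concrete, here is a minimal sketch of a six-dimension score record and a helper that finds the weakest dimension. It is an illustration only, not Fleet's API: the class name `EvaluationScores`, the field names, the helper `lowest_dimension`, and the assumption that every score is a float between 0 and 1 where higher is better are all hypothetical.

```python
from dataclasses import dataclass, fields


# Hypothetical record type: one field per evaluation dimension named above.
# Assumption: each score is a float in [0, 1] and higher is better.
@dataclass(frozen=True)
class EvaluationScores:
    task_output: float
    reliability: float
    output_quality: float
    efficiency: float
    collaboration: float
    cost: float


def lowest_dimension(scores: EvaluationScores) -> tuple[str, float]:
    """Return the (name, value) of the weakest dimension.

    This is a pure function of the scores passed in. It keeps no history
    and takes no risk-model inputs, mirroring the stateless evaluation
    and its separation from the risk model.
    """
    pairs = ((f.name, getattr(scores, f.name)) for f in fields(scores))
    return min(pairs, key=lambda pair: pair[1])


# Made-up numbers for illustration only.
example = EvaluationScores(
    task_output=0.9,
    reliability=0.8,
    output_quality=0.4,
    efficiency=0.7,
    collaboration=0.6,
    cost=0.75,
)
weakest_name, weakest_value = lowest_dimension(example)
```

In this sketch, `lowest_dimension(example)` picks out `output_quality`, which is the kind of signal that would tell you where to look first when a score comes back low.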