How to Evaluate the Output Quality of Your AI Agents

Once you have more than one AI coding agent running, a gut feeling about which ones are doing good work stops being enough. One agent's PRs keep getting sent back for rework, another burns hours on tasks a cheaper model could handle, a third never coordinates and steps on its teammates. Without a consistent measure, you cannot tell which agent to retune, downgrade, or trust with more autonomy.

Fleet answers this with a stateless 6-dimension evaluation: it scores each agent on task output, reliability, output quality, efficiency, collaboration, and cost. Critically, this evaluation is a separate system from Fleet's risk model. The 6 dimensions measure quality; a distinct risk model over operational signals drives auto-quarantine. This guide explains what each evaluation dimension measures, how to read the scores, and what to do when one comes back low.

Before you start

  • Fleet installed and initialized with agents that have run real tasks
  • The brain daemon available to run (`fleet brain start`) for ongoing scoring
  • Some history of agent activity — evaluation needs work to score
  • Access to `fleet log` to correlate scores with concrete actions

1. Know the six evaluation dimensions

Fleet's evaluation is stateless and scores six dimensions. Task output: did the agent actually produce the work the task asked for. Reliability: does it complete consistently or fail and stall. Output quality: how good is the work — does it pass review, or bounce back. Efficiency: how much run time and rework did it take to get there. Collaboration: does it coordinate through Fabric and hand work off cleanly, or work in isolation. Cost: the run-time cost of its work. These six are evaluation only — they are not the risk model.

# The 6 evaluation dimensions:
#   1. task output    - did it produce what was asked
#   2. reliability    - consistent completion vs. stalls
#   3. output quality  - does the work pass review
#   4. efficiency     - run time + rework to get there
#   5. collaboration  - coordinates via Fabric, clean hand-offs
#   6. cost           - run-time cost of the work
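The later steps of this guide handle the six dimensions in pairs, and that grouping can be sketched as a small lookup. This is only an illustration: `next_step_for` and the hyphenated dimension labels are hypothetical names chosen for this sketch, not identifiers that Fleet emits.

```shell
#!/usr/bin/env bash
# Hypothetical helper: maps a low-scoring dimension to the follow-up this guide suggests.
# The dimension labels are this sketch's own spelling, not Fleet output.
next_step_for() {
  case "$1" in
    task-output|reliability)
      echo "read the decision log and tighten the task spec" ;;
    output-quality|efficiency)
      echo "count review round-trips and revisit the model assignment" ;;
    collaboration|cost)
      echo "confirm chain events are published and cross-check the run-time budget" ;;
    *)
      echo "unknown dimension: $1" >&2
      return 1 ;;
  esac
}

next_step_for efficiency
```

Keeping the mapping in one place makes it easy to see that every dimension has a defined follow-up.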

2. Run the evaluation

Use `fleet eval` to score your agents across the six dimensions. The evaluation is stateless: it computes scores from observed activity each time it runs, rather than carrying hidden state between runs. Start the brain daemon to have evaluation run continuously alongside the rest of Fleet's monitoring.

# Score agents on demand:
fleet eval

# Or run the brain to keep evaluation continuous:
fleet brain start

3. Keep evaluation and risk separate in your head

This is the single most important distinction. The 6-dimension evaluation tells you how good an agent's work is. A separate risk model tells you whether the agent is operationally dangerous: it is a logistic-regression model over operational signals such as error rate, restarts, blocked tasks, silent hours, uptime, eval score, and SLA compliance. That risk model, not the evaluation, drives auto-quarantine when risk hits critical. A high-quality agent can still become a quarantine risk if it starts looping; a low-quality agent is not automatically a risk. Read them as two numbers, never as one.

# Two separate systems, surfaced together:
fleet brain insights
#   - 6-dimension evaluation -> quality
#   - risk model            -> drives auto-quarantine at critical
# Never call it '6-dimension risk scoring' — that conflates them.
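To make "a logistic-regression model over operational signals" concrete, here is a minimal sketch of how such a model turns signals into a probability: a weighted sum passed through a sigmoid. The function name `risk_probability`, the bias, and every weight below are invented for illustration; Fleet's actual coefficients and its critical threshold are not given in this guide.

```shell
#!/usr/bin/env bash
# risk_probability BIAS W1 X1 [W2 X2 ...]
# Prints sigmoid(BIAS + W1*X1 + W2*X2 + ...) to two decimals.
# All numbers passed in are hypothetical; this is not Fleet's model.
risk_probability() {
  printf '%s\n' "$@" | awk '
    NR == 1     { z = $1; next }   # first value is the bias term
    NR % 2 == 0 { w = $1; next }   # even lines carry a weight
                { z += w * $1 }    # the following line carries that signal value
    END         { printf "%.2f\n", 1 / (1 + exp(-z)) }'
}

# Hypothetical inputs: bias -3.0, error rate 0.5 (weight 4.0), 2 restarts (weight 0.5)
risk_probability -3.0 4.0 0.5 0.5 2
```

Note that nothing in this calculation looks at the six evaluation dimensions individually, which is why a quality score and a risk score can move independently.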

4. Act on a low task-output or reliability score

Low task output or reliability usually means the agent is not finishing what it starts. The cause is rarely the model — it is more often ambiguous task input or a GitHub state the agent could not handle. Pull the agent's decision log and look for stalls, repeated failed actions, or tasks abandoned partway. Tighten the task specification before blaming the agent.

# Investigate what a low-scoring agent actually did:
fleet log --agent backend-dev --since 7d --type decision
# Look for: abandoned tasks, repeated failures, no completion event

5. Act on a low output-quality or efficiency score

Low output quality shows up as PRs that bounce back through pr_changes_requested round after round. Low efficiency means the agent gets there eventually but burns run time doing it. These are the signals that an agent may be under-modeled for its task — a procedural agent on the wrong model, or a complex task handed to a model that cannot reason through it. Consider the model assignment before concluding the task spec is at fault.

# Count review round-trips for a quality read:
fleet log --agent backend-dev --since 7d | grep pr_changes_requested

# Pair with run-time utilization for the efficiency read:
fleet agent budget
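If you want a number rather than a screenful of matches, the `grep` above can be wrapped in a small counter. The sketch below reads log text on standard input and compares the count of `pr_changes_requested` lines with a limit you choose. The function name `flag_rework`, the limit, and the sample lines are hypothetical; the real `fleet log` output format is not shown in this guide, so the sketch only assumes the event name appears somewhere on the line.

```shell
#!/usr/bin/env bash
# flag_rework LIMIT  (reads log text on stdin)
# Counts lines mentioning pr_changes_requested and compares against LIMIT.
flag_rework() {
  local limit="$1" count
  count=$(grep -c 'pr_changes_requested' || true)  # grep exits 1 on zero matches
  if [ "$count" -gt "$limit" ]; then
    echo "rework-heavy ($count change requests)"
  else
    echo "ok ($count change requests)"
  fi
}

# Hypothetical sample lines standing in for real log output:
printf '%s\n' \
  'sample: pr_changes_requested' \
  'sample: pr_approved' \
  'sample: pr_changes_requested' | flag_rework 1

# Against a real agent, pipe the log in instead:
#   fleet log --agent backend-dev --since 7d | flag_rework 3
```

The limit is a team decision, not something Fleet defines; pick it from what a normal week of review looks like for your agents.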

6. Act on a low collaboration or cost score

Low collaboration means the agent is not publishing the Fabric events that the reactive chain depends on: it finishes work but never signals, so the next stage never fires. Check that it is emitting the expected events (`pr_created`, `pr_approved`, and so on). A low cost score means the agent's run time is not buying enough value relative to its output; consider a cheaper model or a tighter run-time budget.

# Confirm the agent is publishing chain events:
fleet log --agent tech-lead --type decision --since 7d
# Missing pr_approved / pr_created events = a collaboration gap

# Cross-check cost against run time:
fleet agent budget
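Checking for missing chain events by eye gets tedious over a week of logs. A rough sketch of automating it: read the log text once, then report each expected event name that never appears. The function name `missing_chain_events` and the sample line are hypothetical, and the sketch assumes only that event names appear literally in the log text.

```shell
#!/usr/bin/env bash
# missing_chain_events EVENT...  (reads log text on stdin)
# Prints each expected event name that does not occur anywhere in the log.
missing_chain_events() {
  local log event
  log=$(cat)
  for event in "$@"; do
    printf '%s\n' "$log" | grep -qF -- "$event" || echo "$event"
  done
}

# Hypothetical sample line standing in for real log output:
printf '%s\n' 'sample: pr_created' | missing_chain_events pr_created pr_approved

# Against a real agent:
#   fleet log --agent tech-lead --type decision --since 7d \
#     | missing_chain_events pr_created pr_approved
```

An empty result means every listed event showed up at least once; it does not prove the events fired at the right time, so treat it as a first filter.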

Common pitfalls

  • Never describe Fleet as having '6-dimension risk scoring'. The 6 dimensions are the evaluation (quality). The risk model is a separate logistic-regression model over operational signals, and it is the risk model that drives auto-quarantine. Conflating them leads to wrong conclusions about why an agent was quarantined.
  • Evaluation needs activity to score. An agent that has barely run will produce thin or noisy scores. Give an agent a real body of work before drawing conclusions from its evaluation.
  • A low cost or efficiency score is a prompt to investigate, not an automatic verdict. An agent doing genuinely hard work will legitimately consume more run time. Read the cost dimension alongside output quality, not in isolation.
  • Quarantine fires at the critical risk level and is not a configurable threshold you set against the evaluation score. Do not expect a low evaluation score to trigger quarantine on its own — that is the risk model's job, against operational signals.
  • Evaluation is stateless: it reflects the activity it can observe at the time it runs. A bad week still counts as observed activity, so it can drag scores down even after the underlying cause is fixed. Re-run evaluation once the agent has done fresh work after a fix rather than trusting an old score.

When Fleet is the right tool

Fleet's evaluation is worth leaning on once you have several agents and need an objective basis for tuning them — which to retune, which to downgrade to a cheaper model, which to trust with more autonomy. It is most useful as a trend over a real body of work, not a single snapshot. If you are still getting one agent working on one task, evaluation is premature; watch the agent directly first. And remember its scope: evaluation measures work quality across six dimensions, while the separate risk model is what keeps a dangerously misbehaving agent in check.

Frequently asked questions

What does Fleet's agent evaluation measure?

Six dimensions: task output, reliability, output quality, efficiency, collaboration, and cost. It is a stateless score computed from observed agent activity, surfaced via `fleet eval` and the brain daemon.

Is the 6-dimension evaluation the same as Fleet's risk scoring?

No — they are two separate systems. The 6-dimension evaluation measures quality. A separate logistic-regression risk model over operational signals drives auto-quarantine when risk reaches critical. Never call it '6-dimension risk scoring'.

Will a low evaluation score get an agent quarantined?

Not on its own. Quarantine is driven by the separate risk model when risk reaches the critical level, not by the evaluation score. A low-quality agent and a high-risk agent are distinct conditions.

How do I act on a low output-quality score?

Look at how many review round-trips its PRs take in `fleet log` (`pr_changes_requested` events), then check whether its model assignment fits the task. Under-modeling for complex work is a common cause of repeated rework.
