
AI Agents for On-Call and Incidents

On-call is one of the most disruptive demands on a developer. A page at 2 a.m. forces the responder to reconstruct context (what the service is, what the symptom is, what recently changed, where the logs live) before any diagnostic work can begin. Even experienced engineers spend significant time on this orientation before they can start working the problem.

Post-incident work has a similar problem: writing the post-mortem, identifying contributing factors, and filing follow-up tickets are important, but they get deprioritized once the immediate fire is out and everyone wants to get back to normal work.

How it works with an agent fleet

Fleet supports two incident-related agent patterns: a triage agent that helps orient responders to an ongoing incident, and a post-incident agent that drafts the post-mortem and files follow-up tickets.

# Dispatch the on-call triage agent with incident context
fleet task assign on-call-agent "P1 alert: elevated 5xx on /api/checkout, started 02:14 UTC, see #incident-live"

# The agent gathers context, checks recent deploys, and publishes findings
fleet log --agent on-call-agent --since 1h

The triage agent's prompt instructs the agent to check recent deploys, query logs, correlate what it finds with known issues, and publish a structured summary. The agent makes no changes; it only gathers information and surfaces it.
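
The dispatch command above takes a single free-text description of the incident. As a minimal sketch, assuming alerts arrive as separate fields, a small helper could assemble that description in a consistent shape. The function name `incident_context` and its argument order are hypothetical; only the `fleet task assign on-call-agent "..."` form comes from the example above.

```shell
# Hypothetical helper: build the one-line incident description
# that is passed to the triage agent.
# Arguments: severity, symptom, start time, incident channel.
incident_context() {
  printf '%s alert: %s, started %s, see %s' "$1" "$2" "$3" "$4"
}

# Usage (not executed here, since it needs the fleet binary):
#   fleet task assign on-call-agent \
#     "$(incident_context P1 'elevated 5xx on /api/checkout' '02:14 UTC' '#incident-live')"
```

Keeping the fields in a fixed order means every dispatch gives the agent the severity, symptom, start time, and channel in the same place.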

The fleet pattern

The triage agent is information-gathering only and cannot make production changes. It publishes a structured fabric event with what it found (recent changes, correlated errors, relevant runbook links), and the on-call human uses that as a starting point. The post-incident agent runs after resolution: it drafts the timeline from logs and fabric events, then opens a PR with the post-mortem document for human review.
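
The timeline-drafting step can be pictured as merging timestamped records from several sources into one ordered list. The sketch below is an illustration under stated assumptions, not how Fleet does it: it assumes each record is one line of the form `<ISO-8601 UTC timestamp> <source> <message>`, and the function name `build_timeline` is hypothetical.

```shell
# Hypothetical sketch: merge timestamped records into one ordered timeline.
# Input (stdin): lines of "<ISO-8601 UTC timestamp> <source> <message>".
# ISO-8601 timestamps in the same time zone sort correctly as plain text.
build_timeline() {
  LC_ALL=C sort -k1,1 | while read -r ts src msg; do
    printf '%s  [%s]  %s\n' "$ts" "$src" "$msg"
  done
}

# Usage: concatenate deploy records and alert records, then pipe them in:
#   cat deploys.txt alerts.txt | build_timeline
```

The file names in the usage comment are placeholders. The output is only a draft; the human reviewer still decides which entries belong in the post-mortem.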

Guardrails that matter here

  • The triage agent is read-only: it cannot execute changes in production environments
  • All post-mortem drafts are reviewed by a human before publication: the agent drafts, the on-call engineer finalizes
  • Follow-up tickets filed by the post-incident agent are labeled for human triage before they enter the active backlog
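
How Fleet enforces the read-only guardrail is not described here. As one possible extra layer, and purely as a sketch, a wrapper in the agent's environment could check each command against an allowlist before running it. The function name `is_read_only_command` and the allowlist entries are assumptions for illustration. String matching like this is not a security boundary on its own; read-only credentials remain the real control.

```shell
# Hypothetical guard: succeed only for commands on a read-only allowlist.
is_read_only_command() {
  cmd="$1"
  # Reject shell metacharacters that could chain or redirect a second command.
  case "$cmd" in
    *\;*|*\|*|*\&*|*\$*|*\<*|*\>*|*\`*) return 1 ;;
  esac
  # Allow a small set of inspection commands; everything else is refused.
  case "$cmd" in
    "kubectl logs "*|"kubectl get "*|"kubectl describe "*) return 0 ;;
    "git log"*|"git show "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage:
#   if is_read_only_command "$requested"; then
#     echo "allowed: $requested"
#   else
#     echo "refused: $requested" >&2
#   fi
```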

Who this is for

Engineering teams with on-call rotations who want faster orientation during incidents. Also useful for teams that consistently skip post-mortem documentation because the work is tedious after a stressful incident.

Frequently asked questions

Can the agent page the right person automatically?

The agent can publish fabric events that trigger other agents, or send notifications through whatever tools are available in its environment. Routing logic such as PagerDuty escalation policies stays in your existing on-call tooling, not in Fleet itself.

How does the triage agent access production logs?

The agent shells out to whatever log tooling you have available — CloudWatch, Datadog CLI, kubectl logs, etc. You configure the log access commands in the agent prompt and ensure the agent's environment has the appropriate credentials.
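
As a rough sketch of what "configuring the log access commands in the agent prompt" could look like, the helper below prints a prompt fragment listing the commands the agent may use for one service. The function name, the default namespace, and the `/ecs/<service>` log group naming are assumptions; `kubectl logs` and `aws logs tail` (AWS CLI v2) are used only with flags that those tools document.

```shell
# Hypothetical helper: print a prompt fragment with read-only log commands.
# Arguments: service name, Kubernetes namespace (defaults to "default").
log_access_prompt() {
  svc="$1"
  ns="${2:-default}"
  cat <<EOF
Log access (read-only). Use only these commands:
  kubectl logs deployment/${svc} --namespace ${ns} --since=1h --tail=500
  aws logs tail /ecs/${svc} --since 1h
EOF
}

# Usage: append the fragment to the rest of the agent prompt text.
#   log_access_prompt checkout prod
```

Bounding each command with `--since` and `--tail` keeps the agent from pulling unbounded log volumes during an incident.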
