
Site Reliability Engineer AI Agent (Template)

An SRE agent monitors service health, responds to on-call signals, and works to reduce toil through automation. In a fleet, it receives incident signals from your alerting system, investigates the likely cause, and either applies the appropriate runbook steps or escalates with a clear summary.
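The "apply a runbook or escalate" decision described above can be sketched as a small shell function. This is only an illustration of the control flow, not how Fleet or the agent implements triage: the function names (runbook_for, triage), the signal kinds, and the runbook paths are all hypothetical placeholders you would replace with your own.

```shell
#!/usr/bin/env bash
# Hypothetical triage sketch: map an incident signal to a runbook, or escalate.
# Signal kinds and runbook paths are illustrative placeholders, not part of
# Fleet or of any real alerting system.

# runbook_for KIND -> prints a runbook path, or returns 1 if none is known
runbook_for() {
  case "$1" in
    5xx_spike)          echo "runbooks/5xx-spike.md" ;;
    latency_regression) echo "runbooks/latency-regression.md" ;;
    pool_exhaustion)    echo "runbooks/db-pool-exhaustion.md" ;;
    *)                  return 1 ;;
  esac
}

# triage SERVICE KIND -> prints the chosen action as a single line
triage() {
  local service="$1" kind="$2" runbook
  if runbook="$(runbook_for "$kind")"; then
    # A known signal kind: follow the matching runbook.
    echo "apply ${runbook} to ${service}"
  else
    # No runbook matches: hand off to a human with a short summary.
    echo "escalate: no runbook for '${kind}' on ${service}"
  fi
}
```

For example, calling triage checkout-api 5xx_spike selects the 5xx runbook, while an unrecognized kind such as disk_full falls through to the escalation branch. Keeping the lookup in one function makes it easy to see which signals the agent is expected to handle on its own.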

SRE work requires deep familiarity with your specific service topology, SLO definitions, and runbooks. A role-specific prompt encodes these so the agent can navigate your observability tooling and apply the correct runbook rather than starting from scratch on each incident.

What this agent owns

  • Respond to incident signals with initial triage and impact assessment
  • Investigate error spikes, latency regressions, and availability drops using observability tooling
  • Apply runbook steps and document what was tried and what worked
  • Write post-incident summaries with root cause analysis and follow-up action items
  • Identify and automate repetitive operational tasks to reduce manual toil

Recommended model: Claude Opus

Incident diagnosis often requires reasoning across multiple correlated signals; Opus handles ambiguous multi-signal analysis more reliably.

Example tasks

  • Investigate a 5xx error spike and identify the upstream service causing it
  • Write a runbook for recovering from a database connection pool exhaustion event
  • Automate a manual log rotation task that runs weekly
  • Draft a post-incident report for a 45-minute API outage

# Create an agent from this template, then start it
$ fleet agent create --name sre --vendor claude-code --template <template-name>
$ fleet agent start sre

To find the exact template name, run:
$ fleet template list

Run this agent in your fleet

One binary. Five minutes. See every agent, coordinate every handoff, and keep a full audit trail of what your fleet did.