Use case

AI Workflows for Runbook Maintenance

A stale runbook is worse than no runbook: at 3am, step four references a dashboard that was renamed two quarters ago, and the on-call engineer burns twenty minutes discovering the docs are lying before falling back to tribal knowledge. Runbooks decay because verifying them is nobody's job — they're only read during incidents, which is exactly when nobody can fix them.

The sources of truth (deploy configs, service docs, alert definitions) keep moving; the runbooks that reference them don't.

How it works with an agent fleet

A scheduled Fleet workflow audits each runbook against the current state of its referenced sources, drafts corrections, and routes them to the on-call lead for approval, so decay gets caught on a Tuesday morning instead of during an outage.

genflows:
  - name: runbook-audit
    schedule: "0 10 * * 2"   # cron: 10:00 every Tuesday
    steps:
      - {name: audit, prompt: "Audit this runbook against the corpus: flag steps referencing renamed/removed services, dashboards, or commands, and draft the corrected version.", corpus: ["docs/runbooks/*.md", "deploy/**/*.yaml", "docs/services/*.md"], for_each: "docs/runbooks/*.md", kind: report, out: corrected.md}
      - {name: verify, prompt: "Check each correction against the sources. Flag any fix that itself can't be verified from the corpus.", depends_on: [audit], kind: review, out: review.md}
      - {name: oncall-lead-ok, depends_on: [audit, verify], kind: approval, out: decision.md}

The fan-out audits every runbook each week; fingerprinting means runbooks whose sources didn't change skip instantly, so the steady-state run is small. The lead approves corrections with the verification flags in view.
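
To make the skip behavior concrete, here is a minimal sketch of content fingerprinting in Python. It illustrates the general technique only and is not Fleet's actual implementation: the function names `fingerprint` and `needs_audit`, and the idea of keeping one digest per runbook in a dictionary, are assumptions made for this example.

```python
import hashlib


def fingerprint(runbook_text: str, sources: dict[str, str]) -> str:
    """Hash a runbook together with the sources it is audited against."""
    h = hashlib.sha256()
    h.update(runbook_text.encode("utf-8"))
    # Sort by path so the digest does not depend on dict insertion order.
    for path in sorted(sources):
        # NUL separators keep path and content boundaries unambiguous.
        h.update(b"\0" + path.encode("utf-8") + b"\0")
        h.update(sources[path].encode("utf-8"))
    return h.hexdigest()


def needs_audit(name: str, current: str, seen: dict[str, str]) -> bool:
    """Return True (and record the digest) only when the inputs changed."""
    if seen.get(name) == current:
        return False
    seen[name] = current
    return True


# Hypothetical runbook and source contents, for illustration only.
seen: dict[str, str] = {}
fp = fingerprint("restart api-gateway", {"deploy/api.yaml": "name: api-gateway"})
first = needs_audit("failover.md", fp, seen)    # True: never seen before
second = needs_audit("failover.md", fp, seen)   # False: nothing changed
```

Under these assumptions, editing either the runbook or any source file changes the digest, so only that runbook is audited again on the next run.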

The fleet pattern

Schedule → fan-out audit of runbooks against live config/docs → verification review → on-call lead approval. The runbooks' freshness becomes a property of the system rather than a hope.
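
The ordering implied by the workflow's `depends_on` fields can be sketched as a topological sort. This uses Python's standard `graphlib` module purely to illustrate the pattern. How Fleet schedules steps internally is not described here, and the `steps` dictionary is a hand-copied assumption that mirrors the `runbook-audit` workflow.

```python
from graphlib import TopologicalSorter

# Step name -> the steps it depends on, copied by hand from the
# runbook-audit workflow (an illustrative assumption, not parsed config).
steps = {
    "audit": [],
    "verify": ["audit"],
    "oncall-lead-ok": ["audit", "verify"],
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(steps).static_order())
print(order)  # ['audit', 'verify', 'oncall-lead-ok']
```

Because the approval step depends on both earlier steps, it can never run before the verification review has produced its flags.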

Guardrails that matter here

  • Corrections are verified by a second pass before any human sees them — a correction the corpus can't support gets flagged, not approved by momentum
  • The lead's approval gates every change to documents people follow under pressure
  • A weekly cadence plus fingerprint-based skipping keeps cost proportional to what actually changed

Who this is for

SRE and platform teams whose incident response depends on runbooks, and who have been burned by following one into a renamed world.

Frequently asked questions

Can it verify steps that touch live systems?

It verifies against the corpus — configs, service docs, alert definitions committed to the repo. Steps whose truth lives only in a live console are flagged as unverifiable, which is itself useful: those are the steps most likely to be stale.

How is this different from a docs linter?

A linter checks links and formatting. This workflow reads the runbook's meaning against the current deploy configs and service docs. Noticing that 'the failover command references a service that no longer exists' is a semantic catch, and it is followed by a drafted fix and then a human-approved correction.
