
Fleet vs LangSmith: Agent Orchestration vs LLM Observability

LangSmith is an observability, tracing, and evaluation platform for LLM applications of any kind. Fleet is a self-hosted orchestration layer that runs and governs a team of Claude Code agents against your GitHub repositories. They sit at different layers — LangSmith watches LLM calls; Fleet runs the agents — and can be used together.

LangSmith, from the LangChain team, is a platform for debugging, testing, evaluating, and monitoring LLM applications. You instrument your app, and LangSmith captures every trace — prompts, tool calls, token usage, latency — so you can inspect runs, build evaluation datasets, and track quality over time. It is framework-agnostic and not limited to coding agents; teams use it for chatbots, RAG pipelines, and any LLM workload.

Fleet is not an observability platform. It is the layer that actually runs the agents: a single Go binary that launches Claude Code agents in defined roles (developer, reviewer, release-manager), reacts to GitHub label events, and hands work between roles through its Fabric event bus. Where LangSmith answers 'what did my LLM calls do and how good were they?', Fleet answers 'who does what, in what order, and with what guardrails?'.
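
To make the handoff idea concrete, here is a minimal sketch of label-triggered, role-to-role handoffs over an in-memory event bus. It is an illustration only, not Fleet's code or API: the `EventBus` class, the topic names (`label.added`, `pr.opened`, `pr.approved`), and the issue identifier are all assumptions invented for this example.

```python
from collections import defaultdict, deque

# Hypothetical names throughout: this is NOT Fleet's API, only a toy model of
# label-triggered handoffs between roles over a publish/subscribe bus.


class EventBus:
    """Minimal in-memory publish/subscribe bus."""

    def __init__(self):
        self._handlers = defaultdict(list)
        self._queue = deque()

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        self._queue.append((topic, payload))

    def run(self):
        # Drain the queue; handlers may publish follow-up events.
        while self._queue:
            topic, payload = self._queue.popleft()
            for handler in self._handlers[topic]:
                handler(payload)


def build_pipeline(bus, log):
    """Wire three roles so each one hands work to the next via an event."""

    def developer(issue):
        log.append(("developer", issue))
        bus.publish("pr.opened", issue)  # hand off to the reviewer

    def reviewer(issue):
        log.append(("reviewer", issue))
        bus.publish("pr.approved", issue)  # hand off to the release manager

    def release_manager(issue):
        log.append(("release-manager", issue))

    bus.subscribe("label.added", developer)
    bus.subscribe("pr.opened", reviewer)
    bus.subscribe("pr.approved", release_manager)


bus = EventBus()
audit_log = []
build_pipeline(bus, audit_log)
bus.publish("label.added", "issue-42")  # stands in for a GitHub label event
bus.run()
for role, issue in audit_log:
    print(role, issue)
```

Each role only knows which event it reacts to and which event it emits, so the order of work falls out of the subscriptions rather than a central script. The `audit_log` list stands in for the kind of per-agent event log the comparison describes.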

Choose Fleet if

You want to run and coordinate a governed team of Claude Code agents against real GitHub repositories, with event-driven handoffs and approval gates.

Choose LangSmith if

You are building any kind of LLM application and need deep tracing, evaluation datasets, prompt experimentation, and production monitoring of your model calls.

Fleet vs LangSmith: side by side

Feature | Fleet | LangSmith
Primary function | Multi-agent orchestration and governance for coding agents | Tracing, evaluation, and observability for LLM apps
Scope | Software development: dev, review, release | Any LLM application: chatbots, RAG, agents, pipelines
Deployment | Self-hosted Go binary on your infrastructure | Hosted SaaS (self-hosted available on Enterprise)
Evaluation | Built-in 6-dimension agent evaluation, plus a separate risk model that drives auto-quarantine | Dataset-based evals, LLM-as-judge, custom evaluators, regression testing
Tracing depth | Decision/fabric event log per agent (not LLM-call tracing) | Full call-level traces: prompts, tool calls, token usage, latency
Agent runner | Runs Claude Code as the agent runner | Not a runner; observes whatever LLM/framework you wrap
GitHub automation | Native label watcher, PR chain, release gate | Not applicable

Where Fleet is the better fit

  • Actually runs and coordinates the agents — LangSmith observes LLM calls but does not orchestrate a team of agents or drive a GitHub workflow
  • Event-driven role-to-role handoffs (dev → reviewer → release-manager) run autonomously through the Fabric event bus
  • Self-hosted single binary; your source code stays on your infrastructure and is sent only to your model backend and GitHub
  • Built-in governance: per-agent run-time budgets, 6-dimension evaluation, and a separate risk model that auto-quarantines at critical risk
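
As a rough illustration of the last point, the sketch below shows what "auto-quarantine at critical risk" could look like in code. Everything in it is assumed for the example: the four risk levels, the numeric thresholds, and the `Agent`, `classify`, and `apply_risk` names are hypothetical and are not taken from Fleet. The only behavior drawn from the text above is that an agent is quarantined when its risk reaches critical.

```python
from dataclasses import dataclass
from enum import IntEnum


class Risk(IntEnum):
    # Hypothetical levels; only "critical" is mentioned in the comparison.
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


@dataclass
class Agent:
    name: str
    quarantined: bool = False


def classify(score: float) -> Risk:
    """Map a risk score in [0, 1] to a level. Thresholds are illustrative."""
    if score >= 0.9:
        return Risk.CRITICAL
    if score >= 0.6:
        return Risk.HIGH
    if score >= 0.3:
        return Risk.MEDIUM
    return Risk.LOW


def apply_risk(agent: Agent, score: float) -> Risk:
    """Quarantine the agent only when the level reaches CRITICAL."""
    level = classify(score)
    if level == Risk.CRITICAL:
        agent.quarantined = True
    return level


developer = Agent("developer")
reviewer = Agent("reviewer")
print(apply_risk(developer, 0.95).name, developer.quarantined)
print(apply_risk(reviewer, 0.65).name, reviewer.quarantined)
```

Keeping the risk decision separate from evaluation scores, as the list above describes, means a high-scoring agent can still be stopped when its risk crosses the critical threshold.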

Where LangSmith is the better fit

  • Purpose-built, mature LLM observability: full call-level traces with prompts, tool calls, token usage, and latency that Fleet does not capture
  • Evaluation datasets, LLM-as-judge scoring, and prompt experimentation across any LLM workload, not just coding agents
  • Framework-agnostic — works with LangChain, LangGraph, or any instrumented app, in any domain
  • Production monitoring dashboards and alerting on quality and cost regressions at the model-call level

Pricing

LangSmith has a free Developer tier and paid Plus and Enterprise plans (seat-based, with usage-based charges for traces); self-hosting is an Enterprise option. See LangSmith's pricing page for current figures. Fleet has a free tier (one fleet) and a Team tier at $299/month per fleet with unlimited agent roles.

Do they compete, or coexist?

These are complementary layers. Fleet runs and coordinates your Claude Code agents; LangSmith can observe and evaluate the underlying model calls those agents make, provided you instrument them. A team can run Fleet for orchestration and governance while using LangSmith for deep LLM-call tracing and evaluation. Fleet's own 6-dimension evaluation is built for agent-level scoring and is not a replacement for LangSmith's call-level observability.

Frequently asked questions

Does Fleet replace LangSmith?

No. Fleet orchestrates and governs a team of coding agents; LangSmith traces and evaluates LLM calls. Fleet includes a built-in 6-dimension agent evaluation and a decision/fabric audit log, but it does not provide call-level prompt/token/latency tracing the way LangSmith does. If you need LLM observability, keep LangSmith and run Fleet alongside it.

Does Fleet track tokens like LangSmith?

No. Fleet's enforced budget is run time (cumulative seconds), and it meters agent status, run counts, and run duration — not token counts. LangSmith captures token usage per call. If token-level cost visibility is your goal, that is LangSmith's job, not Fleet's.
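
To show the difference in what gets metered, here is a minimal sketch of a budget counted in cumulative run-time seconds rather than tokens. The `RunTimeBudget` class, its method names, and the 600-second limit are hypothetical and chosen for illustration; this is not Fleet's implementation.

```python
class RunTimeBudget:
    """Tracks cumulative run time (seconds) and run count for one agent.

    Hypothetical sketch: no token counts are recorded anywhere, only
    how long runs took and how many there were.
    """

    def __init__(self, limit_seconds: int):
        self.limit_seconds = limit_seconds
        self.used_seconds = 0
        self.run_count = 0

    def record_run(self, duration_seconds: int) -> None:
        # Add a finished run's duration to the cumulative total.
        self.used_seconds += duration_seconds
        self.run_count += 1

    @property
    def remaining_seconds(self) -> int:
        return max(0, self.limit_seconds - self.used_seconds)

    def can_start(self) -> bool:
        # A new run is allowed only while cumulative time is under the limit.
        return self.used_seconds < self.limit_seconds


budget = RunTimeBudget(limit_seconds=600)  # illustrative limit
budget.record_run(250)
budget.record_run(300)
print(budget.can_start(), budget.remaining_seconds)  # True 50
budget.record_run(100)
print(budget.can_start(), budget.remaining_seconds)  # False 0
```

A budget like this caps how long an agent may run in total, but it says nothing about per-call token cost, which is the gap call-level tracing fills.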

Run your first agent fleet

One binary. Five minutes. See every agent, coordinate every handoff, and keep a full audit trail of what your fleet did.