
Fleet vs LangSmith: Agent Orchestration vs LLM Observability

Name: Fleet
Author: Fleet

LangSmith is an observability, tracing, and evaluation platform for LLM applications of any kind. Fleet is a self-hosted orchestration layer that runs and coordinates a team of Claude Code agents against your GitHub repositories. They sit at different layers — LangSmith watches LLM calls; Fleet runs the agents — and can be used together.

LangSmith, from the LangChain team, is a platform for debugging, testing, evaluating, and monitoring LLM applications. You instrument your app, and LangSmith captures every trace — prompts, tool calls, token usage, latency — so you can inspect runs, build evaluation datasets, and track quality over time. It is framework-agnostic and not limited to coding agents; teams use it for chatbots, RAG pipelines, and any LLM workload.

Fleet is not an observability platform. It is the layer that actually runs the agents: a single Go binary that launches Claude Code agents in defined roles (developer, reviewer, release-manager), reacts to GitHub label events, and hands work between roles through its Fabric event bus. LangSmith answers the question 'what did my LLM calls do, and how good were they?' Fleet answers the question 'who does what, in what order, and with what guardrails?'
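The role-to-role handoff described above can be illustrated with a small Go sketch. This is not Fleet's code or API: the role names come from the text, but the label names (`ready-for-dev`, `ready-for-review`, `ready-for-release`), the `Event` type, and the `Route`, `Handoff`, and `RunChain` functions are all hypothetical, chosen only to show how a label event might select a role and how a finished role might emit the next event.

```go
package main

import "fmt"

// Role names are taken from the text; every other identifier is hypothetical.
type Role string

const (
	Developer      Role = "developer"
	Reviewer       Role = "reviewer"
	ReleaseManager Role = "release-manager"
)

// Event is an assumed stand-in for a message on an event bus.
type Event struct {
	Issue int
	Label string // assumed GitHub label name
}

// labelToRole maps assumed label names to the role that picks up the work.
var labelToRole = map[string]Role{
	"ready-for-dev":     Developer,
	"ready-for-review":  Reviewer,
	"ready-for-release": ReleaseManager,
}

// nextLabel encodes the developer -> reviewer -> release-manager order.
// The release manager has no entry, so the chain ends there.
var nextLabel = map[Role]string{
	Developer: "ready-for-review",
	Reviewer:  "ready-for-release",
}

// Route returns the role that should handle an event, if the label is known.
func Route(e Event) (Role, bool) {
	r, ok := labelToRole[e.Label]
	return r, ok
}

// Handoff returns the follow-up event a role emits when it finishes,
// or false when there is no next role.
func Handoff(done Role, issue int) (Event, bool) {
	label, ok := nextLabel[done]
	if !ok {
		return Event{}, false
	}
	return Event{Issue: issue, Label: label}, true
}

// RunChain follows the handoffs from a starting event and returns
// the roles visited, in order.
func RunChain(start Event) []Role {
	var visited []Role
	e, more := start, true
	for more {
		role, known := Route(e)
		if !known {
			break
		}
		visited = append(visited, role)
		e, more = Handoff(role, e.Issue)
	}
	return visited
}

func main() {
	fmt.Println(RunChain(Event{Issue: 1, Label: "ready-for-dev"}))
	// prints: [developer reviewer release-manager]
}
```

A real orchestrator would run an agent between `Route` and `Handoff` and would add approval gates; the sketch only shows the ordering logic.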

Choose Fleet if

You want to run and coordinate a governed team of Claude Code agents against real GitHub repositories, with event-driven handoffs and approval gates.

Choose LangSmith if

You are building any kind of LLM application and need deep tracing, evaluation datasets, prompt experimentation, and production monitoring of your model calls.

Fleet vs. LangSmith: side by side

Feature	Fleet	LangSmith
Primary function	Multi-agent orchestration for coding agents — handoffs, approval gates, audit trail	Tracing, evaluation, and observability for LLM apps
Scope	Software development: dev, review, release	Any LLM application — chatbots, RAG, agents, pipelines
Deployment	Self-hosted Go binary on your infrastructure	Hosted SaaS (self-hosted available on Enterprise)
Evaluation	Built-in 6-dimension agent evaluation, plus a separate risk model that drives auto-quarantine	Dataset-based evals, LLM-as-judge, custom evaluators, regression testing
Tracing depth	Decision/fabric event log per agent (not LLM-call tracing)	Full call-level traces: prompts, tool calls, token usage, latency
Agent runner	Runs Claude Code as the agent runner	Not a runner — observes whatever LLM/framework you wrap
GitHub automation	Native label watcher, PR chain, release gate	Not applicable

Where Fleet is the better fit

Actually runs and coordinates the agents — LangSmith observes LLM calls but does not orchestrate a team of agents or drive a GitHub workflow
Event-driven role-to-role handoffs (dev → reviewer → release-manager) run autonomously through the Fabric event bus
Self-hosted single binary; your source code stays on your infrastructure and is sent only to your model backend and GitHub
Built-in governance: per-agent run-time budgets, 6-dimension evaluation, and a separate risk model that auto-quarantines at critical risk
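The auto-quarantine rule in the last item can be pictured with a minimal Go sketch. It is an illustration under stated assumptions, not Fleet's risk model: the four risk levels, the numeric thresholds, the 0-to-1 score, and the `Agent`, `Classify`, and `Apply` names are all invented here. The only behavior taken from the text is that quarantine happens at critical risk.

```go
package main

import "fmt"

// RiskLevel is a hypothetical ordering of risk bands.
type RiskLevel int

const (
	Low RiskLevel = iota
	Medium
	High
	Critical
)

// Classify maps an assumed 0..1 risk score to a level.
// The thresholds are made up for illustration.
func Classify(score float64) RiskLevel {
	switch {
	case score >= 0.9:
		return Critical
	case score >= 0.6:
		return High
	case score >= 0.3:
		return Medium
	default:
		return Low
	}
}

// Agent is a hypothetical record of one agent's governance state.
type Agent struct {
	Name        string
	Quarantined bool
}

// Apply classifies the score and quarantines the agent only when
// the level is Critical. Lower levels leave the agent running.
func Apply(a *Agent, score float64) RiskLevel {
	level := Classify(score)
	if level == Critical {
		a.Quarantined = true
	}
	return level
}

func main() {
	a := &Agent{Name: "developer"}
	Apply(a, 0.7)
	fmt.Println(a.Quarantined) // prints: false
	Apply(a, 0.95)
	fmt.Println(a.Quarantined) // prints: true
}
```

Keeping classification separate from the quarantine action mirrors the text's point that the risk model is distinct from the 6-dimension evaluation.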

Where LangSmith is the better fit

Purpose-built, mature LLM observability: full call-level traces with prompts, tool calls, token usage, and latency that Fleet does not capture
Evaluation datasets, LLM-as-judge scoring, and prompt experimentation across any LLM workload, not just coding agents
Framework-agnostic — works with LangChain, LangGraph, or any instrumented app, in any domain
Production monitoring dashboards and alerting on quality and cost regressions at the model-call level

Pricing

LangSmith has a free Developer tier and paid Plus and Enterprise plans (seat-based, with usage-based charges on traces); self-hosting is an Enterprise option. See LangSmith's pricing page for current figures. Fleet has a free tier (one fleet) and a Team tier at $299/month per fleet with unlimited agent roles.

Do they compete, or coexist?

These are complementary layers. Fleet runs and coordinates your Claude Code agents; LangSmith can observe and evaluate the underlying model calls those agents make, provided you instrument them. A team can run Fleet for orchestration and governance while using LangSmith for deep LLM-call tracing and evaluation. Fleet's own 6-dimension evaluation is built for agent-level scoring and is not a replacement for LangSmith's call-level observability.

Frequently asked questions

Does Fleet replace LangSmith?

No. Fleet orchestrates and governs a team of coding agents; LangSmith traces and evaluates LLM calls. Fleet includes a built-in 6-dimension agent evaluation and a decision/fabric audit log, but it does not provide call-level prompt/token/latency tracing the way LangSmith does. If you need LLM observability, keep LangSmith and run Fleet alongside it.

Does Fleet track tokens like LangSmith?

No. Fleet's enforced budget is run time (cumulative seconds), and it meters agent status, run counts, and run duration — not token counts. LangSmith captures token usage per call. If token-level cost visibility is your goal, that is LangSmith's job, not Fleet's.


Run your first agent fleet

One binary. Five minutes. See every agent, coordinate every handoff, and keep a full audit trail of what your fleet did.
