Observability · Production · Cost · May 19, 2026 · 7 min read

Observability for AI agents: what to log when the model is the bug

Latency, cold starts, token burn, error rates, traces. The five panels every team running an agent in production should be staring at by week two.

Gautam Manak

Founder, doc2mcp

[Image: glass cards labelled Latency p95, Cold Starts, Token Burn, Error Rate and Traces orbiting a glowing EKG-pulse sphere on a dark navy background.]

The first agent your team ships will work great in demos and fall over silently in production. Not because the model is bad — because you can't see what it's doing. Observability for AI agents is different from observability for regular services in three uncomfortable ways:

The unit of work is variable — one prompt can fan out to 14 tool calls.
The dominant cost is tokens, not CPU. You can be 100% available and bankrupt.
Failures are often "wrong answer", not "exception". Status codes don't catch them.

Panel 1 — Latency p95, per tool

Don't average. Watch p95, broken out per tool, with a small sparkline. The 5% of calls hitting a 4-second cold start are the ones your users actually feel. We alert on a p95 jump larger than 1.5x the rolling 24-hour baseline.
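To make the "don't average" point concrete, here is a minimal sketch of a per-tool p95 and the 1.5x baseline check. It assumes you already collect `(tool, latency)` pairs somewhere; the function names, the nearest-rank percentile method and the in-memory dicts are illustrative choices, not a prescribed implementation.

```python
from collections import defaultdict


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty list of latencies."""
    ordered = sorted(samples)
    # Integer ceil(0.95 * n) to avoid floating-point rounding surprises.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def p95_by_tool(calls):
    """calls: iterable of (tool_name, latency_seconds) pairs."""
    by_tool = defaultdict(list)
    for tool, latency in calls:
        by_tool[tool].append(latency)
    return {tool: p95(latencies) for tool, latencies in by_tool.items()}


def latency_alerts(current, baseline, factor=1.5):
    """Tools whose current p95 exceeds factor x their baseline p95."""
    return sorted(
        tool
        for tool, value in current.items()
        if tool in baseline and value > factor * baseline[tool]
    )
```

With 90 calls at 0.2s and 10 calls at 4.0s, the mean is a comfortable-looking 0.58s while the p95 is 4.0s, which is the number the slowest users experience.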

Panel 2 — Cold start rate

If you're running serverless tools, the first call to a cold lambda can be ten times your usual latency. Track cold starts as a first-class metric, not a footnote in latency. Pin the hot path (high-traffic tools) to provisioned concurrency.
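One cheap way to count cold starts, sketched below, is a module-level flag: on platforms that reuse a process between invocations (AWS Lambda behaves this way), module-scope code runs once per fresh instance, so the first call through the handler can tag itself as cold. The handler shape, field names and the stand-in work are all hypothetical.

```python
import time

_cold = True  # module scope: set once when a fresh runtime instance loads this file


def handler(event, context=None):
    """Hypothetical serverless entry point that tags each call as cold or warm."""
    global _cold
    was_cold, _cold = _cold, False
    started = time.perf_counter()
    result = {"echo": event}  # stand-in for the real tool work
    # Note: this measures handler time only, not the platform's init time.
    elapsed = time.perf_counter() - started
    return {"result": result, "cold_start": was_cold, "latency_s": elapsed}


def cold_start_rate(records):
    """Fraction of invocations that were cold starts (0.0 if there are none)."""
    records = list(records)
    if not records:
        return 0.0
    return sum(1 for r in records if r["cold_start"]) / len(records)
```

Emitting `cold_start` as its own field lets you chart the rate separately and also split the latency panel into cold and warm series.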

Panel 3 — Token burn

Two numbers: tokens-in and tokens-out, per route, per minute. Multiply each by your model's per-million-token rate (input and output tokens are usually priced differently) to get a live cost dial. The first time you watch a spike correlate with a single looping prompt, you'll never stop tracking this.

alert: token_burn_per_minute > 4x rolling_15min_baseline
window: 5m
notify: oncall
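The cost dial and the alert rule above can be sketched in a few lines. The prices passed in are placeholders, not any real model's rates, and the alert logic is one possible reading of the rule: every one of the last 5 minutes must exceed 4x the mean of the 15 minutes before them.

```python
def cost_per_minute(tokens_in, tokens_out, usd_per_m_in, usd_per_m_out):
    """Dollar cost of one minute of traffic, given per-million-token rates."""
    return (
        tokens_in / 1_000_000 * usd_per_m_in
        + tokens_out / 1_000_000 * usd_per_m_out
    )


def burn_alert(tokens_per_minute, factor=4.0, baseline_minutes=15, window_minutes=5):
    """tokens_per_minute: total tokens for each minute, oldest first.

    Assumed semantics: fire only when every minute in the recent window
    exceeds factor x the mean of the baseline minutes preceding it.
    """
    needed = baseline_minutes + window_minutes
    if len(tokens_per_minute) < needed:
        return False  # not enough history to form a baseline yet
    recent = tokens_per_minute[-window_minutes:]
    baseline = tokens_per_minute[-needed:-window_minutes]
    baseline_mean = sum(baseline) / len(baseline)
    return all(minute > factor * baseline_mean for minute in recent)
```

Keeping the baseline separate from the window being judged means a sustained spike cannot drag its own baseline upward and silence the alert.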

Panel 4 — Error rate, by class

Bucket your errors:

Hard errors — 5xx, timeouts, tool-not-found. Page someone.
Soft errors — 4xx the model could recover from (auth, rate limit, validation). Track the recovery rate.
Wrong-answer signals — user thumbs-down, follow-up edits within 30s, "let me try again" patterns. The hardest to measure, the most important.
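The first two buckets are mechanical enough to sketch. The version below assumes each tool attempt reports an HTTP-style status code plus timeout and tool-lookup flags, and it treats a soft error as recovered if a later attempt in the same task succeeds; both the input shape and that definition of recovery are assumptions. Wrong-answer signals need product-level events and are left out here.

```python
HARD, SOFT, OK = "hard", "soft", "ok"


def classify(status_code=None, timed_out=False, tool_found=True):
    """Map one tool attempt onto the hard / soft / ok buckets."""
    if timed_out or not tool_found:
        return HARD
    if status_code is not None and status_code >= 500:
        return HARD
    if status_code is not None and 400 <= status_code < 500:
        return SOFT
    return OK


def recovery_rate(tasks):
    """tasks: list of per-task attempt classes, each in chronological order.

    A soft error counts as recovered when any later attempt in the same
    task is OK. Returns None when there were no soft errors at all.
    """
    soft = recovered = 0
    for attempts in tasks:
        for i, cls in enumerate(attempts):
            if cls == SOFT:
                soft += 1
                if OK in attempts[i + 1:]:
                    recovered += 1
    return recovered / soft if soft else None
```

A falling recovery rate is a useful early warning: the tool's error rate can stay flat while the model gets worse at retrying sensibly.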

Panel 5 — Traces with tool spans

Every conversation = one trace. Every tool call = one span. Even a cheap version of this (just a flat list with timing) makes the difference between "the agent is slow" and "the third call to search_docs took 2.3s because we paginated wrong".
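The "cheap version" can be as small as a context manager that appends timing records to a list. This is a sketch using only the standard library; the `Trace` class and its fields are made up for illustration and are not an OpenTelemetry-compatible format.

```python
import time
import uuid
from contextlib import contextmanager


class Trace:
    """One conversation: a trace ID plus a flat list of tool spans."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []

    @contextmanager
    def span(self, tool, **attrs):
        """Time one tool call; the span is recorded even if the call raises."""
        started = time.perf_counter()
        record = {"tool": tool, "attrs": attrs, "error": None}
        try:
            yield record
        except Exception as exc:
            record["error"] = repr(exc)
            raise
        finally:
            record["duration_s"] = time.perf_counter() - started
            self.spans.append(record)

    def slowest(self):
        """The span with the longest duration, or None if there are no spans."""
        return max(self.spans, key=lambda s: s["duration_s"], default=None)
```

Wrapping each call as `with trace.span("search_docs", page=3): ...` is enough to answer "which call was slow, and with what arguments" without adopting a full tracing stack on day one.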

Three alerts worth waking up for

p95 latency on the primary user-facing tool > 3s for 5 min
token burn > 4× baseline for 5 min
thumbs-down rate > 5% in any 100-message window
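The thumbs-down rule is the least standard of the three, so here is a rough sketch of it as a sliding window over the most recent messages. It assumes every message yields a boolean feedback signal and that the alert should only be evaluated on a full window; the class name and defaults are illustrative.

```python
from collections import deque


class ThumbsDownWindow:
    """Sliding window over the last `size` messages' feedback."""

    def __init__(self, size=100, threshold=0.05):
        self.size = size
        self.threshold = threshold
        self.votes = deque(maxlen=size)  # oldest vote drops off automatically

    def rate(self):
        """Share of thumbs-down votes in the current window."""
        return sum(self.votes) / len(self.votes) if self.votes else 0.0

    def should_alert(self):
        # Only judge full windows, so a single early thumbs-down can't page anyone.
        return len(self.votes) == self.size and self.rate() > self.threshold

    def record(self, thumbs_down):
        """Add one message's feedback and report whether the alert fires."""
        self.votes.append(bool(thumbs_down))
        return self.should_alert()
```

Because the comparison is strict, exactly 5 thumbs-down in 100 messages stays quiet and the sixth one fires.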

You don't ship a model. You ship a system. The system needs the same observability hygiene your CRUD apps already have.
