Observability for AI agents: what to log when the model is the bug
Latency, cold starts, token burn, error rates, traces. The five panels every team running an agent in production should be staring at by week two.
Gautam Manak
Founder, doc2mcp

The first agent your team ships will work great in demos and fall over silently in production. Not because the model is bad — because you can't see what it's doing. Observability for AI agents is different from observability for regular services in three uncomfortable ways:
- The unit of work is variable — one prompt can fan out to 14 tool calls.
- The dominant cost is tokens, not CPU. You can be 100% available and bankrupt.
- Failures are often "wrong answer", not "exception". Status codes don't catch them.
Panel 1 — Latency p95, per tool
Don't average. Watch p95, broken out per tool, with a small sparkline. The 5% of calls hitting a 4-second cold start are the ones your users actually feel. We alert on a p95 jump larger than 1.5x the rolling 24-hour baseline.
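A minimal sketch of that check, assuming you already collect `(tool, latency)` pairs somewhere. The function names and the shape of the input are illustrative, not taken from any particular metrics library:

```python
from collections import defaultdict
from statistics import quantiles


def p95(samples):
    """Return the 95th percentile of a list of latencies in seconds."""
    if len(samples) < 2:
        return samples[0] if samples else 0.0
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return quantiles(samples, n=100, method="inclusive")[94]


def p95_by_tool(calls):
    """calls: iterable of (tool_name, latency_seconds) pairs."""
    by_tool = defaultdict(list)
    for tool, latency in calls:
        by_tool[tool].append(latency)
    return {tool: p95(values) for tool, values in by_tool.items()}


def latency_alerts(current, baseline, factor=1.5):
    """Return the tools whose current p95 exceeds factor x their baseline p95."""
    return sorted(
        tool
        for tool, value in current.items()
        if tool in baseline and value > factor * baseline[tool]
    )
```

Feed `latency_alerts` the p95 map for the last few minutes as `current` and the p95 map for the trailing 24 hours as `baseline`. How you window those two inputs depends on your metrics store.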
Panel 2 — Cold start rate
If you're running serverless tools, the first call to a cold lambda can take ten times your usual latency. Track cold starts as a first-class metric, not a footnote in latency. Pin the hot path (your high-traffic tools) to provisioned concurrency.
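One way to make cold starts a metric of their own is to tag each tool call and compute a rate per tool. The sketch below assumes your tool runtime can set a `cold_start` flag on each call record. That field and the `ToolCall` type are hypothetical:

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    latency_s: float
    cold_start: bool  # hypothetical flag reported by the tool runtime


def cold_start_rates(calls):
    """Return the fraction of calls per tool that hit a cold start."""
    totals, cold = {}, {}
    for call in calls:
        totals[call.tool] = totals.get(call.tool, 0) + 1
        cold[call.tool] = cold.get(call.tool, 0) + (1 if call.cold_start else 0)
    return {tool: cold[tool] / totals[tool] for tool in totals}
```

Tools with both high traffic and a high cold start rate are the likely candidates for provisioned concurrency.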
Panel 3 — Token burn
Two numbers: tokens-in and tokens-out, per route, per minute. Multiply each by your model's per-million-token rate (input and output are usually priced differently) and divide by one million to get a live cost dial. The first time you watch a spike correlate with a single looping prompt, you'll never stop tracking this.
```text
alert: token_burn_per_minute > 4x rolling_15min_baseline
window: 5m
notify: oncall
```
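Here is a rough Python sketch of the cost dial and the alert rule above. It reads the rule as "every one of the last 5 minutes is above 4x the mean of the 15 minutes before them", which is one plausible interpretation rather than the only one. The prices passed in are placeholders you would replace with your model's actual rates:

```python
def cost_per_minute(tokens_in, tokens_out, usd_per_m_in, usd_per_m_out):
    """Convert one minute of token counts into dollars.

    usd_per_m_in and usd_per_m_out are per-million-token prices (placeholders).
    """
    return (tokens_in * usd_per_m_in + tokens_out * usd_per_m_out) / 1_000_000


def burn_alert(per_minute_tokens, factor=4.0, window=5, baseline_window=15):
    """per_minute_tokens: total tokens per minute, oldest first.

    Fires when each of the last `window` minutes exceeds `factor` times the
    mean of the `baseline_window` minutes that came before them.
    """
    if len(per_minute_tokens) < window + baseline_window:
        return False  # not enough history yet
    recent = per_minute_tokens[-window:]
    prior = per_minute_tokens[-(window + baseline_window):-window]
    baseline = sum(prior) / len(prior)
    return all(tokens > factor * baseline for tokens in recent)
```

Keeping the baseline window separate from the alert window means a sustained spike does not inflate its own baseline.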
Panel 4 — Error rate, by class
Bucket your errors:
- Hard errors — 5xx, timeouts, tool-not-found. Page someone.
- Soft errors — 4xx the model could recover from (auth, rate limit, validation). Track the recovery rate.
- Wrong-answer signals — user thumbs-down, follow-up edits within 30s, "let me try again" patterns. The hardest to measure, the most important.
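The first two buckets can be assigned mechanically. A small sketch, assuming each tool result gives you a status code, a timeout flag, and whether the tool was found (the parameter names are made up for illustration):

```python
from enum import Enum


class ErrorClass(Enum):
    HARD = "hard"
    SOFT = "soft"
    OK = "ok"


def classify(status_code=None, timed_out=False, tool_found=True):
    """Bucket one tool result into hard error, soft error, or OK."""
    if timed_out or not tool_found:
        return ErrorClass.HARD
    if status_code is not None and 500 <= status_code <= 599:
        return ErrorClass.HARD
    if status_code is not None and 400 <= status_code <= 499:
        return ErrorClass.SOFT
    return ErrorClass.OK


def recovery_rate(recovered_flags):
    """recovered_flags: one bool per soft error, True if the model recovered."""
    flags = list(recovered_flags)
    return sum(flags) / len(flags) if flags else 0.0
```

Wrong-answer signals do not fit this function, because they come from user behavior rather than from the tool response, so they need their own pipeline.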
Panel 5 — Traces with tool spans
Every conversation = one trace. Every tool call = one span. Even a cheap version of this (just a flat list with timing) makes the difference between "the agent is slow" and "the third call to search_docs took 2.3s because we paginated wrong".
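The cheap version really can be a flat list with timing. A minimal sketch using only the standard library; the `Trace` class is hypothetical and stands in for whatever tracing SDK you eventually adopt:

```python
import time
from contextlib import contextmanager


class Trace:
    """One trace per conversation, holding a flat list of tool spans."""

    def __init__(self, conversation_id):
        self.conversation_id = conversation_id
        self.spans = []

    @contextmanager
    def span(self, tool, clock=time.perf_counter):
        """Time one tool call and record it, even if the call raises."""
        start = clock()
        try:
            yield
        finally:
            self.spans.append({"tool": tool, "duration_s": clock() - start})

    def slowest(self):
        """Return the longest span, or None if nothing was recorded."""
        return max(self.spans, key=lambda s: s["duration_s"], default=None)
```

Wrap each tool invocation in `with trace.span("search_docs"):` and log `trace.spans` when the conversation ends. The `clock` parameter exists only so the timing can be faked in tests.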
Three alerts worth waking up for
- p95 latency on the primary user-facing tool > 3s for 5 min
- token burn > 4× baseline for 5 min
- thumbs-down rate > 5% in any 100-message window
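The thumbs-down alert is the least standard of the three, so here is a sketch of one way to evaluate it with a sliding window. It assumes you get one feedback value per message and treats a missing rating as "not thumbs-down". The class name is illustrative:

```python
from collections import deque


class ThumbsDownMonitor:
    """Sliding window over the most recent messages' feedback."""

    def __init__(self, window=100, threshold=0.05):
        self.window = deque(maxlen=window)
        self.threshold = threshold

    def record(self, thumbs_down):
        """Record one message; return True if the alert should fire."""
        self.window.append(bool(thumbs_down))
        if len(self.window) < self.window.maxlen:
            return False  # wait until the window is full
        return sum(self.window) / len(self.window) > self.threshold
```

Because the deque drops the oldest message on every append, each call checks the most recent 100 messages, which covers "any 100-message window" as traffic flows through.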
You don't ship a model. You ship a system. The system needs the same observability hygiene your CRUD apps already have.
