The first time I shipped an agent without proper observability, I did not notice quality degradation for nine days. Token costs were stable, latency was fine, error rates were nominal. The agent was answering questions, completing tasks, returning structured output. Then a customer pointed out that the action it was taking on their behalf had subtly changed shape after an upstream API update. By the time we caught it, we had been wrong on roughly four percent of runs for over a week.

That experience is the reason this piece exists. Agent observability is not a port of LLM observability. It is its own discipline, and most production failures I have watched were detectable in the trace if anyone had been looking. This is the schema, the tracing approach, the golden signals, and the alerting policy I would hand to an engineering lead instrumenting their first agent in production in 2026.

Why agent observability is different

A traditional service trace is short, deterministic, and shallow. A GET request goes through a router, a controller, and a database, then returns. An agent run is long (sometimes minutes), non-deterministic (the model picks the path), and deep (each tool call has its own LLM context). Standard APM tooling captures the spans, but the spans mean something different: they record the agent's decisions, not just hops between services.

Three properties matter for agents specifically. First, the trace is the work. Unlike a stateless API call, the trace contains the agent's reasoning chain; losing the trace means losing the only record of why the agent did what it did. Second, the output is non-deterministic. Two identical inputs may produce different outputs; observability must account for variance, not just errors. Third, quality is a property of the output, not the call. A 200 response from a tool can still carry wrong content; you cannot infer quality from the status code.

What to log

Start with the OpenTelemetry GenAI semantic conventions as the baseline. They define standard attributes (OpenTelemetry GenAI, retrieved 2026) and most modern observability vendors ingest them directly.
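
As a rough illustration, the fields this piece lists in its FAQ can be collected into one typed record. The field names below are hypothetical placeholders, not the OpenTelemetry attribute names; in a real deployment they would be mapped onto the attributes the conventions define. The sketch also shows payloads being hashed rather than stored raw.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Any, Optional


def payload_hash(payload: Any) -> str:
    """SHA-256 of a JSON-serialisable payload; keys are sorted so dict order is irrelevant."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class ToolCallRecord:
    # All field names are illustrative assumptions, not a published schema.
    run_id: str
    parent_run_id: Optional[str]
    tenant_id: str
    model: str
    model_version: str
    tool_name: str
    arguments_hash: str
    output_hash: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cost_usd: float
    outcome: str  # "success", "partial", or "failed"


record = ToolCallRecord(
    run_id="run-001",
    parent_run_id=None,
    tenant_id="tenant-42",
    model="example-model",
    model_version="2026-01-01",
    tool_name="lookup_order",
    arguments_hash=payload_hash({"order_id": "A123"}),
    output_hash=payload_hash({"status": "shipped"}),
    latency_ms=182.5,
    input_tokens=640,
    output_tokens=95,
    cost_usd=0.0041,
    outcome="success",
)
print(json.dumps(asdict(record), indent=2))
```

Hashing keeps the record small and avoids storing customer payloads, at the price of not being able to read the arguments back; teams that need replay usually keep the raw payload in a separate, access-controlled store keyed by the same hash.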

Per-run attributes

Per-LLM-call attributes

Per-tool-call attributes

Distributed tracing for agents

One run is one trace. Inside the run, each LLM call and each tool call gets its own span, parented to the run span. Sub-agent calls get their own nested run spans. This shape makes a multi-step agent legible: you can read the trace as a depth-first traversal of the reasoning tree.
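
To make the shape concrete, here is a minimal, library-free sketch of a run span with LLM, tool, and nested sub-agent children, read back as a depth-first outline. The Span class and the agent and tool names are invented for illustration; a real system would create these spans through a tracing SDK instead.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class Span:
    name: str
    kind: str  # "run", "llm", or "tool"
    children: List["Span"] = field(default_factory=list)

    def child(self, name: str, kind: str) -> "Span":
        """Create a span parented to this one and return it."""
        span = Span(name, kind)
        self.children.append(span)
        return span


def walk(span: Span, depth: int = 0) -> Iterator[Tuple[int, Span]]:
    """Depth-first traversal: a span first, then each child subtree in call order."""
    yield depth, span
    for child in span.children:
        yield from walk(child, depth + 1)


# Hypothetical run: a support agent that delegates to a refund sub-agent.
run = Span("support-agent", "run")
run.child("plan", "llm")
run.child("lookup_order", "tool")
sub_run = run.child("refund-agent", "run")  # the sub-agent gets its own nested run span
sub_run.child("decide", "llm")
sub_run.child("issue_refund", "tool")
run.child("summarise", "llm")

outline = [f"{'  ' * depth}{span.kind}: {span.name}" for depth, span in walk(run)]
print("\n".join(outline))
```

Reading the outline top to bottom gives the order in which the agent reasoned and acted, with indentation showing which run each step belonged to.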

Span attributes worth standardising

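The OpenTelemetry GenAI conventions cover the common attributes across model providers, but do not assume they are frozen: much of the specification has carried a development status rather than a stable one, so check the current status and pin the convention version you emit. Span names are built from the operation and its target rather than from a single fixed string: an LLM call is named by its operation followed by the model (for example chat plus the model name), and a tool call is named execute_tool followed by the tool name. Use the canonical names; vendor-specific names break portability.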

Sampling strategy

Storing every span is expensive at scale. A common pattern is 100 percent sampling of failed and partial runs, 10 percent of successful runs, plus a separate ring buffer of the most recent 100 runs at full fidelity for debugging. The math (run rate × span count × bytes per span × retention) should drive the sampling rate, not the other way around.
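
A minimal sketch of that pattern follows, assuming each run arrives as a dict with an outcome field (a hypothetical shape). The random generator is seeded only so the example is reproducible.

```python
import random
from collections import deque
from typing import Deque, Dict, List

SUCCESS_SAMPLE_RATE = 0.10  # 10 percent of successful runs
RING_BUFFER_SIZE = 100      # most recent runs kept at full fidelity


class RunSampler:
    def __init__(self, seed: int = 0) -> None:
        self._rng = random.Random(seed)
        self.recent: Deque[Dict] = deque(maxlen=RING_BUFFER_SIZE)
        self.persisted: List[Dict] = []  # stand-in for long-term trace storage

    def should_persist(self, outcome: str) -> bool:
        """Keep every failed or partial run, and a fixed fraction of successes."""
        if outcome in ("failed", "partial"):
            return True
        return self._rng.random() < SUCCESS_SAMPLE_RATE

    def record(self, run: Dict) -> None:
        self.recent.append(run)  # the deque silently drops the oldest run once full
        if self.should_persist(run["outcome"]):
            self.persisted.append(run)


sampler = RunSampler()
for i in range(1000):
    outcome = "failed" if i % 50 == 0 else "success"
    sampler.record({"run_id": i, "outcome": outcome})

print(len(sampler.recent), len(sampler.persisted))
```

The decision is made once per run, after the outcome is known, so every span of a kept run is kept together; sampling individual spans would leave traces with holes in them.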

Golden signals adapted for agents

Google SRE's golden signals (latency, traffic, errors, saturation) are the right starting point, but agents need two more: quality and cost.

Latency

P50, P95, P99 end-to-end. Separately track per-call P95 for the LLM and per-tool P95. End-to-end latency drifts upward when the agent grows a longer reasoning chain; per-call drift signals a model or vendor problem.
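
One way to compute these percentiles with the standard library is sketched below. The series names and the synthetic latencies are made up for the example; in practice the samples would come from span durations grouped by span kind.

```python
import statistics
from typing import Dict, List


def latency_percentiles(samples_ms: List[float]) -> Dict[str, float]:
    """Return P50, P95 and P99 for one latency series (needs at least two samples)."""
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    # quantiles() returns 99 cut points; index k - 1 holds the k-th percentile.
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


# Synthetic, evenly spaced samples so the percentiles are easy to check by eye.
series = {
    "end_to_end": [float(ms) for ms in range(0, 10_100, 100)],
    "llm_call": [float(ms) for ms in range(0, 1_010, 10)],
    "tool:lookup_order": [float(ms) for ms in range(0, 505, 5)],
}
report = {name: latency_percentiles(samples) for name, samples in series.items()}
for name, stats in report.items():
    print(name, stats)
```

Keeping the end-to-end series separate from the per-call series is what lets you tell a longer reasoning chain (end-to-end rises, per-call flat) from a slow model or vendor (per-call rises).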

Traffic

Runs per second, with breakouts by agent, by tenant, by tool. Useful for capacity planning and for spotting attack patterns (a tenant suddenly running 100x normal volume is a signal).

Errors

Tool errors (4xx, 5xx, timeouts), model errors (rate limits, refusals, schema validation failures), and structural errors (the agent returned but the output failed downstream validation). Errors must be classified, not aggregated.
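
A small sketch of classification at ingest time, assuming each event is a dict with a source field and a few optional details (an invented shape, not a standard one):

```python
from collections import Counter
from enum import Enum
from typing import Dict, Optional


class ErrorClass(Enum):
    TOOL = "tool"              # 4xx, 5xx, timeouts
    MODEL = "model"            # rate limits, refusals, schema validation failures
    STRUCTURAL = "structural"  # run returned, but output failed downstream validation


MODEL_FAILURE_REASONS = {"rate_limit", "refusal", "schema_validation"}


def classify(event: Dict) -> Optional[ErrorClass]:
    """Map one event to an error class, or None if it is not an error."""
    source = event["source"]
    if source == "tool":
        if event.get("timed_out", False) or event.get("status", 200) >= 400:
            return ErrorClass.TOOL
    elif source == "model":
        if event.get("reason") in MODEL_FAILURE_REASONS:
            return ErrorClass.MODEL
    elif source == "run":
        if not event.get("output_valid", True):
            return ErrorClass.STRUCTURAL
    return None


events = [
    {"source": "tool", "status": 503},
    {"source": "tool", "status": 200},
    {"source": "tool", "timed_out": True},
    {"source": "model", "reason": "rate_limit"},
    {"source": "model", "reason": "refusal"},
    {"source": "run", "output_valid": False},
]
counts = Counter(c for c in map(classify, events) if c is not None)
print({error_class.value: n for error_class, n in counts.items()})
```

Counting per class keeps a burst of tool timeouts from hiding a slow rise in structural errors, which is what a single aggregate error rate would do.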

Saturation

Queue depth for runs waiting to start, rate-limit headroom against model providers, budget headroom against per-tenant caps. Saturation predicts the next-five-minute failure mode.

Quality

The fifth signal, and the first agent-native one. Pass rate against a labelled eval set, refreshed via sampling from production. Drift on this signal predicts customer-visible failures before they reach support tickets.

Cost

Cost per run, cost per outcome, and cost per tenant. The accounting signal is operationally meaningful: a 3x spike in cost per run for the same outcome shape is a regression, even if quality and latency hold.
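
A sketch of the regression check, comparing median cost per run between a baseline window and the current window for the same outcome shape. The 3x threshold comes from the text; the function names and the sample costs are illustrative.

```python
import statistics
from typing import Sequence


def cost_ratio(baseline_costs: Sequence[float], current_costs: Sequence[float]) -> float:
    """Median cost per run now, relative to the baseline window."""
    return statistics.median(current_costs) / statistics.median(baseline_costs)


def cost_per_outcome(total_cost_usd: float, successful_outcomes: int) -> float:
    """Total spend divided by the runs that actually achieved the outcome."""
    if successful_outcomes == 0:
        return float("inf")
    return total_cost_usd / successful_outcomes


baseline = [0.04, 0.05, 0.06, 0.05]
current = [0.16, 0.18, 0.17, 0.19]
ratio = cost_ratio(baseline, current)
is_regression = ratio >= 3.0
print(round(ratio, 2), is_regression)
```

The median is used here so a handful of unusually long runs does not trip the check; a team that cares about tail cost would track a high percentile alongside it.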

Quality drift detection

Quality is the signal most teams instrument last and most teams regret instrumenting last. The general pattern: pick a stable representative sample of agent inputs (50-200 per agent), run them through an automatic eval (string match, semantic match against gold, judge model with a rubric), score them, track the pass rate over time. When pass rate falls below threshold for two consecutive sample windows, alert.
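
The alert rule in the last sentence can be sketched in a few lines. The threshold and window values are examples; the function names are hypothetical.

```python
from typing import Sequence


def pass_rate(results: Sequence[bool]) -> float:
    """Fraction of eval cases that passed in one sample window."""
    return sum(results) / len(results)


def should_alert(window_pass_rates: Sequence[float], threshold: float,
                 consecutive: int = 2) -> bool:
    """Alert only when the most recent `consecutive` windows are all below threshold."""
    if len(window_pass_rates) < consecutive:
        return False
    return all(rate < threshold for rate in window_pass_rates[-consecutive:])


history = [0.95, 0.94, 0.91, 0.90]  # pass rate per sample window, oldest first
print(should_alert(history, threshold=0.92))
```

Requiring two windows trades a little detection delay for far fewer pages from a single noisy sample, which matters when the eval set is only 50 to 200 inputs.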

Causes of drift fall into three buckets. Model drift: provider quietly updated the model weights. Prompt drift: someone edited the system prompt without re-running evals. Upstream drift: a tool's API changed shape (a field renamed, a response truncated) and the agent's parsing now misses the relevant content. The trace tells you which.
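
One way to let the trace answer that question is to record a small fingerprint on every run and diff it against a known-good baseline. The fingerprint keys below (model_version, system_prompt_sha256, tool_response_fields) are assumptions made for this sketch, not standard attributes.

```python
from typing import Dict, List


def drift_causes(baseline: Dict, current: Dict) -> List[str]:
    """Compare two run fingerprints and name the buckets that changed."""
    causes: List[str] = []
    if baseline["model_version"] != current["model_version"]:
        causes.append("model drift")
    if baseline["system_prompt_sha256"] != current["system_prompt_sha256"]:
        causes.append("prompt drift")
    for tool, fields in baseline["tool_response_fields"].items():
        seen_now = current["tool_response_fields"].get(tool, [])
        if set(fields) != set(seen_now):
            causes.append(f"upstream drift: {tool}")
    return causes


baseline = {
    "model_version": "2026-01-01",
    "system_prompt_sha256": "abc123",
    "tool_response_fields": {"lookup_order": ["order_id", "status", "eta"]},
}
current = {
    "model_version": "2026-01-01",
    "system_prompt_sha256": "abc123",
    # The upstream API renamed "eta" to "estimated_arrival".
    "tool_response_fields": {"lookup_order": ["order_id", "status", "estimated_arrival"]},
}
print(drift_causes(baseline, current))
```

A diff like this cannot see a weight update shipped under an unchanged version string; when pass rate drops and the fingerprint is identical, model drift is the remaining suspect.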

For the deeper view see AI agent failure modes and AI agent reliability testing explained.

Alerting and SLOs

Define SLOs per agent, not per service. An interactive agent SLO might be: P95 end-to-end latency under 8 seconds, pass rate above 92 percent on the eval set, cost per run under twelve cents. A background agent SLO is different: P95 latency under three minutes is fine, pass rate above 95 percent, cost per run under thirty cents.
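
Expressed as configuration, the two example SLOs might look like the sketch below. The class and field names are hypothetical; the numbers are the ones given above.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class AgentSLO:
    p95_latency_s: float         # P95 end-to-end latency must stay under this
    min_pass_rate: float         # eval pass rate must stay above this
    max_cost_per_run_usd: float  # cost per run must stay under this

    def violations(self, p95_latency_s: float, pass_rate: float,
                   cost_per_run_usd: float) -> List[str]:
        """Name the objectives that one measurement window fails to meet."""
        failed: List[str] = []
        if p95_latency_s >= self.p95_latency_s:
            failed.append("latency")
        if pass_rate <= self.min_pass_rate:
            failed.append("pass_rate")
        if cost_per_run_usd >= self.max_cost_per_run_usd:
            failed.append("cost")
        return failed


SLOS: Dict[str, AgentSLO] = {
    "interactive": AgentSLO(p95_latency_s=8.0, min_pass_rate=0.92, max_cost_per_run_usd=0.12),
    "background": AgentSLO(p95_latency_s=180.0, min_pass_rate=0.95, max_cost_per_run_usd=0.30),
}

window = {"p95_latency_s": 9.5, "pass_rate": 0.94, "cost_per_run_usd": 0.10}
for agent_type, slo in SLOS.items():
    print(agent_type, slo.violations(**window))
```

The same measurements fail different objectives depending on the agent type, which is the practical argument for defining SLOs per agent rather than per service.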

Alert on burn rate against the SLO budget, not on raw threshold crossings. Google's SRE book describes this in detail (Google SRE Book, Service Level Objectives). The burn-rate alert is the difference between "we have nine months to fix this" and "we have nine hours".
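
The arithmetic behind a burn-rate alert is short enough to sketch. Burn rate is the observed failure rate divided by the error budget (one minus the SLO target); a burn rate of 1 spends the budget exactly over the SLO period. The 30-day period and the example numbers are assumptions for illustration.

```python
PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed failure rate relative to the error budget (1 - slo_target)."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / (1.0 - slo_target)


def hours_until_budget_exhausted(rate: float, period_hours: float = PERIOD_HOURS) -> float:
    """Time for a full budget to run out if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return period_hours / rate


def paging_threshold(budget_fraction: float, window_hours: float,
                     period_hours: float = PERIOD_HOURS) -> float:
    """Burn rate that spends `budget_fraction` of the budget within `window_hours`."""
    return budget_fraction * period_hours / window_hours


# Pass-rate SLO of 92 percent; 16 of the last 100 sampled runs failed the eval.
rate = burn_rate(bad_events=16, total_events=100, slo_target=0.92)
print(round(rate, 2), hours_until_budget_exhausted(2.0), round(paging_threshold(0.02, 1.0), 1))
```

A threshold derived this way ties the page to how fast the budget is disappearing, so a brief dip that costs almost nothing stays quiet while a sustained one pages early.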

Tooling landscape in 2026

Three tiers of agent observability tooling are mature in 2026.

General-purpose APM with GenAI extensions

Datadog, Honeycomb, New Relic, and Grafana now ingest OpenTelemetry GenAI traces with first-class views. The advantage is unification with the rest of your service telemetry; the trade-off is that agent-specific features (eval scoring, prompt management) are thinner than the specialist tools.

Specialist agent observability

LangSmith, Langfuse, Arize Phoenix, Helicone, and Weights & Biases Weave. Each has its own emphasis. Langfuse is open-source and the cheapest to self-host. LangSmith integrates most tightly with the LangChain ecosystem. Arize is the most mature on evals. Helicone is strongest on simple proxy-based ingestion. Pick by which dimension matters most to your team.

Self-built on OTel + ClickHouse

For teams with deep observability culture, OpenTelemetry plus ClickHouse plus Grafana is the most cost-efficient setup at scale. You write the dashboards, you own the schema, you pay for storage instead of per-event. Worth it past roughly one million runs per month.

Cost math for observability

Observability is not free. At a million runs per month, each producing 12 spans at 2 KB per span, you generate roughly 24 GB of raw span data per month, so 30-day retention holds about that much at steady state, and less once compressed. At specialist-vendor per-event pricing this can run into thousands of dollars; at self-hosted OTel plus ClickHouse it is closer to single digits.
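
The storage figure is a straight multiplication, sketched here so the inputs can be swapped for your own. Multiplying the stated inputs (one million runs, 12 spans, 2 KB per span, 30 days) gives about 24 GB of raw span data before compression. The sketch assumes 1 KB is 1,000 bytes and a 30-day month; the compression ratio is left as a parameter because it depends entirely on the store and the data.

```python
def stored_gb(runs_per_month: float, spans_per_run: int, bytes_per_span: int,
              retention_days: int, sample_rate: float = 1.0,
              compression_ratio: float = 1.0) -> float:
    """Steady-state storage in GB: run rate x span count x bytes per span x retention."""
    runs_per_day = runs_per_month / 30  # assumes a 30-day month
    raw_bytes = runs_per_day * retention_days * spans_per_run * bytes_per_span * sample_rate
    return raw_bytes / compression_ratio / 1e9


raw = stored_gb(runs_per_month=1_000_000, spans_per_run=12,
                bytes_per_span=2_000, retention_days=30)
sampled = stored_gb(runs_per_month=1_000_000, spans_per_run=12,
                    bytes_per_span=2_000, retention_days=30, sample_rate=0.10)
print(round(raw, 1), round(sampled, 1))
```

Retention and sample rate are the two inputs a team can change cheaply, which is why they are the first things to trim when the bill grows.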

The decision is buyer-dependent. A team with five agents and ten thousand runs per month can ship a specialist tool and stop worrying. A team with fifty agents and ten million runs per month will hit a price wall and need to either self-host or trim retention.

For the broader operational cost view see AI agent cost optimization and AI agent cost models explained.

Frequently asked questions

What should I log for an AI agent?

Run id, parent run id, model and version, tool name and arguments hash, output hash, latency, token usage, cost, outcome, and user or tenant identifier. Use the OpenTelemetry GenAI semantic conventions as the schema. Hash payloads rather than logging them raw.

How do I trace a multi-step AI agent?

One trace per run, with one span per LLM call and one per tool call, each parented to the run span. The OpenTelemetry GenAI conventions define standard attributes. Most modern observability vendors (Datadog, Honeycomb, Langfuse, LangSmith, Arize) ingest them.

What are the golden signals for an AI agent?

Latency, traffic, errors, saturation, plus quality (eval pass rate) and cost per run.

How do I detect AI agent quality drift?

Sample a fraction of runs and replay them through an eval suite. Track pass rate as a time series. Alert when pass rate falls below threshold for two consecutive sample windows. Model, prompt, and upstream API changes are the usual causes.

LangSmith vs Langfuse vs Arize, which one?

All ingest OpenTelemetry GenAI in 2026. Langfuse is cheapest to self-host. LangSmith is tightest to LangChain stacks. Arize is most mature on evals.

Three things to instrument this week

Sources