AI Agent Monitoring and Observability: A 2026 Production Playbook

The first time I shipped an agent without proper observability I did not notice quality degradation for nine days. Token costs were stable, latency was fine, error rates were nominal. The agent was answering questions, completing tasks, returning structured output. Then a customer pointed out that the action it was taking on their behalf had subtly changed shape after an upstream API update. By the time we caught it, we had been wrong on roughly four percent of runs for over a week.

That experience is the reason this piece exists. Agent observability is not a port of LLM observability. It is its own discipline, and most production failures I have watched were detectable in the trace if anyone had been looking. This is the schema, the tracing approach, the golden signals, and the alerting policy I would hand to an engineering lead instrumenting their first agent in production in 2026.

Why agent observability is different

A traditional service trace is short, deterministic, and shallow. A GET request goes through a router, a controller, and a database, then returns. An agent run is long (sometimes minutes), non-deterministic (the model picks the path), and deep (each tool call has its own LLM context). Standard APM tooling captures the spans, but what those spans mean is different.

Three properties matter for agents specifically. First, the trace is the work. Unlike a stateless API call, the trace contains the agent's reasoning chain; losing the trace means losing the only record of why the agent did what it did. Second, the output is non-deterministic. Two identical inputs may produce different outputs; observability must account for variance, not just errors. Third, quality is a property of the output, not the call. A 200 response from a tool can still carry wrong content; you cannot infer quality from the status code.

What to log

Start with the OpenTelemetry GenAI semantic conventions as the baseline. They define standard attributes (OpenTelemetry GenAI, retrieved 2026) and most modern observability vendors ingest them directly.

Per-run attributes

run.id and parent.run.id for chaining
tenant.id and user.id for per-customer slicing
agent.id and agent.version for change attribution
outcome (success, partial, failed, halted)
total.cost.usd and total.duration.ms

Per-LLM-call attributes

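gen_ai.system (renamed gen_ai.provider.name in newer releases of the conventions) and gen_ai.request.model
gen_ai.usage.input_tokens and gen_ai.usage.output_tokens, plus a cached-token count as a custom attribute where the provider reports one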
gen_ai.response.finish_reasons
prompt.hash and response.hash (not the raw payloads, see security best practices)

Per-tool-call attributes

tool.name and tool.version
tool.args.hash and tool.output.hash
tool.outcome and tool.error.class
tool.duration.ms
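
A minimal sketch of how the per-tool-call attributes above could be assembled with hashed payloads. It uses only the Python standard library; the helper names (payload_hash, tool_call_attributes) and the example tool are hypothetical, and the choice of SHA-256 over canonical JSON is an assumption rather than something the conventions prescribe.

```python
import hashlib
import json


def payload_hash(payload) -> str:
    """Return a stable SHA-256 hex digest of a JSON-serialisable payload."""
    # sort_keys and fixed separators keep the digest independent of dict ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def tool_call_attributes(name, version, args, output, duration_ms, error=None):
    """Build the per-tool-call attribute dict without storing raw payloads."""
    return {
        "tool.name": name,
        "tool.version": version,
        "tool.args.hash": payload_hash(args),
        "tool.output.hash": payload_hash(output),
        "tool.outcome": "error" if error else "success",
        "tool.error.class": type(error).__name__ if error else "",
        "tool.duration.ms": duration_ms,
    }


attrs = tool_call_attributes(
    name="crm.update_contact",  # hypothetical tool name
    version="1.4.0",
    args={"contact_id": 42, "email": "a@example.com"},
    output={"status": "ok"},
    duration_ms=183,
)
```

One caveat on the design: a plain hash of a short or guessable value can be brute-forced, so a keyed hash (for example Python's hmac module with a secret key) may be the safer choice for low-entropy fields.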

Distributed tracing for agents

One run is one trace. Inside the run, each LLM call and each tool call gets its own span, parented to the run span. Sub-agent calls get their own nested run spans. This shape makes a multi-step agent legible: you can read the trace as a depth-first traversal of the reasoning tree.

Span attributes worth standardising

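The OpenTelemetry GenAI conventions cover the common attributes across model providers, but check their stability status before you pin a version: they have been published under development status, and attribute names have changed between releases (gen_ai.system, for example, was deprecated in favour of gen_ai.provider.name). Span names follow an "{operation} {target}" pattern: "chat {model}" for LLM calls, "execute_tool {tool name}" for tool calls, and "invoke_agent {agent name}" for agent invocations. Use the canonical names; vendor-specific names break portability.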

Sampling strategy

Storing every span is expensive at scale. A common pattern is 100 percent sampling of failed and partial runs, 10 percent of successful runs, plus a separate ring buffer of the most recent 100 runs at full fidelity for debugging. The math (run rate × span count × bytes per span × retention) should drive the sampling rate, not the other way around.
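
A sketch of that sampling policy, decided after the run has finished so the outcome is known. The function names and the run dict shape are hypothetical; the rates and the buffer size come from the pattern above. How to treat halted runs is not specified there, so this sketch samples them like successful runs, which is an assumption.

```python
import random
from collections import deque

RECENT_RUNS = deque(maxlen=100)  # ring buffer: last 100 runs at full fidelity
SUCCESS_SAMPLE_RATE = 0.10


def should_persist(outcome: str, rng=random.random) -> bool:
    """Decide whether a finished run's spans go to long-term storage."""
    if outcome in ("failed", "partial"):
        return True  # keep every failed or partial run
    # Assumption: "halted" runs are sampled at the same rate as successes.
    return rng() < SUCCESS_SAMPLE_RATE


def on_run_finished(run: dict) -> bool:
    """Record the run in the ring buffer and return the persistence decision."""
    RECENT_RUNS.append(run)  # the oldest entry drops out automatically
    return should_persist(run["outcome"])
```

Passing rng as a parameter keeps the decision testable; in production the default random.random is enough.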

Golden signals adapted for agents

Google SRE's golden signals (latency, traffic, errors, saturation) are the right starting point but agents need two more.

Latency

P50, P95, P99 end-to-end. Separately track per-call P95 for the LLM and per-tool P95. End-to-end latency drifts upward when the agent grows a longer reasoning chain; per-call drift signals a model or vendor problem.
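
As a rough illustration, the per-kind percentiles can be computed from exported span durations like this. The span dict shape (kind, duration_ms) is hypothetical, and nearest-rank is only one of several percentile definitions; a metrics backend will usually do this for you with histograms.

```python
import math
from collections import defaultdict


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p percent of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_report(spans):
    """Group span durations by kind ("run", "llm", "tool") and report P50, P95, P99."""
    by_kind = defaultdict(list)
    for s in spans:
        by_kind[s["kind"]].append(s["duration_ms"])
    return {
        kind: {f"p{p}": percentile(durations, p) for p in (50, 95, 99)}
        for kind, durations in by_kind.items()
    }


spans = [{"kind": "llm", "duration_ms": ms} for ms in range(1, 101)]
spans += [{"kind": "run", "duration_ms": 4200}]
report = latency_report(spans)
```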

Traffic

Runs per second, with breakouts by agent, by tenant, by tool. Useful for capacity planning and for spotting attack patterns (a tenant suddenly running 100x normal volume is a signal).

Errors

Tool errors (4xx, 5xx, timeouts), model errors (rate limits, refusals, schema validation failures), and structural errors (the agent returned but the output failed downstream validation). Errors must be classified, not aggregated.
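
One way to keep errors classified rather than aggregated is to map every failure to a class label at the point where it is recorded. The label names and the (source, status, reason) input shape below are hypothetical; the classes mirror the three groups above.

```python
from collections import Counter
from typing import Optional


def classify_error(source: str, status: Optional[int] = None, reason: str = "") -> str:
    """Map a raw failure to one class label instead of a generic 'error'."""
    if source == "tool":
        if reason == "timeout":
            return "tool.timeout"
        if status is not None and 400 <= status < 500:
            return "tool.client_error"
        if status is not None and 500 <= status < 600:
            return "tool.server_error"
        return "tool.unknown"
    if source == "model":
        if reason in ("rate_limit", "refusal", "schema_validation"):
            return f"model.{reason}"
        return "model.unknown"
    if source == "downstream":
        # The agent returned, but its output failed validation further along.
        return "structural.validation_failed"
    return "unclassified"


failures = [
    ("tool", 404, ""),
    ("tool", 503, ""),
    ("tool", 500, ""),
    ("tool", None, "timeout"),
    ("model", None, "rate_limit"),
    ("downstream", None, ""),
]
counts = Counter(classify_error(*f) for f in failures)
```

The resulting label is what would go into tool.error.class, so dashboards can break the error rate out by class.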

Saturation

Queue depth for runs waiting to start, rate-limit headroom against model providers, budget headroom against per-tenant caps. Saturation predicts the next-five-minute failure mode.

Quality

The agent-native fifth signal. Pass rate against a labelled eval set, refreshed via sampling from production. Drift on this signal predicts customer-visible failures before they reach support tickets.

Cost

Cost per run, cost per outcome, and cost per tenant. The accounting signal is operationally meaningful: a 3x spike in cost per run for the same outcome shape is a regression, even if quality and latency hold.
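
A small sketch of the cost-regression check, comparing median cost per run between a baseline window and the current window for the same outcome shape. The function name and the sample figures are hypothetical; the 3x factor is the one mentioned above, and using the median rather than the mean is an assumption made to dampen outliers.

```python
from statistics import median


def cost_regression(baseline_costs, current_costs, factor=3.0):
    """Return (is_regression, ratio) for median cost per run, current versus baseline."""
    ratio = median(current_costs) / median(baseline_costs)
    return ratio >= factor, ratio


# Costs in USD per run for one outcome shape (illustrative numbers).
baseline = [0.04, 0.05, 0.06]
steady = [0.05, 0.055, 0.06]
spiking = [0.19, 0.20, 0.22]

steady_flag, steady_ratio = cost_regression(baseline, steady)
spike_flag, spike_ratio = cost_regression(baseline, spiking)
```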

Quality drift detection

Quality is the signal most teams instrument last and most teams regret instrumenting last. The general pattern: pick a stable representative sample of agent inputs (50-200 per agent), run them through an automatic eval (string match, semantic match against gold, judge model with a rubric), score them, track the pass rate over time. When pass rate falls below threshold for two consecutive sample windows, alert.
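
The alert rule at the end of that pattern can be sketched as a small stateful check. The class name and fields are hypothetical; the two-consecutive-windows rule is the one described above, and the 0.92 threshold in the example is only illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class DriftDetector:
    threshold: float  # minimum acceptable pass rate, e.g. 0.92
    consecutive: int = 2  # how many windows in a row must fall below it
    history: list = field(default_factory=list)

    def record_window(self, passed: int, total: int) -> bool:
        """Record one eval window and return True if the alert should fire."""
        self.history.append(passed / total)
        recent = self.history[-self.consecutive:]
        return len(recent) == self.consecutive and all(
            rate < self.threshold for rate in recent
        )


detector = DriftDetector(threshold=0.92)
alerts = [
    detector.record_window(95, 100),  # healthy
    detector.record_window(90, 100),  # first window below threshold
    detector.record_window(89, 100),  # second in a row: alert
    detector.record_window(95, 100),  # recovered
]
```

Requiring two windows in a row trades a slower alert for fewer false positives from a single noisy sample.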

Causes of drift fall into three buckets. Model drift: provider quietly updated the model weights. Prompt drift: someone edited the system prompt without re-running evals. Upstream drift: a tool's API changed shape (a field renamed, a response truncated) and the agent's parsing now misses the relevant content. The trace tells you which.
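
As a rough sketch of how the trace can point at a bucket, compare a few attributes between a run from the last healthy window and a run from the failing one. gen_ai.response.model is taken from the OpenTelemetry GenAI conventions; system_prompt.hash and tool.output.fields are hypothetical attributes you would have to record yourself, and the check only catches model changes that the provider surfaces in the reported model name.

```python
def drift_causes(baseline: dict, current: dict) -> list:
    """Return the drift buckets whose marker attribute changed between two runs."""
    checks = [
        ("model drift", "gen_ai.response.model"),
        ("prompt drift", "system_prompt.hash"),  # hypothetical attribute
        ("upstream drift", "tool.output.fields"),  # hypothetical attribute
    ]
    return [label for label, key in checks if baseline.get(key) != current.get(key)]


healthy_run = {
    "gen_ai.response.model": "example-model-2025-01",
    "system_prompt.hash": "ab12",
    "tool.output.fields": ("id", "status", "total"),
}
failing_run = {
    "gen_ai.response.model": "example-model-2025-01",
    "system_prompt.hash": "ab12",
    "tool.output.fields": ("id", "state", "total"),  # a field was renamed upstream
}
causes = drift_causes(healthy_run, failing_run)
```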

For the deeper view see AI agent failure modes and AI agent reliability testing explained.

Alerting and SLOs

Define SLOs per agent, not per service. An interactive agent SLO might be: P95 end-to-end latency under 8 seconds, pass rate above 92 percent on the eval set, cost per run under twelve cents. A background agent SLO is different: P95 latency under three minutes is fine, pass rate above 95 percent, cost per run under thirty cents.
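
Written down as data, the two example SLOs above might look like this. The AgentSLO class and its field names are hypothetical; the targets are the ones quoted in the paragraph.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSLO:
    p95_latency_s: float  # P95 end-to-end latency must stay under this
    min_pass_rate: float  # eval pass rate must stay above this
    max_cost_per_run_usd: float  # cost per run must stay under this

    def violations(self, p95_latency_s, pass_rate, cost_per_run_usd):
        """Return the names of the objectives the observed values miss."""
        missed = []
        if p95_latency_s >= self.p95_latency_s:
            missed.append("latency")
        if pass_rate <= self.min_pass_rate:
            missed.append("quality")
        if cost_per_run_usd >= self.max_cost_per_run_usd:
            missed.append("cost")
        return missed


INTERACTIVE = AgentSLO(p95_latency_s=8.0, min_pass_rate=0.92, max_cost_per_run_usd=0.12)
BACKGROUND = AgentSLO(p95_latency_s=180.0, min_pass_rate=0.95, max_cost_per_run_usd=0.30)

# The same observed numbers are fine for one agent and a miss for the other.
observed = dict(p95_latency_s=45.0, pass_rate=0.96, cost_per_run_usd=0.20)
interactive_missed = INTERACTIVE.violations(**observed)
background_missed = BACKGROUND.violations(**observed)
```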

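Alert on burn rate against the SLO budget, not on raw threshold crossings. Google's SRE material covers this in detail: the SRE Book's Service Level Objectives chapter introduces error budgets, and the SRE Workbook's chapter on alerting on SLOs works through burn-rate alerts (Google SRE Book, Service Level Objectives). The burn-rate alert is the difference between "we have nine months to fix this" and "we have nine hours".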
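
For readers new to the idea, burn rate is the observed failure rate divided by the error budget the SLO allows. A minimal sketch follows; the function names are hypothetical, the 30-day SLO window is an assumption, and the 14.4 paging threshold is the figure commonly quoted for a one-hour window against a 30-day budget, so treat it as a starting point rather than a rule.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed failure rate divided by the failure rate the SLO allows."""
    error_budget = 1.0 - slo_target  # e.g. 0.08 for a 92 percent pass-rate SLO
    return (bad_events / total_events) / error_budget


def hours_to_exhaustion(rate: float, slo_window_days: int = 30) -> float:
    """How long the whole budget lasts if the current burn rate continues."""
    return slo_window_days * 24 / rate


def should_page(long_window_rate: float, short_window_rate: float,
                threshold: float = 14.4) -> bool:
    """Page only when both a long and a short window burn fast."""
    # The short window confirms the problem is still happening right now.
    return long_window_rate >= threshold and short_window_rate >= threshold


# 16 failed evals out of 100 against a 92 percent SLO burns budget at about 2x.
rate = burn_rate(bad_events=16, total_events=100, slo_target=0.92)
remaining_hours = hours_to_exhaustion(rate)
```

At a burn rate of 1 the budget lasts exactly the SLO window; at roughly 2, as in the example, it lasts about half of it.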

Tooling landscape in 2026

Three tiers of agent observability tooling are mature in 2026.

General-purpose APM with GenAI extensions

Datadog, Honeycomb, New Relic, and Grafana now ingest OpenTelemetry GenAI traces with first-class views. The advantage is unification with the rest of your service telemetry; the trade-off is that agent-specific features (eval scoring, prompt management) are thinner than the specialist tools.

Specialist agent observability

LangSmith, Langfuse, Arize Phoenix, Helicone, and Weights & Biases Weave. Each has its own emphasis. Langfuse is open-source and the cheapest to self-host. LangSmith is tightest with the LangChain ecosystem. Arize is the most mature on evals. Helicone is strongest on simple proxy-based ingestion. Pick by which dimension matters most to your team.

Self-built on OTel + ClickHouse

For teams with deep observability culture, OpenTelemetry plus ClickHouse plus Grafana is the most cost-efficient setup at scale. You write the dashboards, you own the schema, you pay for storage instead of per-event. Worth it past roughly one million runs per month.

Cost math for observability

Observability is not free. At a million runs per month, each producing 12 spans at 2 KB per span, you generate roughly 24 GB of raw span data per month; with 30-day retention that is also roughly what you hold at any one time, and less after compression. At specialist-vendor per-event pricing this can run into thousands of dollars; at self-hosted OTel plus ClickHouse the storage bill is closer to single-digit dollars.
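
The raw volume implied by those inputs is easy to check. This sketch takes 2 KB as 2,000 bytes and makes compression an explicit parameter; the function name is hypothetical and the 5x compression ratio in the example is purely illustrative, since real ratios depend on the data and the storage engine.

```python
def stored_bytes(runs_per_month, spans_per_run, bytes_per_span,
                 retention_days=30, compression_ratio=1.0):
    """Steady-state span storage: monthly volume scaled by retention and compression."""
    monthly = runs_per_month * spans_per_run * bytes_per_span
    return monthly * (retention_days / 30) / compression_ratio


GB = 1e9

# One million runs per month, 12 spans per run, 2 KB per span, 30-day retention.
raw_gb = stored_bytes(1_000_000, 12, 2_000) / GB

# Same workload with an illustrative 5x compression ratio (assumption).
compressed_gb = stored_bytes(1_000_000, 12, 2_000, compression_ratio=5.0) / GB

# Ten million runs per month, uncompressed.
large_gb = stored_bytes(10_000_000, 12, 2_000) / GB
```

The same function also makes the trade-off in the next paragraph concrete: halving retention_days halves the stored volume.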

The decision is buyer-dependent. A team with five agents and ten thousand runs per month can ship a specialist tool and stop worrying. A team with fifty agents and ten million runs per month will hit a price wall and need to either self-host or trim retention.

For the broader operational cost view see AI agent cost optimization and AI agent cost models explained.

Frequently asked questions

What should I log for an AI agent?

Run id, parent run id, model and version, tool name and arguments hash, output hash, latency, token usage, cost, outcome, and user or tenant identifier. Use the OpenTelemetry GenAI semantic conventions as the schema. Hash payloads rather than logging them raw.

How do I trace a multi-step AI agent?

One span per LLM call and one per tool call, parented to a run span. The OpenTelemetry GenAI conventions define standard attributes. Most modern observability vendors (Datadog, Honeycomb, Langfuse, LangSmith, Arize) ingest them.

What are the golden signals for an AI agent?

Latency, traffic, errors, saturation, plus quality (eval pass rate) and cost per run.

How do I detect AI agent quality drift?

Sample a fraction of production runs (typically 1-5 percent) and replay them through an eval suite. Track pass rate as a time series. Alert when the rolling pass rate falls below threshold. Model, prompt, and upstream API changes are the usual causes.

LangSmith vs Langfuse vs Arize, which one?

All ingest OpenTelemetry GenAI in 2026. Langfuse is cheapest to self-host. LangSmith is tightest to LangChain stacks. Arize is most mature on evals.

Three things to instrument this week

One trace per run with OpenTelemetry GenAI attributes. Even imperfect tracing beats no tracing.
Quality sample-and-replay on a labelled set. Start with 50 examples; expand from there.
Burn-rate alert on the most important SLO. End-to-end latency or pass rate.

Sources

OpenTelemetry, "GenAI semantic conventions", retrieved 2026, opentelemetry.io
Google SRE, "Service Level Objectives", sre.google
Anthropic, "Building Effective Agents", retrieved 2026, anthropic.com
Langfuse, "Open-source LLM engineering platform", 2026, langfuse.com
Arize AI, "Phoenix observability", 2026, arize.com