Production Engineering

How to Monitor AI Agents in Production

Uptime monitoring is not enough. Here's what you actually need to track, why agent failures are mostly silent, and which tools the industry uses today.

March 6, 2026 · 12 min read


For CTOs, VPs of Engineering & IT leaders · 12 min read

Why monitoring an AI agent is different

Traditional monitoring is built around a simple contract: the system either works or it doesn't. A server is up or down. An API returns 200 or 500. Alerts fire, someone fixes it.

AI agents break this contract. An agent can be fully available — no crashes, no timeouts, no error codes — while producing wrong answers, calling the wrong tool, or fabricating information. From an infrastructure perspective, everything looks healthy. From a user perspective, the agent is broken.

The silent failure problem. The biggest production incidents with agents don't throw exceptions. They look like: a confident answer that's factually wrong, a tool call that partially succeeded, a workflow that loops until it hits a timeout. None of these trigger a standard alert.

This is why the AI industry has converged on a broader concept than monitoring: observability. The goal isn't just to know if the agent is running — it's to understand what it's doing, step by step, and whether it's doing it correctly.

What to track: the five layers

A production AI agent generates several distinct types of telemetry. You need all of them — each layer reveals failures that the others miss.

1. Traces

A trace is the complete execution record of one agent interaction: every step, every decision, every tool call, every intermediate output, with timestamps. For a multi-step agent, a single user request can trigger dozens of internal operations. Without traces, when something goes wrong you have no way to know at which step it happened or why.

What good tracing looks like: you can replay any past interaction exactly as it happened, inspect each step in isolation, and compare the execution path when the agent worked correctly versus when it failed.
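Conceptually, a trace is just a structured record of every step with its inputs, output, and timestamp. The toy sketch below illustrates what that record captures; a real deployment would use OpenTelemetry rather than hand-rolled classes, and the step names and fields here are invented for illustration:

```python
# Toy trace recorder: illustrates what a production trace captures.
# Real systems would use OpenTelemetry; this is a stdlib-only sketch.
import time
from dataclasses import dataclass, field

@dataclass
class Trace:
    request: str
    steps: list = field(default_factory=list)

    def record(self, name: str, inputs, output):
        # Each step keeps its own inputs, output, and timestamp,
        # so a failed interaction can be replayed step by step.
        self.steps.append({
            "step": name,
            "inputs": inputs,
            "output": output,
            "ts": time.time(),
        })

trace = Trace(request="What is our refund policy?")
trace.record("tool.search", {"query": "refund policy"}, "doc#42")
trace.record("llm.generate", {"context": "doc#42"}, "Refunds within 30 days.")
```

Because every step is recorded with its inputs, comparing a failing trace against a succeeding one pinpoints the exact step where the execution paths diverged.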

2. Quality metrics

This is what separates AI monitoring from infrastructure monitoring. You need to measure whether the agent's outputs are actually correct — not just fast and available.

  • Task completion rate — did the agent accomplish what the user asked?
  • Hallucination detection — did the agent produce claims not grounded in its sources or tool outputs? Typically measured via automated "LLM-as-judge" evaluation on sampled traffic.
  • Tool selection quality — did the agent call the right tool, with the right parameters?
  • Instruction adherence — did the agent follow its system prompt, formatting rules, and policy constraints?
  • Multi-turn consistency — does the agent contradict itself across conversation turns?
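Metrics like these are typically computed by sampling production traces and aggregating judge verdicts. A minimal sketch of that pipeline, where `judge()` is a placeholder for a real LLM-as-judge call and the verdict fields are invented for illustration:

```python
# Sketch: sample production traces and aggregate judge verdicts
# into quality metrics. judge() is a placeholder stand-in.
import random

def judge(trace: dict) -> dict:
    # Placeholder: a real judge would prompt an LLM with the trace
    # and a rubric, returning a pass/fail verdict per criterion.
    return {"task_completed": True, "grounded": trace["cited_sources"]}

def quality_report(traces: list, sample_rate: float = 0.1, seed: int = 0) -> dict:
    rng = random.Random(seed)
    # Judge only a sample of traffic to keep evaluation cost bounded.
    sample = [t for t in traces if rng.random() < sample_rate]
    verdicts = [judge(t) for t in sample]
    n = len(verdicts) or 1
    return {
        "sampled": len(verdicts),
        "task_completion_rate": sum(v["task_completed"] for v in verdicts) / n,
        "grounded_rate": sum(v["grounded"] for v in verdicts) / n,
    }
```

The sample rate is the main cost lever: judging 5–10% of traffic is usually enough to track trends without paying for an extra LLM call on every request.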

3. Latency — by percentile, not average

Average latency hides the problem. A multi-step agent might respond in 800ms most of the time, but take 15 seconds for complex queries. The users who experience those 15-second waits drive complaints and churn — the average never shows it.

Track p50, p95, and p99. The p99 (the slowest 1% of requests) is what defines the worst-case user experience. Set alerts on p95 and p99, not on averages.
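The numbers above make the point concretely. In the sketch below (synthetic latencies, nearest-rank percentile), 95 requests at 0.8s and 5 at 15s average to about 1.5s, which looks fine, while the p99 exposes the 15-second tail:

```python
# Sketch: why percentiles beat averages for agent latency.
# Synthetic data: 95 fast requests at 0.8s, 5 slow ones at 15s.
import statistics

latencies = [0.8] * 95 + [15.0] * 5

def percentile(values, p):
    # Nearest-rank percentile: the value below which roughly
    # p% of requests fall.
    ranked = sorted(values)
    k = max(0, int(round(p / 100 * len(ranked))) - 1)
    return ranked[k]

avg = statistics.mean(latencies)   # ~1.51s: looks healthy
p95 = percentile(latencies, 95)    # 0.8s
p99 = percentile(latencies, 99)    # 15.0s: the real worst case
```

An alert on the average would never fire here; an alert on p99 fires immediately.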

4. Cost per request

Token costs are not evenly distributed. A small proportion of requests typically accounts for a disproportionate share of your LLM spend. Without per-request cost tracking, you can't identify which queries, workflows, or user segments are burning your budget — and you can't optimize.

Track cost at the trace level, broken down by model, endpoint, and if possible by user segment or workflow type.
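Per-trace cost is just the sum of the token cost of every LLM call inside the trace. A minimal sketch, where the model names and per-token prices are hypothetical placeholders, not real provider rates:

```python
# Sketch: per-trace cost from token counts. Model names and prices
# are illustrative placeholders, not real provider rates.
PRICE_PER_1K = {  # USD per 1,000 tokens (hypothetical)
    "big-model":   {"input": 0.0030, "output": 0.0150},
    "small-model": {"input": 0.0002, "output": 0.0008},
}

def trace_cost(spans: list) -> float:
    # Sum the cost of every LLM call recorded inside one trace.
    total = 0.0
    for s in spans:
        p = PRICE_PER_1K[s["model"]]
        total += s["input_tokens"] / 1000 * p["input"]
        total += s["output_tokens"] / 1000 * p["output"]
    return total

spans = [
    {"model": "big-model",   "input_tokens": 2000, "output_tokens": 500},
    {"model": "small-model", "input_tokens": 1000, "output_tokens": 200},
]
cost = trace_cost(spans)
```

Attaching this number to each trace is what makes it possible to group spend by workflow, user segment, or query type and find the expensive tail.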

5. Drift over time

An agent that performs well at launch can degrade over weeks without any code change. Reasons include: changes in how users phrase requests, upstream data quality shifts, model provider updates, or subtle prompt regressions after a deployment. Without longitudinal quality tracking, drift is invisible until it's severe.

Run automated quality evaluations continuously on sampled production traffic, and compare scores week-over-week. A consistent downward trend is a signal to act before users notice.
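A simple way to operationalize "consistent downward trend" is to alert only when the score declines for several consecutive weeks by more than a noise tolerance. A sketch, with the window and tolerance values as illustrative defaults:

```python
# Sketch: flag drift when weekly quality scores trend down.
def drifting(weekly_scores, window=3, tolerance=0.005):
    # True if the score declined week-over-week for `window`
    # consecutive weeks, each drop larger than `tolerance` noise.
    recent = weekly_scores[-(window + 1):]
    if len(recent) < window + 1:
        return False
    return all(b < a - tolerance for a, b in zip(recent, recent[1:]))
```

A single bad week stays silent; three declines in a row trigger investigation before users notice.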

How agent failures actually look in production

Understanding the failure modes helps you set up the right alerts. Agent failures in production tend to fall into a few recurring patterns:

The wrong tool, confidently called — The agent selects a plausible-looking tool but the wrong one for the task. The call succeeds (no error), it returns data, and the agent builds its response on that data — which is irrelevant or misleading. The entire downstream output is flawed, but nothing in your infrastructure logs flags it.

Infinite loops — An agent retries a failed operation repeatedly, or continues processing a task that was already completed. This burns compute and token budget silently, and can corrupt data through duplicate operations. Define explicit termination conditions and set circuit breakers on retry loops.

Context loss in multi-turn conversations — In longer sessions, the agent loses track of constraints or prior decisions established earlier in the conversation. It starts contradicting itself or ignoring instructions it acknowledged a few turns back. This is hard to catch with per-request monitoring — it only shows up in session-level analysis.

Prompt drift after deployment — A prompt change that looked fine in testing degrades performance on a class of production queries that wasn't represented in the test set. This shows up as a gradual decline in quality scores for a specific intent type — catchable with segment-level evaluation, invisible with aggregate metrics.
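Catching that kind of regression requires breaking evaluation scores down by segment rather than averaging over all traffic. A toy sketch (intent labels and scores are invented) showing how an aggregate score can mask a degraded segment:

```python
# Sketch: break quality scores down by intent segment, so a
# regression on one query type isn't masked by the aggregate.
from collections import defaultdict
from statistics import mean

def scores_by_segment(evals):
    # evals: [{"intent": ..., "score": ...}, ...] from sampled traffic
    buckets = defaultdict(list)
    for e in evals:
        buckets[e["intent"]].append(e["score"])
    return {intent: round(mean(s), 3) for intent, s in buckets.items()}

evals = [
    {"intent": "refund", "score": 0.62},  # degraded after prompt change
    {"intent": "refund", "score": 0.58},
    {"intent": "faq",    "score": 0.95},
    {"intent": "faq",    "score": 0.93},
]
# Aggregate mean is 0.77, which looks acceptable;
# the refund segment at 0.60 is not.
```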

The tools the industry uses today

The observability ecosystem for AI agents has matured significantly. OpenTelemetry has emerged as the industry standard for collecting telemetry — it's vendor-neutral, which means your trace data stays portable across tools. Most major frameworks (LangChain, CrewAI, OpenAI Agents SDK) emit OpenTelemetry-compatible traces natively.

On top of that foundation, several purpose-built platforms have emerged:

  • Langfuse (open-source, self-hostable): Full trace replay, prompt versioning, cost tracking, LLM-as-judge evaluations. Standard choice for teams that want data sovereignty and full control.

  • Arize Phoenix (open-source, cloud option): Strong on drift detection and embedding monitoring. Good for teams that need to track model-level performance degradation alongside agent behavior.

  • LangSmith (LangChain ecosystem): Deep integration for LangChain/LangGraph stacks. Execution graph visualization, prompt comparison, dataset-based regression testing.

  • Datadog LLM Observability (enterprise, full-stack): Connects AI monitoring to your existing infrastructure observability. Best for teams already on Datadog who want unified dashboards across infra and agents.

All four support OpenTelemetry as a data source, so you're not locked in. The practical choice depends on whether you prioritize data control (Langfuse, Arize), ecosystem fit (LangSmith), or infra consolidation (Datadog).

Monitoring for compliance and governance

For teams in regulated industries — finance, healthcare, legal, HR — monitoring isn't just an operational concern. It's a legal one.

An AI agent that influences decisions (a loan recommendation, a candidate screening, a customer response) needs an audit trail that can answer: what did the agent receive as input, what did it output, which tools did it call, and what model version was running at the time? Without this, you can't respond to a compliance inquiry or a regulatory audit.

This means monitoring infrastructure needs to capture and store, in a tamper-evident way: full input/output logs with timestamps, model version and configuration at time of execution, tool calls and their results, and any human approval or override events.
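One common way to make a log tamper-evident is a hash chain: each entry's hash covers the previous entry's hash, so editing any past record invalidates everything after it. A stdlib sketch, with the entry fields invented for illustration:

```python
# Sketch: tamper-evident audit trail via a hash chain. Editing any
# past record breaks verification of the chain from that point on.
import hashlib
import json

def append_entry(log, entry):
    prev = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps(entry, sort_keys=True)
    digest = hashlib.sha256((prev + payload).encode()).hexdigest()
    log.append({"entry": entry, "hash": digest})

def verify(log):
    prev = "0" * 64
    for rec in log:
        payload = json.dumps(rec["entry"], sort_keys=True)
        if hashlib.sha256((prev + payload).encode()).hexdigest() != rec["hash"]:
            return False
        prev = rec["hash"]
    return True

log = []
append_entry(log, {"input": "loan query", "output": "approve",
                   "model": "model-v3", "tool_calls": ["credit_check"]})
append_entry(log, {"input": "override to deny", "actor": "human"})
```

Production systems would add secure timestamping and write-once storage on top, but the chained-hash core is what lets an auditor detect after-the-fact edits.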

As Azure AI Foundry's team notes: traditional observability covers metrics, logs, and traces. Agent observability adds two layers on top: evaluations (did the agent achieve the right outcome?) and governance (did it operate within policy and compliance constraints?). Both are needed for production in regulated environments.

A practical setup to start with

If you're instrumenting a production agent for the first time, here's a reasonable sequence that doesn't require months of infrastructure work:

  • Day 1: Traces. Instrument your agent with OpenTelemetry or plug into Langfuse directly. Make sure every execution generates a trace with inputs, outputs, tool calls, and latency per step. This alone gives you the ability to debug failures.

  • Week 1: Latency and cost dashboards. Set up per-request cost tracking and p95/p99 latency monitoring. Set alerts for cost anomalies (sudden spikes in token spend) and for latency regressions after deployments.

  • Week 2: Quality evaluations. Define 3–5 evaluation criteria specific to your use case (relevance, factual grounding, policy adherence). Run them automatically on a sample of production traffic. Establish a baseline score.

  • Month 1: Drift monitoring. Compare quality scores week-over-week. Add segment-level breakdowns (by intent type, user segment, or workflow) to catch regressions that don't show up in aggregate metrics.

  • Ongoing: Audit trail. If you're in a regulated context, ensure logs are stored with version context (model, prompt hash, config) and are accessible for compliance review.

Monitoring built in, not bolted on

Origin 137 includes traces, cost dashboards, and quality observability as native features — no separate instrumentation needed. Every agent execution is logged with full audit trail from day one.

Start free — no card required

Sources

  • OpenTelemetry, AI Agent Observability: Evolving Standards and Best Practices, 2025
  • Microsoft Azure, Agent Factory: Top 5 agent observability best practices, Sept. 2025
  • UptimeRobot, AI Agent Monitoring: Best Practices, Tools and Metrics, 2026
  • Stack AI, The Complete Guide to AI Agent Observability and Monitoring, 2025
  • Vellum, Understanding your agent's behavior in production, Sept. 2025
