Agent Observability
You can't fix what you can't see. Learn how to trace, debug, and monitor AI agents in production — from structured logging to real-time dashboards and cost tracking.
What is Agent Observability?
Why logging isn't enough for AI agents
Agent observability is the ability to understand what your AI agent is doing, why it made specific decisions, and how it performs — without deploying new code. It goes far beyond traditional logging.
Traditional software is deterministic: the same input produces the same output, and a stack trace tells you exactly where things broke. AI agents are non-deterministic: the same input can produce different outputs, failures are often subtle (wrong answer, not crash), and the root cause might be in the context the model saw, not in the code.
Traditional Software Monitoring
- Errors throw exceptions with stack traces
- Same input always produces same output
- Failures are binary: works or crashes
- Response time is the main performance metric
- Debugging = reading code + stack trace
Agent Observability
- Failures are often subtle (wrong answer, not crash)
- Same input can produce different outputs each time
- Quality exists on a spectrum, not binary
- Latency, tokens, cost, and correctness all matter
- Debugging = understanding what the model saw
Key Insight
When an AI agent produces a bad result, the question isn't “what line of code failed?” — it's “what did the model see when it made that decision?” Agent observability gives you the tools to answer that question.
The Three Pillars
Traces
The full execution path of an agent request — from user input through LLM calls, tool executions, and sub-agent invocations to final output. Traces show you the 'what happened' of every request.
Metrics
Quantitative measurements aggregated over time — latency distributions, token usage, error rates, costs. Metrics show you the 'how is it performing' across all requests.
Logs
Structured event records with context — decisions made, tools selected, errors encountered. Logs show you the 'why did it do that' for individual events.
Common Blind Spots Without Observability
Hallucination in the middle of a chain
The LLM hallucinates a fact in step 2 of a 5-step execution. Steps 3-5 build on this false premise. Without tracing, the final output looks wrong, but you can't tell which step introduced the error.
Silent tool failures
A tool returns an empty result instead of throwing an error. The agent continues with no data, producing a vague or generic response. Without observability, this looks like a model quality issue, not a tool failure.
Context window overflow
The agent's conversation history grows past the effective attention window. The model starts ignoring earlier instructions. Performance degrades gradually — no error, no crash, just silently worse outputs.
Cost spikes from retry loops
A tool validation failure triggers a retry loop. The agent makes 15 LLM calls for a single user request, consuming $2 in tokens instead of the expected $0.04. Without cost tracking, this goes unnoticed until the bill arrives.
“Observability is not about collecting data. It's about being able to ask arbitrary questions about your system without deploying new code.”
Charity Majors
CTO, Honeycomb
“The hardest bugs in AI systems are the ones where the model confidently does the wrong thing. Without tracing, you'll never know why.”
Harrison Chase
CEO, LangChain
“If you can't see what your agent is doing at every step, you don't have a production system. You have a demo.”
Jason Liu
Creator, Instructor (structured outputs)
Tracing Agent Execution
Following the chain from input to output
Tracing is the backbone of agent observability. A trace captures the entire execution path of a single request — every LLM call, tool execution, and decision point — as a tree of spans with precise timing and metadata.
Without tracing, a multi-step agent is a black box. You see the input and output, but not the 5-10 intermediate steps that produced the result. When something goes wrong, you're guessing which step failed.
Core Concepts
Root Span
The top-level span representing the entire agent execution. Every trace has exactly one root span. Its duration is the total request latency.
Child Span
A sub-operation within a parent span. LLM calls, tool executions, and retrieval steps are typically child spans of the root.
Span Attributes
Key-value metadata attached to a span: model name, token counts, tool name, success/failure, latency, and custom business context.
Span Events
Timestamped log entries within a span. Useful for recording specific moments like 'rate limited, retrying' or 'context window at 80% capacity'.
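A minimal sketch of how these four concepts map onto the OpenTelemetry API; the span names and attribute keys are illustrative rather than a required schema.
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("agent-tracer");

// Root span: one per request; its duration is the total request latency.
async function handleQueryTraced(query: string) {
  return tracer.startActiveSpan("agent.handle_query", async (rootSpan) => {
    rootSpan.setAttribute("queryLength", query.length); // attribute: key-value metadata

    // Child span: a sub-operation automatically nested under the root.
    await tracer.startActiveSpan("agent.step.llm_call", async (childSpan) => {
      childSpan.setAttribute("model", "gpt-4o");
      // Event: a timestamped moment recorded inside the span's lifetime.
      childSpan.addEvent("rate_limited_retrying", { attempt: 1 });
      childSpan.end();
    });

    rootSpan.end();
  });
}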
Anatomy of a Trace Tree
A trace tree shows the parent-child relationships between spans. Each span has a name, duration, and attributes. The tree structure reveals the execution flow and where time is spent.
// A typical agent trace tree structure
{
"traceId": "abc-123-def-456",
"rootSpan": {
"name": "agent.handle_query",
"duration": "3,200ms",
"attributes": { "userId": "usr_42", "model": "gpt-4o" },
"children": [
{
"name": "agent.plan",
"duration": "450ms",
"attributes": { "strategy": "decompose", "steps": 3 }
},
{
"name": "agent.step.retrieve",
"duration": "320ms",
"attributes": { "tool": "vector_search", "results": 5 }
},
{
"name": "agent.step.llm_call",
"duration": "1,800ms",
"attributes": {
"model": "gpt-4o",
"inputTokens": 4200,
"outputTokens": 850,
"cost": "$0.053"
}
},
{
"name": "agent.step.tool_call",
"duration": "280ms",
"attributes": {
"tool": "calculator",
"input": "revenue * 0.15",
"success": true
}
},
{
"name": "agent.synthesize",
"duration": "350ms",
"attributes": { "outputTokens": 200 }
}
]
}
}
Reading a trace tree
This trace took 3,200ms total. The LLM call consumed 56% of the time (1,800ms). The planning step generated 3 steps. Vector search found 5 results. The calculator tool succeeded. Total cost was $0.053. If the final answer is wrong, you can pinpoint exactly which step introduced the error.
Trace Waterfall (visual representation)
The waterfall view instantly reveals that the LLM call dominates latency. Optimizing the LLM step (prompt caching, smaller context, cheaper model) would have the biggest impact.
Implementing Traces with OpenTelemetry
OpenTelemetry (OTEL) is the vendor-neutral standard for distributed tracing. It works with every observability backend — Jaeger, Grafana Tempo, Honeycomb, Langfuse, and more. Here's a production setup for Node.js agents.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { trace, SpanStatusCode, type Span } from "@opentelemetry/api";
// Initialize OpenTelemetry SDK
const sdk = new NodeSDK({
resource: new Resource({
"service.name": "my-agent",
"service.version": "1.0.0",
"deployment.environment": "production",
}),
traceExporter: new OTLPTraceExporter({
url: "https://otel-collector.example.com/v1/traces",
}),
});
sdk.start();
// Create a tracer for agent operations
const tracer = trace.getTracer("agent-tracer");
// Helper to wrap agent operations in spans
export async function withSpan<T>(
name: string,
attributes: Record<string, string | number | boolean>,
fn: (span: Span) => Promise<T>,
): Promise<T> {
return tracer.startActiveSpan(name, async (span) => {
span.setAttributes(attributes);
try {
const result = await fn(span);
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error instanceof Error ? error.message : "Unknown error",
});
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
});
}
// Usage in agent code
async function handleQuery(query: string, userId: string) {
return withSpan("agent.query", { userId, queryLength: query.length }, async (rootSpan) => {
// Plan step — automatically a child of rootSpan
const plan = await withSpan("agent.plan", { query }, async (planSpan) => {
const result = await planner.createPlan(query);
planSpan.setAttribute("stepCount", result.steps.length);
return result;
});
// Execute each step
for (const [i, step] of plan.steps.entries()) {
await withSpan(
`agent.step.${step.type}`,
{ stepIndex: i, tool: step.tool ?? "none" },
async (stepSpan) => {
const result = await executeStep(step);
stepSpan.setAttribute("success", result.success);
stepSpan.setAttribute("outputTokens", result.tokens ?? 0);
return result;
},
);
}
});
}
Tracing Best Practices
- Name spans by operation, not implementation: Use "agent.plan" not "callGPT4ForPlanning". You want stable names even when you swap models.
- Always record token counts and model name on LLM spans: This is essential for cost attribution and performance analysis.
- Record tool inputs and outputs (truncated) on tool spans: When a tool returns bad data, you need to see what it received and returned.
- Use semantic conventions: Follow the OpenTelemetry semantic conventions for GenAI as they stabilize (a sketch of the current draft attribute names follows this list). This ensures compatibility across tools.
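A sketch of those draft attribute names on an LLM span. The gen_ai.* keys follow the incubating OpenTelemetry GenAI conventions and may change before they stabilize; llm and Message are the same placeholders used in the surrounding examples.
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("agent-tracer");

async function tracedLlmCall(messages: Message[]) {
  return tracer.startActiveSpan("agent.step.llm_call", async (span) => {
    const response = await llm.generate({ model: "gpt-4o", messages });
    // Draft GenAI semantic-convention attributes (subject to change upstream).
    span.setAttributes({
      "gen_ai.operation.name": "chat",
      "gen_ai.request.model": "gpt-4o",
      "gen_ai.usage.input_tokens": response.usage.prompt_tokens,
      "gen_ai.usage.output_tokens": response.usage.completion_tokens,
    });
    span.end();
    return response;
  });
}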
Structured Logging
Logs that machines and humans can read
Structured logging means emitting logs as typed, machine-parseable JSON objects instead of freeform text strings. Every log entry has a consistent schema with fields for trace ID, event type, timestamps, and relevant metadata.
The difference is stark: console.log("Tool failed: search") is a dead end for debugging. {"event":"tool.error","tool":"search","error":"timeout","latencyMs":5000,"traceId":"abc"} can be queried, aggregated, alerted on, and correlated with other events.
Unstructured (text strings)
Can't filter, can't aggregate, can't correlate across requests. Finding "all requests slower than 5 seconds" means parsing text.
Structured (JSON objects)
Every field is queryable. Find slow requests: latencyMs > 5000. Cost by model: sum(cost) group by model.
Log Levels for AI Agents
Traditional log levels need reinterpretation for agents. A "warning" isn't just "something might be wrong" — it's "the agent is degrading in a way that affects user experience."
ERROR
When to use: Unrecoverable failures — LLM API errors, tool crashes, rate limit exhaustion, context window overflow
WARN
When to use: Degraded behavior — slow responses, high token usage, tool retries, fallback to cheaper model, approaching context limits
INFO
When to use: Normal operations — request start/end, LLM call complete, tool execution success, agent step transitions
DEBUG
When to use: Detailed diagnostics — full prompts, retrieved documents, model parameters, embedding scores (disabled in production unless sampling; see the sampling sketch below)
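A minimal sketch of request-level DEBUG sampling with Pino; the 5% rate and field names are illustrative choices, not a standard.
import pino from "pino";

const baseLogger = pino({ level: "info" });
const DEBUG_SAMPLE_RATE = 0.05; // illustrative: enable DEBUG for roughly 5% of requests

// Returns a request-scoped logger; sampled requests also emit debug-level entries.
export function createSampledLogger(traceId: string) {
  const debugSampled = Math.random() < DEBUG_SAMPLE_RATE;
  return baseLogger.child(
    { traceId, debugSampled },
    { level: debugSampled ? "debug" : "info" },
  );
}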
Implementation with Pino
Pino is one of the fastest structured logging libraries for Node.js, with minimal overhead in production. Use child loggers to propagate trace context without manually passing IDs everywhere.
import pino from "pino";
// Create a base logger with default fields
const baseLogger = pino({
level: process.env.LOG_LEVEL ?? "info",
formatters: {
level: (label) => ({ level: label }),
},
base: {
service: "my-agent",
version: "1.0.0",
environment: process.env.NODE_ENV,
},
});
// Create a request-scoped logger with trace context
export function createRequestLogger(traceId: string, userId?: string) {
return baseLogger.child({
traceId,
userId,
requestStartedAt: Date.now(),
});
}
// Usage in agent code
async function handleRequest(input: string) {
const traceId = crypto.randomUUID();
const log = createRequestLogger(traceId, "usr_42");
log.info({
event: "agent.request.start",
inputLength: input.length,
inputPreview: input.slice(0, 100),
});
const llmStart = performance.now();
const response = await llm.generate({
model: "gpt-4o",
messages: [{ role: "user", content: input }],
});
log.info({
event: "llm.call.complete",
model: "gpt-4o",
latencyMs: Math.round(performance.now() - llmStart),
inputTokens: response.usage.prompt_tokens,
outputTokens: response.usage.completion_tokens,
totalTokens: response.usage.total_tokens,
finishReason: response.choices[0].finish_reason,
});
if (response.usage.total_tokens > 10000) {
log.warn({
event: "llm.high_token_usage",
totalTokens: response.usage.total_tokens,
threshold: 10000,
});
}
return response;
}
Correlation IDs with AsyncLocalStorage
The hardest part of logging is propagating trace context through deeply nested function calls. Node.js's AsyncLocalStorage solves this — any function in the async call stack can access the current request's trace context without explicit parameter passing.
// Correlation: connecting logs across services
import { AsyncLocalStorage } from "node:async_hooks";
interface RequestContext {
traceId: string;
spanId: string;
userId: string;
feature: string;
}
const contextStore = new AsyncLocalStorage<RequestContext>();
// Middleware: set context for entire request lifecycle
export function withRequestContext(
traceId: string,
userId: string,
feature: string,
fn: () => Promise<void>,
) {
const ctx: RequestContext = {
traceId,
spanId: crypto.randomUUID().slice(0, 8),
userId,
feature,
};
return contextStore.run(ctx, fn);
}
// Logger automatically includes request context
export function getLogger() {
const ctx = contextStore.getStore();
if (!ctx) return baseLogger;
return baseLogger.child({
traceId: ctx.traceId,
spanId: ctx.spanId,
userId: ctx.userId,
feature: ctx.feature,
});
}
// Any function in the call stack can get a correlated logger
async function searchDocuments(query: string) {
const log = getLogger(); // Automatically has traceId, userId, etc.
log.info({ event: "rag.search.start", query });
const results = await vectorDB.search(query);
log.info({
event: "rag.search.complete",
resultCount: results.length,
topScore: results[0]?.score,
});
return results;
}
What to Log (and What Not To)
Always Log
- Request start/end with latency
- LLM calls: model, tokens, latency, finish reason
- Tool calls: name, latency, success/failure
- Errors: type, message, retry count
- Cost: per-request calculated cost
- Decisions: why a tool or model was selected
Never Log (or log with caution)
- Full user messages (PII risk — truncate or hash; see the redaction sketch after this list)
- Full LLM outputs (storage cost — log length only)
- API keys or auth tokens
- Full RAG document contents (log metadata only)
- Full system prompts (version reference instead)
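A minimal redaction helper for the risky fields above; the preview length and truncated hash are illustrative choices, not a standard.
import { createHash } from "node:crypto";

// Truncate long user content and keep a stable hash so identical inputs
// can still be correlated across log entries without storing the raw text.
export function redactForLogging(text: string, maxChars = 100) {
  return {
    preview: text.slice(0, maxChars),
    length: text.length,
    sha256: createHash("sha256").update(text).digest("hex").slice(0, 16),
  };
}

// Usage: log.info({ event: "agent.request.start", input: redactForLogging(userMessage) });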
Metrics & KPIs for Agents
Latency, token usage, success rates, cost
Agent metrics go far beyond response time and error rate. You need to track latency distributions, token consumption, per-request cost, tool call accuracy, and response quality — all broken down by model, feature, and user segment.
The goal is to answer three questions at any time: Is the agent working correctly? How much is it costing? Where are the bottlenecks?
Time to First Token (TTFT)
How long until the user sees the first token of the response. Critical for perceived performance in streaming UIs.
Total Response Latency
End-to-end time from user input to complete response. Includes planning, tool calls, LLM generation, and post-processing.
Tool Call Latency
Time for each tool execution. Slow tools dominate agent latency since LLM calls block on tool results.
Input Tokens per Request
How much context the model receives. Tracks system prompt size, conversation history, RAG documents, and tool definitions. Directly correlates with cost.
Output Tokens per Request
How verbose the model's response is. High output tokens may indicate the model is over-explaining or generating unnecessary content.
Token Efficiency Ratio
Useful output tokens / total input tokens consumed. A falling ratio means you're stuffing more context for the same quality of output.
Cost per Request
Total LLM API cost for a single user request, including all LLM calls, retries, and sub-agent invocations.
Cost per Conversation
Total cost across all messages in a conversation. Rises with conversation length due to growing context windows.
Cost per User (daily/monthly)
Aggregated cost attributed to individual users. Identifies power users and potential abuse.
Success Rate (verified)
Percentage of requests that produce a verified-correct response. Not just 'didn't throw an error' — use eval checks for actual correctness.
Tool Call Accuracy
Percentage of tool calls where the agent selected the correct tool with valid arguments. Wrong tool selection is a major failure mode.
Hallucination Rate
Percentage of responses flagged by automated eval checks as containing fabricated information. Requires running evals on sampled production traffic.
Retry Rate
Percentage of requests that required at least one retry (LLM or tool). High retry rates indicate instability or poor prompt design.
Why Percentiles, Not Averages
An average latency of 1.5 seconds sounds fine. But if your p99 is 25 seconds, 1 in 100 users waits 25 seconds. With 10,000 daily requests, that's 100 terrible experiences per day. Percentiles (p50, p95, p99) reveal the full distribution. Use histogram metric types to compute percentiles efficiently.
Typical Agent Latency Budget (3-second target)
LLM generation dominates at 60% of total latency. Optimize the prompt/context size first — a 50% reduction in input tokens can reduce LLM latency by 30-40%.
Implementation with OpenTelemetry Metrics
import { metrics } from "@opentelemetry/api";
const meter = metrics.getMeter("agent-metrics");
// Latency histograms (use buckets that match your SLOs)
const requestLatency = meter.createHistogram("agent.request.latency_ms", {
description: "End-to-end request latency",
unit: "ms",
});
const ttft = meter.createHistogram("agent.ttft_ms", {
description: "Time to first token",
unit: "ms",
});
const toolLatency = meter.createHistogram("agent.tool.latency_ms", {
description: "Individual tool call latency",
unit: "ms",
});
// Token counters
const inputTokens = meter.createHistogram("agent.tokens.input", {
description: "Input tokens per request",
});
const outputTokens = meter.createHistogram("agent.tokens.output", {
description: "Output tokens per request",
});
// Cost histogram
const requestCost = meter.createHistogram("agent.cost_usd", {
description: "Cost per request in USD",
});
// Quality counters
const requestCounter = meter.createCounter("agent.requests.total", {
description: "Total requests by status",
});
const toolCallCounter = meter.createCounter("agent.tool_calls.total", {
description: "Tool calls by tool and status",
});
// Recording metrics in agent code
async function handleRequest(input: string, context: RequestContext) {
const start = performance.now();
const labels = {
model: context.model,
feature: context.feature,
};
try {
const result = await agent.execute(input, context);
const latencyMs = performance.now() - start;
requestLatency.record(latencyMs, { ...labels, status: "success" });
inputTokens.record(result.usage.inputTokens, labels);
outputTokens.record(result.usage.outputTokens, labels);
requestCost.record(result.estimatedCost, labels);
requestCounter.add(1, { ...labels, status: "success" });
if (result.ttftMs) {
ttft.record(result.ttftMs, labels);
}
for (const tool of result.toolCalls) {
toolCallCounter.add(1, {
tool: tool.name,
success: String(tool.success),
});
toolLatency.record(tool.latencyMs, { tool: tool.name });
}
return result;
} catch (error) {
requestLatency.record(performance.now() - start, { ...labels, status: "error" });
requestCounter.add(1, { ...labels, status: "error" });
throw error;
}
}
Debugging Agent Failures
Root cause analysis for non-deterministic systems
Debugging AI agents is fundamentally different from debugging traditional software. There are no stack traces for wrong answers, no breakpoints for hallucinations. The root cause often lies in what the model saw, not in the code that ran.
Non-determinism makes reproduction difficult — the same input can produce different outputs on each run. Without captured execution context, you cannot reliably investigate failures. This is why observability is a prerequisite for debugging, not an afterthought.
The Four Categories of Agent Failure
Every agent failure falls into one of these categories. Classifying the failure type is the first step to fixing it — each category has a different debugging approach and different fix.
Model Failure
The LLM itself produces incorrect output despite receiving good context. Includes hallucinations, reasoning errors, instruction non-compliance, and format violations.
How to Debug
1. Compare the model's input (full context window) against its output
2. Check if the answer contradicts information in the context
3. Verify that instructions in the system prompt were followed
4. Test with a different model to isolate model-specific issues
Context Failure
The model produced a reasonable output given what it saw, but it saw the wrong information. Includes missing documents, irrelevant RAG results, stale data, and context overflow.
How to Debug
1. Inspect the exact context window contents at the time of generation
2. Verify RAG retrieval quality — were the right documents retrieved?
3. Check token counts — was the context window overflowing?
4. Look for contradictions between retrieved documents
Tool Failure
A tool returns incorrect data, times out, or fails silently. The agent continues with bad or missing data, producing a subtly wrong response.
How to Debug
1. Check the tool span: what input did the tool receive? What did it return?
2. Look for empty or null tool responses that weren't flagged as errors
3. Verify tool response latency — did it time out?
4. Compare tool output against ground truth data
Orchestration Failure
The agent's planning or routing logic made a poor decision. Wrong tool selected, unnecessary steps taken, infinite loops, or premature termination.
How to Debug
1. Examine the plan span: what strategy did the agent choose and why?
2. Count the number of steps — is it reasonable for this query?
3. Look for repeated tool calls that suggest a retry loop
4. Check if the agent terminated before completing all necessary steps
The 6-Step Debugging Workflow
Start from the trace
Pull up the trace for the failing request using its trace ID. The trace tree gives you the full execution path.
Identify the failing span
Walk the trace tree. Which span produced the first incorrect output? Was it a planning step, an LLM call, or a tool execution?
Inspect inputs and outputs
For the failing span, examine what went in (context, parameters) and what came out (response, tool result). The bug is in the gap between expected and actual.
Classify the failure
Is it a model failure (bad output given good context), context failure (bad context), tool failure (bad data), or orchestration failure (bad plan)?
Reproduce with replay
Use the stored execution context (input, messages, tool results) to replay the exact same request. If you can reproduce it, you can fix it.
Fix and add a regression eval
Apply the fix and add this case to your evaluation dataset. Future changes will be tested against this failure case.
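A minimal sketch of step 6, appending the debugged case to a JSONL eval dataset; the file path and record shape are illustrative.
import { appendFile } from "node:fs/promises";

interface RegressionCase {
  traceId: string;          // links back to the original failing trace
  input: string;            // the user input that triggered the failure
  expectedBehavior: string; // what a correct response must contain or do
  failureCategory: "model" | "context" | "tool" | "orchestration";
  addedAt: string;
}

export async function addRegressionCase(c: RegressionCase) {
  await appendFile("evals/regressions.jsonl", JSON.stringify(c) + "\n");
}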
Replay Debugging: Reproducing Non-Deterministic Failures
The hardest part of debugging AI agents is that you can't just "re-run it with the same input" — the model might give a different (correct) answer the second time. Replay debugging solves this by capturing and storing the exact execution context so you can feed it back to the model later.
// Store execution context for replay debugging
interface ExecutionSnapshot {
traceId: string;
input: string;
systemPrompt: string;
messages: Message[];
toolResults: Record<string, unknown>;
retrievedDocuments: Document[];
modelConfig: { model: string; temperature: number };
timestamp: number;
}
// Capture on every request (or sample in production)
async function captureSnapshot(
traceId: string,
context: AgentContext,
): Promise<void> {
const snapshot: ExecutionSnapshot = {
traceId,
input: context.input,
systemPrompt: context.systemPrompt,
messages: context.messages,
toolResults: context.toolResults,
retrievedDocuments: context.retrievedDocs,
modelConfig: {
model: context.model,
temperature: context.temperature,
},
timestamp: Date.now(),
};
// Store in debug database (TTL: 30 days)
await debugStore.save(traceId, snapshot, { ttlDays: 30 });
}
// Replay a failing request with the exact same context
async function replayRequest(traceId: string): Promise<ReplayResult> {
const snapshot = await debugStore.get(traceId);
if (!snapshot) throw new Error(`No snapshot for trace ${traceId}`);
// Reconstruct the exact context the model saw
const replayResponse = await llm.generate({
model: snapshot.modelConfig.model,
temperature: snapshot.modelConfig.temperature,
system: snapshot.systemPrompt,
messages: snapshot.messages,
});
return {
originalTraceId: traceId,
replayTraceId: crypto.randomUUID(),
originalOutput: snapshot.messages.at(-1)?.content,
replayOutput: replayResponse.text,
matched: replayResponse.text === snapshot.messages.at(-1)?.content,
};
}
The Debugging Mindset Shift
In traditional software, you debug the code. In AI agents, you debug the context. The model is usually capable of producing the right answer — the question is whether it received the right information. Every debugging session should start with: "What did the model see when it made this decision?"
Observability Tools
LangSmith, Phoenix, Langfuse, Braintrust
The agent observability ecosystem has matured rapidly. From open-source platforms to managed services, there are now dedicated tools for every aspect of LLM monitoring. Here is a comprehensive comparison of the leading options.
The right tool depends on your constraints: team size, budget, existing infrastructure, and whether you need self-hosting. Most teams start with one managed platform and add OpenTelemetry as they scale.
LangSmith
Production monitoring and debugging platform from the LangChain team. Deep integration with LangChain and LangGraph, but works with any framework via the LangSmith SDK. Strong trace visualization with a playground for prompt iteration.
Key Features
- Trace visualization with span-level detail
- Prompt playground for iteration and testing
- Dataset management for evaluations
- Online evaluation runners
- Human annotation queues
- Comparison views for A/B testing prompts
Arize Phoenix
Open-source observability platform built specifically for LLM applications. Strong focus on trace visualization, retrieval analysis, and evaluation. Runs locally or in the cloud. Great for debugging RAG systems.
Key Features
- Open-source (Apache 2.0) — self-host anywhere
- Trace and span visualization
- Retrieval (RAG) quality analysis
- Embedding drift detection
- LLM evaluation framework
- OpenTelemetry-native instrumentation
Langfuse
Open-source LLM observability and analytics platform. Framework-agnostic with SDKs for Python and TypeScript. Includes prompt management, cost tracking, and evaluation features. Can be self-hosted.
Key Features
- Open-source (MIT) — full self-hosting support
- Framework-agnostic SDK (works with any LLM provider)
- Prompt management and versioning
- Cost tracking with model pricing tables
- Evaluation and scoring pipelines
- Session and user-level analytics
Braintrust
End-to-end platform combining evaluations, logging, and prompt playground. Strong focus on eval-driven development with built-in scoring functions. Also serves as an AI proxy for cost optimization.
Key Features
- Eval framework with built-in scoring functions
- Production logging and tracing
- Prompt playground with side-by-side comparison
- AI proxy with caching and fallbacks
- Dataset management for golden test sets
- CI/CD integration for automated eval runs
Helicone
Proxy-based observability that requires just one line of code to integrate. Routes your LLM API calls through Helicone's gateway, automatically capturing cost, latency, and usage metrics without SDK changes.
Key Features
- One-line integration (URL swap)
- Automatic cost and latency tracking
- Request/response logging
- Rate limiting and caching at the proxy layer
- User-level analytics and cost attribution
- Prompt threat detection
OpenTelemetry (DIY)
Build your own observability stack using the OpenTelemetry standard. Use OTEL SDKs for instrumentation and export to any backend — Jaeger, Grafana Tempo, Honeycomb, Datadog, or custom storage. Maximum flexibility, but requires more setup.
Key Features
- Vendor-neutral standard — no lock-in
- Works with any observability backend
- OpenLLMetry provides auto-instrumentation for LLM frameworks
- Full control over what you trace and store
- Integrates with existing infrastructure
- Semantic conventions for GenAI (emerging standard)
| Dimension | LangSmith | Arize Phoenix | Langfuse | Braintrust | Helicone | OpenTelemetry |
|---|---|---|---|---|---|---|
| Open Source | No | Yes (Apache 2.0) | Yes (MIT) | No | No | Yes (CNCF) |
| Self-Hostable | No | Yes | Yes | No | No | Yes |
| Tracing | Strong | Strong | Strong | Good | Basic | Excellent |
| Evals | Strong | Good | Good | Excellent | Basic | Manual |
| Cost Tracking | Good | Basic | Strong | Good | Excellent | Manual |
| Setup Effort | Low | Low | Low | Low | Minimal | High |
How to Choose
- Starting out? Pick Langfuse (open-source, generous free tier) or Helicone (one-line setup).
- Using LangChain/LangGraph? LangSmith has the deepest integration and best trace UI for these frameworks.
- Need evals + monitoring? Braintrust combines both in one platform, reducing tool sprawl.
- Enterprise with existing infra? OpenTelemetry + your existing backend (Datadog, Grafana) avoids adding another vendor.
- Privacy/compliance requirements? Langfuse or Phoenix can be self-hosted in your own infrastructure.
Building Dashboards
Real-time visibility into agent performance
A well-designed dashboard is the command center for your agent in production. It should answer three questions at a glance: Is it working? Is it correct? Is it affordable?
The difference between a useful dashboard and a vanity dashboard is actionability. Every panel should either tell you things are fine, or tell you exactly where to look when they are not. If a metric has never triggered an investigation, it does not belong on the dashboard.
Health Row
The first row every engineer looks at. Answers: 'Is the agent working right now?' If any panel is red, investigate immediately. Example queries for these panels are sketched after the list below.
Request Rate (RPM)
A sudden drop means something is broken upstream. A spike means unexpected traffic or a retry storm.
Error Rate by Type
Not just 'errors happened' but which type: rate limits, timeouts, tool failures, or model errors. Each has a different fix.
Latency Distribution (p50/p95/p99)
p50 tells you the typical experience. p99 tells you the worst experience. If p99 diverges from p50, you have outlier issues.
Active Agent Sessions
Track concurrency. If sessions spike, you may hit rate limits or exhaust your API quota.
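A sketch of what these panels might query, assuming the OpenTelemetry metrics defined earlier are exported to a Prometheus-style backend (dots become underscores, histograms gain a _bucket suffix); the queries and the error_type label are illustrative, not canonical.
// Hypothetical health-row panel definitions; adapt the query syntax to your backend.
const healthRowPanels = [
  {
    title: "Request Rate (RPM)",
    query: 'sum(rate(agent_requests_total[5m])) * 60',
  },
  {
    title: "Error Rate by Type",
    // assumes an error_type label is recorded alongside status on error counters
    query: 'sum by (error_type) (rate(agent_requests_total{status="error"}[5m]))',
  },
  {
    title: "Latency p99",
    query: 'histogram_quantile(0.99, sum by (le) (rate(agent_request_latency_ms_bucket[5m])))',
  },
];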
Quality Row
The hardest row to build but the most important. Answers: 'Are the agent's responses actually good?' Requires eval pipelines on sampled traffic.
Tool Call Success Rate by Tool
If one specific tool is failing while others are fine, the problem is localized. Drill down by tool name.
Eval-Flagged Hallucination Rate
Automated evals on sampled production traffic detect hallucinations. A rising rate means the model or context quality is degrading.
Completion Rate
What percentage of requests get a full response? Low completion means early terminations, timeouts, or context window overflows.
User Feedback Score
Thumbs up/down or star ratings from users. The ground truth signal for quality, but sparse and delayed.
Cost Row
Answers: 'How much is the agent costing, and where is the money going?' Essential for budget planning and catching runaway costs before the bill arrives.
Cost per Hour by Model
Identifies which models dominate your spend. If GPT-4o cost suddenly spikes, a prompt change may have increased token usage.
Tokens per Request (Input vs Output)
Rising input tokens = growing context. Rising output tokens = verbosity. Both directly increase cost.
Cost per User (Top 10)
Power users (or abuse) can dominate your spend. Set per-user budgets and catch outliers early.
Cost per Feature
Which features are expensive? A search feature using GPT-4o might be 10x costlier than one using GPT-4o-mini with no quality difference.
Dashboard Design Principles
Top-to-bottom severity
Health at the top (first thing you see), quality in the middle, cost at the bottom. In an incident, you read top-down.
Every panel answers one question
If you can't articulate what question a panel answers, delete it. 'What is our current error rate by type?' is good. 'Various metrics' is not.
Alert thresholds visible on panels
Draw horizontal lines on charts at your SLO thresholds. This makes it instantly clear when a metric is approaching danger.
Time range consistency
All panels should use the same time range. A mismatch between a 5-minute error rate and a 24-hour cost total creates confusion.
Drill-down links on every panel
Clicking an error rate panel should take you to filtered traces for those errors. Dashboards are for detection; traces are for investigation.
No more than 3 dashboards per team
An overview dashboard, a debugging dashboard, and a cost dashboard. More than that and nobody looks at any of them.
Example: Agent Health Overview
Requests/min: 142
Error Rate: 1.8%
p50 Latency: 1.2s
p99 Latency: 8.4s
A healthy dashboard at a glance: stable request rate, low error rate, acceptable p50, and a p99 that deserves investigation (shown in yellow).
The One-Dashboard Rule
If your team has to look at more than one dashboard to answer "is the agent healthy?", you have too many dashboards. Start with a single overview that covers health, quality, and cost. Only split into separate dashboards when the single view becomes too crowded to be useful at a glance.
Production Monitoring
Alerts, anomaly detection, and incident response
Production monitoring for AI agents requires a fundamentally different approach than traditional services. Agents fail in subtle ways — wrong answers, slowly degrading quality, invisible cost creep — that standard uptime checks will never catch.
The foundation is SLO-based alerting: define what "good" looks like (Service Level Objectives), measure against it continuously, and alert based on how fast you are consuming your error budget — not on individual errors.
Agent Response Quality SLO
Target: 99.5% of requests complete successfully in < 10 seconds
Error budget: 0.5% = ~3.6 hours of downtime per month
| Window | Burn Rate | Action | Severity |
|---|---|---|---|
| 1 hour | 14.4x | Page on-call | critical |
| 6 hours | 6x | Slack alert | warning |
| 24 hours | 3x | Ticket | info |
A fast burn (14.4x over a 1-hour window) consumes about 2% of the monthly error budget per hour and would exhaust it in roughly two days if it continued, so it pages immediately. A slow burn (6x over 6 hours) consumes about 5% of the budget and leaves days of headroom, warranting a Slack alert rather than a page.
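A small sketch of the burn-rate arithmetic behind this table, using the 0.5% budget defined above; the values in the comments are a worked example.
// Burn rate = observed error rate / error budget rate.
// At a burn rate of 1, the monthly error budget lasts exactly one month.
const ERROR_BUDGET = 0.005; // 99.5% SLO leaves a 0.5% error budget

function burnRate(observedErrorRate: number, budget = ERROR_BUDGET): number {
  return observedErrorRate / budget;
}

// Worked example: 7.2% of requests failing over the last hour
const rate = burnRate(0.072);                   // 14.4
const budgetConsumedPerHour = rate / (30 * 24); // 0.02, i.e. 2% of the monthly budget per hour
const hoursToExhaustion = (30 * 24) / rate;     // 50 hours if the burn continues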
Alert Severity Tiers
The most important rule in alerting: if everything is critical, nothing is critical. Tier your alerts by severity, and make sure each tier has a clear response expectation and routing.
Critical
Action: Page on-call engineer immediately
Example Triggers
- Error budget burn rate above 14.4x (1-hour window)
- Zero successful requests for 5+ minutes (total outage)
- p99 latency exceeds 30 seconds for 10+ minutes
- LLM API returning 500s for all requests
Warning
Action: Notify team via Slack, investigate during business hours
Example Triggers
- Error budget burn rate above 6x (6-hour window)
- Hourly cost exceeds 2x the 7-day average
- Hallucination rate above 3% (detected via evals)
- Token usage trending upward 20%+ week-over-week
Info
Action: Log for review, include in weekly reports
Example Triggers
- New model version deployed successfully
- Daily cost report summary
- Eval scores trending above baseline
- Rate limit approached but not exceeded
Common Anomaly Patterns in Agent Systems
These are the subtle failure modes that simple threshold alerts miss. Each requires a different detection strategy — often comparing current behavior against historical baselines rather than fixed thresholds.
Token Usage Drift
Average input tokens per request gradually increasing over days/weeks. Usually caused by growing conversation histories, expanding system prompts, or RAG returning more chunks.
Detection
Compare 7-day rolling average against 30-day baseline. Alert on > 20% increase (a detection sketch follows these patterns).
Impact
Cost increases linearly. Quality may degrade as context window fills. Eventually hits model limits.
Latency Bimodality
Latency distribution develops two peaks instead of one. Some requests complete in 1-2s while others take 10-15s. Often caused by tool timeouts or cache miss patterns.
Detection
Monitor the gap between p50 and p95. When p95 > 5x p50, investigate.
Impact
Inconsistent user experience. Slow requests may timeout at the client level.
Silent Quality Degradation
Success rate (non-error) stays high but eval-measured quality drops. The model is returning answers that don't throw errors but are subtly wrong or less helpful.
Detection
Run automated evals on 5-10% of production traffic. Track eval scores over time.
Impact
Users lose trust gradually. By the time you notice from feedback, many users have churned.
Cost Spike from Retry Storms
A tool or model intermittently fails, triggering retry logic. Each retry makes a full LLM call, multiplying cost. A single user request can generate 10-20 LLM calls.
Detection
Track retry count per request. Alert when any request exceeds 3 retries.
Impact
5-20x cost increase per affected request. Can exhaust rate limits, cascading to other users.
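A sketch of two of the detection heuristics above, token-usage drift and latency bimodality. The MetricQuery accessor is a hypothetical stand-in for whatever metrics backend you query; the thresholds mirror the patterns described, not a standard.
interface AnomalyCheck {
  name: string;
  triggered: boolean;
  detail: string;
}

// Hypothetical metrics accessor: returns an aggregate over a lookback window in days.
type MetricQuery = (metric: string, agg: "avg" | "p50" | "p95", windowDays: number) => Promise<number>;

export async function detectDrift(query: MetricQuery): Promise<AnomalyCheck[]> {
  const checks: AnomalyCheck[] = [];

  // Token usage drift: 7-day rolling average vs 30-day baseline, alert on > 20% increase.
  const tokens7d = await query("agent.tokens.input", "avg", 7);
  const tokens30d = await query("agent.tokens.input", "avg", 30);
  checks.push({
    name: "token_usage_drift",
    triggered: tokens7d > tokens30d * 1.2,
    detail: `7d avg ${tokens7d.toFixed(0)} vs 30d baseline ${tokens30d.toFixed(0)}`,
  });

  // Latency bimodality: investigate when p95 exceeds 5x p50.
  const p50 = await query("agent.request.latency_ms", "p50", 1);
  const p95 = await query("agent.request.latency_ms", "p95", 1);
  checks.push({
    name: "latency_bimodality",
    triggered: p95 > p50 * 5,
    detail: `p95 ${p95.toFixed(0)}ms vs p50 ${p50.toFixed(0)}ms`,
  });

  return checks;
}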
Agent-Specific Incident Response
Every alert needs a runbook. For AI agents, runbooks must include checking the LLM provider status, examining recent prompt/model changes, and verifying context quality — not just checking server health.
// Agent-specific incident response playbook
interface IncidentPlaybook {
symptom: string;
severity: "critical" | "warning";
steps: string[];
escalation: string;
}
const playbooks: IncidentPlaybook[] = [
{
symptom: "Error rate exceeds SLO",
severity: "critical",
steps: [
"Check LLM provider status page (OpenAI, Anthropic)",
"Review error type breakdown — is it rate limits, timeouts, or 500s?",
"If rate limits: enable fallback model routing",
"If timeouts: check tool service health",
"If 500s: check for recent deployment (prompt/model change)",
"If no provider issue: check recent code/config changes",
],
escalation: "If unresolved in 15 minutes, escalate to platform team",
},
{
symptom: "Cost anomaly detected (2x+ hourly average)",
severity: "warning",
steps: [
"Identify which model/feature is driving the spike",
"Check for retry storms: look for requests with retries > 3",
"Check for token usage spikes: was a system prompt changed?",
"Check for traffic spikes: is it legitimate traffic or abuse?",
"If abuse: enable per-user rate limiting",
"If retry storm: fix the failing tool/model and reset",
],
escalation: "If cost exceeds $500/hour, page finance + engineering lead",
},
{
symptom: "Hallucination rate spike",
severity: "warning",
steps: [
"Check if a new model version was deployed",
"Verify RAG retrieval quality — are correct docs being returned?",
"Check system prompt: was it recently modified?",
"Review recent eval results for regression patterns",
"If model change: rollback to previous model version",
"If context issue: investigate RAG pipeline health",
],
escalation: "If affects > 5% of requests, escalate to AI/ML team",
},
];
// Automated incident detection
async function checkAgentHealth(
metricsClient: MetricsClient,
): Promise<IncidentAlert[]> {
const alerts: IncidentAlert[] = [];
const now = Date.now();
// Check error budget burn rate
const errorRate1h = await metricsClient.query(
"rate(agent.errors[1h]) / rate(agent.requests[1h])"
);
const monthlyBudget = 0.005; // 99.5% SLO = 0.5% error budget
if (errorRate1h > monthlyBudget * 14.4) {
alerts.push({
type: "error_budget_fast_burn",
severity: "critical",
message: `Error budget burning at ${(errorRate1h / monthlyBudget).toFixed(1)}x monthly rate`,
runbook: "https://wiki/runbooks/agent-error-budget",
timestamp: now,
});
}
// Check cost anomaly
const hourlyCost = await metricsClient.query("sum(agent.cost_usd[1h])");
const avgHourlyCost7d = await metricsClient.query(
"avg_over_time(sum(agent.cost_usd[1h])[7d:])"
);
if (hourlyCost > avgHourlyCost7d * 2) {
alerts.push({
type: "cost_anomaly",
severity: "warning",
message: `Hourly cost $${hourlyCost.toFixed(2)} is ${(hourlyCost / avgHourlyCost7d).toFixed(1)}x the 7-day average`,
runbook: "https://wiki/runbooks/agent-cost",
timestamp: now,
});
}
return alerts;
}
The Post-Incident Review Question
Traditional post-mortems ask "what code failed?" Agent post-incident reviews need additional questions: What context did the agent see? Was it a model failure or a context failure? Would better observability have caught this earlier? Always add the failing case to your eval dataset.
Cost Tracking & Optimization
Know where every token dollar goes
LLM costs are the most unpredictable line item in your infrastructure budget. Unlike traditional compute where costs scale linearly with traffic, agent costs can spike 10-50x based on prompt changes, retry loops, or context window growth — often without anyone noticing until the bill arrives.
The solution is per-request cost attribution: know exactly how much every request costs, attribute it to a user, feature, and model, and set alerts before spending spirals. You need to answer "where is every token dollar going?" at any time.
Typical Agent Cost Breakdown
LLM tokens (input + output) dominate at 60%. Output tokens are particularly expensive — 3-5x the cost of input tokens for most models. Retries add 15% on average; optimizing retry logic has outsized cost impact.
| Model | Input | Output | Best For |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Complex reasoning, coding |
| GPT-4o-mini | $0.15 | $0.60 | Simple tasks, classification |
| Claude Sonnet 4 | $3.00 | $15.00 | Complex analysis, coding |
| Claude Haiku 3.5 | $0.80 | $4.00 | Quick tasks, routing |
| Gemini 2.0 Flash | $0.10 | $0.40 | High-volume, simple |
Output tokens cost 3-5x more than input tokens. A 1,000-token response from GPT-4o costs $0.01 — but that's $10K/month at 1M requests. Pricing as of early 2025; check provider sites for current rates.
Cost Optimization Strategies
These five strategies, applied together, can reduce agent costs by 60-80% without meaningful quality degradation. Start with model routing — it typically has the highest ROI.
Model Routing
Not every request needs GPT-4o. Route simple queries (FAQ, classification, extraction) to cheaper models and reserve expensive models for complex reasoning. Use a lightweight classifier to determine complexity; a routing sketch follows the steps below.
Implementation Steps
1. Build a complexity classifier (can be rule-based or a cheap LLM call)
2. Route simple queries to GPT-4o-mini or Haiku ($0.15-0.80/MTok vs $2.50-3.00/MTok)
3. Route complex queries to GPT-4o or Claude Sonnet
4. Track quality metrics per route to ensure cheap models aren't degrading output
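A minimal routing sketch along the lines of the steps above; the patterns, length cutoff, and model names are illustrative, and a production router would also log its decision for the quality tracking in step 4.
// Route cheap, well-structured queries to a small model and everything else
// to a larger one. A rule-based heuristic keeps the router itself nearly free.
type Route = { model: string; reason: string };

const SIMPLE_PATTERNS = [/^(what|when|where|who) is\b/i, /\bclassify\b/i, /\bextract\b/i];

export function routeByComplexity(query: string): Route {
  const looksSimple =
    query.length < 200 && SIMPLE_PATTERNS.some((p) => p.test(query));

  return looksSimple
    ? { model: "gpt-4o-mini", reason: "short query matched a simple-intent pattern" }
    : { model: "gpt-4o", reason: "defaulted to the larger model for complex reasoning" };
}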
Prompt Caching
If your system prompt and tool definitions are identical across requests, prompt caching lets you pay full price for those tokens once and a heavily discounted rate on subsequent requests. Both Anthropic and OpenAI offer native prompt caching; a caching sketch follows the steps below.
Implementation Steps
1. Structure prompts with a static prefix (system prompt, tools) and a dynamic suffix (user input)
2. Enable prompt caching on your LLM provider (explicit cache_control for Anthropic, automatic for OpenAI)
3. Monitor cache hit rate — should be > 80% for repetitive workloads
4. Keep the static prefix stable — changing it invalidates the cache
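A sketch of the static-prefix structure using Anthropic's explicit cache_control marker. The model name matches the pricing table above; the system prompt constant is a stand-in, and the exact caching API should be verified against your provider's current docs.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const STATIC_SYSTEM_PROMPT = "You are a support agent..."; // assumed identical across requests

async function cachedCall(userInput: string) {
  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    // Static prefix marked cacheable; only the user message changes per request.
    system: [
      { type: "text", text: STATIC_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: userInput }],
  });
}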
Context Window Optimization
Every token in the context window costs money. Reduce input tokens by compressing conversation history, limiting RAG results, and trimming tool definitions to only what's needed for each request.
Implementation Steps
1. Implement conversation compaction at 60-80% of context window capacity
2. Limit RAG retrieval to top-3 chunks with a minimum relevance threshold of 0.7
3. Use dynamic tool selection to include only relevant tool definitions
4. Compress tool outputs — return structured summaries instead of raw API responses
Per-Request Budget Caps
Set maximum token and cost budgets per request. If the agent exceeds the budget (e.g., from a retry loop), terminate gracefully instead of continuing to burn tokens.
Implementation Steps
1. Set per-request token limits (e.g., max 50K input tokens, max 4K output tokens)
2. Set per-request cost caps (e.g., $0.50 max per request)
3. Implement circuit breakers: stop retrying after 3 failures
4. Alert when any request exceeds 5x the average cost
Output Token Limits
Output tokens cost 3-5x more than input tokens for most models. Instruct the model to be concise, use max_tokens limits, and avoid unnecessary reasoning in the response.
Implementation Steps
1. Set max_tokens to a reasonable limit for each use case
2. Include 'be concise' instructions in the system prompt
3. For internal agent steps (planning, tool selection), use shorter max_tokens
4. Only allow verbose output for user-facing final responses
Per-Request Budget Enforcement
Wrap your LLM client with a cost tracker that calculates cost on every call and enforces budget limits. This catches runaway requests before they burn through your budget.
// Comprehensive cost tracking and budget management
const MODEL_COSTS_PER_MTOK: Record<string, { input: number; output: number }> = {
"gpt-4o": { input: 2.50, output: 10.00 },
"gpt-4o-mini": { input: 0.15, output: 0.60 },
"claude-sonnet-4-20250514": { input: 3.00, output: 15.00 },
"claude-haiku-3.5": { input: 0.80, output: 4.00 },
};
interface CostTracker {
traceId: string;
userId: string;
feature: string;
totalCostUsd: number;
breakdown: CostEntry[];
}
interface CostEntry {
model: string;
inputTokens: number;
outputTokens: number;
costUsd: number;
step: string;
}
function calculateCost(
model: string,
inputTokens: number,
outputTokens: number,
): number {
const pricing = MODEL_COSTS_PER_MTOK[model];
if (!pricing) throw new Error(`Unknown model: ${model}`);
return (
(inputTokens / 1_000_000) * pricing.input +
(outputTokens / 1_000_000) * pricing.output
);
}
async function runAgentWithBudget(
input: string,
context: RequestContext,
budgetUsd: number = 0.50,
): Promise<AgentResult> {
const tracker: CostTracker = {
traceId: context.traceId,
userId: context.userId,
feature: context.feature,
totalCostUsd: 0,
breakdown: [],
};
const wrappedLlm = {
async generate(params: LlmParams): Promise<LlmResponse> {
const response = await llm.generate(params);
const cost = calculateCost(
params.model,
response.usage.inputTokens,
response.usage.outputTokens,
);
tracker.totalCostUsd += cost;
tracker.breakdown.push({
model: params.model,
inputTokens: response.usage.inputTokens,
outputTokens: response.usage.outputTokens,
costUsd: cost,
step: params.step ?? "unknown",
});
// Budget enforcement
if (tracker.totalCostUsd > budgetUsd) {
log.warn({
event: "budget.exceeded",
traceId: context.traceId,
spent: tracker.totalCostUsd,
budget: budgetUsd,
});
throw new BudgetExceededError(tracker);
}
return response;
},
};
try {
const result = await agent.execute(input, { ...context, llm: wrappedLlm });
// Record cost metrics
await metrics.recordCost(tracker);
return result;
} catch (error) {
await metrics.recordCost(tracker);
throw error;
}
}
The Cost Observability Stack
At minimum, you need three things: (1) Per-request cost calculation using model pricing tables and actual token counts. (2) Cost attribution by user, feature, and model in your metrics system. (3) Budget alerts at the request level ($0.50 max), daily level (per-user caps), and monthly level (feature budgets). Without all three, you are flying blind on the fastest-growing line item in your infrastructure budget.
Interactive Examples
See observability patterns in action with live code
See agent observability in action. Each example shows a bad pattern and its observable fix. Toggle between them to understand the difference.
Capturing agent execution details
// BAD: Unstructured logging that's impossible to query
async function runAgent(input: string) {
console.log("Starting agent...");
console.log("Input: " + input);
const response = await llm.generate({ messages: [{ role: "user", content: input }] });
console.log("Got response: " + response.text);
if (response.toolCalls.length > 0) {
console.log("Tool calls: " + JSON.stringify(response.toolCalls));
for (const call of response.toolCalls) {
const result = await executeTool(call);
console.log("Tool result: " + JSON.stringify(result));
}
}
console.log("Done!");
return response;
}
Why this fails
Unstructured console.log produces flat text that can't be queried, aggregated, or correlated. When something fails in production at 3 AM, you're grepping through megabytes of text with no way to filter by request, trace, or severity.
All Examples Quick Reference
Agent Logging
Capturing agent execution details
Distributed Tracing
Tracing multi-step agent executions
Agent Metrics Collection
Tracking the right KPIs for agent performance
Error Context Capture
Capturing enough context to debug failures
Cost Attribution
Tracking token costs per feature and user
Dashboard Design
Building actionable observability dashboards
Intelligent Alerting
Alerts that wake you up for the right reasons
Anti-Patterns & Failure Modes
Log blindness, metric vanity, and how to avoid them
Knowing what not to do is as important as knowing what to do. These are the most common observability anti-patterns in AI agent systems — each one has been observed in production systems and leads to predictable failure modes.
Log Blindness
Agent runs in production with no structured logging, making it impossible to understand what happened after the fact.
Cause
Using console.log with unstructured strings, or not logging at all. No correlation IDs between related log entries. Logging the wrong things (raw prompts instead of metadata).
Symptom
When a user reports a bad response, the team spends hours trying to reproduce it. Debugging requires reading raw log files with grep. No way to answer 'what did the agent see when it made this decision?'
Fix
Implement structured logging from day one. Every log entry should be JSON with a traceId, event type, timestamp, and relevant metadata. Use a logging library like Pino or Winston with child loggers for context propagation.
Broken Trace Hierarchy
Multi-step agent executions have no parent-child relationships, making it impossible to reconstruct the full execution path.
Cause
Each LLM call and tool execution is logged independently without a shared trace ID. When the agent calls a tool that calls another service, the trace breaks.
Symptom
You can see individual LLM calls but can't connect them into a full agent execution. Debugging a 5-step agent requires manually correlating timestamps across different log streams.
Fix
Use distributed tracing (OpenTelemetry) with span hierarchies. Every agent execution gets a root span, and each LLM call, tool execution, and sub-agent invocation gets a child span with the parent's trace context propagated.
Metric Vanity
Dashboards show impressive numbers (total requests, average latency, success rate) that never reveal actual problems.
Cause
Tracking averages instead of percentiles. Counting 'success' as 'didn't throw an error' rather than 'produced a correct response'. No breakdown by feature, model, or user segment.
Symptom
Dashboard shows 99% success rate and 500ms average latency while users complain about slow, wrong answers. P99 latency is actually 15 seconds. 'Successful' responses include hallucinated answers.
Fix
Replace averages with percentile distributions (p50, p95, p99). Define 'success' as verified-correct, not just non-error. Break down every metric by model, feature, user segment, and error type. If a metric can't trigger an action, remove it from the dashboard.
Cost Blindness
No per-request or per-feature cost tracking. The team discovers cost problems when the monthly invoice arrives.
Cause
Token usage and model pricing aren't tracked at the request level. No cost attribution to features, users, or conversations. No budget alerts or spending limits.
Symptom
Monthly LLM bill doubles with no explanation. One verbose system prompt or retry loop silently consumes thousands of dollars. The team can't answer 'which feature costs the most?' or 'which users are most expensive?'
Fix
Calculate cost per request using model pricing tables. Attribute costs to features, users, and conversation IDs. Set per-request cost alerts (e.g., flag anything over $0.50). Implement daily budget caps per feature and user.
Alert Fatigue
Too many low-quality alerts cause the on-call team to mute notifications, missing critical incidents.
Cause
Alerting on every individual error instead of error rates. No severity tiers. No distinction between transient failures and systematic problems. Static thresholds instead of anomaly detection.
Symptom
200+ alerts per day, all with the same severity. On-call engineer mutes Slack channel after the first shift. A real production incident at 2 AM goes unnoticed for hours because nobody is watching.
Fix
Use SLO-based alerting with error budget burn rates. Tier alerts: info goes to logs, warnings go to Slack, only critical issues page on-call. Every alert must have a runbook. If an alert fires more than 5 times without action, fix it or delete it.
Irreproducible Failures
When an agent produces a bad result, it's impossible to reproduce because the exact inputs, context, and state aren't captured.
Cause
Not storing the full message history, tool results, and retrieved documents that the agent saw at execution time. Relying on 'just re-run it' in a non-deterministic system.
Symptom
User reports a bad answer. Developer tries to reproduce it with the same input but gets a different (correct) answer. The bug can't be investigated because the original context is gone forever.
Fix
Store the full execution context (input, system prompt, messages, tool results, retrieved documents, model parameters) for every request, keyed by trace ID. Build a replay tool that can feed this exact context back to the model for deterministic debugging.
Best Practices Checklist
Production-ready observability guidelines
Production-ready observability guidelines distilled from SRE practices, LLM platform teams, and the observability community. These are the standards every agent system should meet before going to production.
Assign a trace ID to every request from the start
Generate a UUID at the API gateway and propagate it through every LLM call, tool execution, and sub-agent invocation. This is the single most important observability practice.
Use structured JSON logs, never console.log
Every log entry should be a JSON object with traceId, timestamp, event type, and relevant metadata. Use libraries like Pino (Node.js) for zero-overhead structured logging.
Log decisions, not just actions
Don't just log 'called tool X'. Log why: 'selected tool X because query matched pattern Y with confidence 0.87'. Decision context is the most valuable debugging information.
Implement trace sampling in high-volume production
At scale, tracing every request is expensive. Sample 10-20% of normal traffic, but always trace 100% of errors and slow requests (tail-based sampling).
Track percentiles, not averages
P50, P95, and P99 latencies reveal what users actually experience. An average of 500ms hides the fact that 1% of requests take 30 seconds.
Define 'success' as verified-correct, not non-error
An LLM response that doesn't throw an error but hallucinates an answer is not a success. Use eval-based quality checks to track actual correctness.
Break down every metric by model, feature, and user segment
Aggregate metrics hide problems. A 95% success rate overall might mean 99% for simple queries and 70% for complex reasoning tasks.
Build dashboards organized by concern
Group panels into health (latency, errors, throughput), quality (correctness, hallucination rate), and cost (tokens, spend). Every panel should answer a specific operational question.
Use SLO-based alerting with error budget burn rates
Alert based on how fast you're consuming your error budget, not on individual errors. A fast burn (14.4x over a 1-hour window) pages immediately. A slow burn (6x over a 6-hour window) sends a Slack notification.
Every alert must have a runbook
An alert without a runbook is just noise. Document what the alert means, likely causes, and step-by-step investigation and remediation procedures.
Tier alert severity and routing
Info goes to logs, warnings go to Slack, only critical issues page on-call. If everything is critical, nothing is.
Conduct post-incident reviews for AI-specific failures
Traditional post-mortems need adaptation for AI systems. Include: what context did the agent see? Was it a model failure or a context failure? Would better observability have caught this earlier?
Classify every failure: model, context, tool, or orchestration
The fix depends on the failure type. A hallucination (model failure) needs a different fix than missing RAG documents (context failure) or a tool timeout (tool failure). Trace-based debugging makes classification possible.
Store full execution context for replay debugging
Capture the input, system prompt, messages, tool results, and model parameters for every request (or a sample). This enables reproducing non-deterministic failures by feeding the exact same context back to the model.
Build a debugging workflow that starts from the trace
Every investigation should start with the trace ID. Pull up the trace tree, identify the failing span, inspect its inputs and outputs, classify the failure, and then fix. Never start by guessing.
Add failing cases to your eval dataset automatically
When you debug and fix a failure, add the input/expected output pair to your evaluation dataset. This prevents regression and builds a growing safety net of test cases from real production issues.
Track cost per request, per feature, and per user
Calculate cost using model pricing tables at the request level. Attribute to features and users. This is the only way to answer 'where is the money going?'
Set budget alerts and per-request cost caps
Alert when a single request exceeds $0.50 or when hourly spend exceeds 2x the 7-day average. Implement circuit breakers for runaway costs.
Monitor token efficiency over time
Track the ratio of useful output tokens to total tokens consumed. A rising ratio means your context engineering is improving. A falling ratio means you're wasting tokens on noise.
Implement model routing based on complexity
Not every request needs GPT-4o. Route simple queries to cheaper models and save expensive models for complex reasoning. Track cost savings from routing decisions.
The Guiding Principle
At any point in time, you should be able to answer three questions about your agent: Is it working correctly? How much is it costing? Where are the bottlenecks? If you cannot answer all three from your observability data within 60 seconds, your observability stack has gaps.
— Adapted from SRE best practices applied to AI systems
Resources & Further Reading
Tools, docs, repos, and guides
Essential tools, documentation, repositories, and guides for building observable AI agent systems. From open-source platforms to industry best practices.
Langfuse — Open Source LLM Observability
Open-source LLM observability platform with tracing, prompt management, evaluations, and cost tracking. Self-hostable.
LangSmith — LangChain Observability Platform
Production monitoring and debugging platform for LLM applications. Deep integration with LangChain/LangGraph but works with any framework.
Arize Phoenix — ML & LLM Observability
Open-source observability for LLMs with trace visualization, retrieval analysis, and evaluation tools. Built for debugging RAG and agent systems.
Helicone — LLM Gateway & Observability
Proxy-based LLM observability that requires just one line of code. Automatic cost tracking, latency monitoring, and request logging.
Braintrust — AI Product Evaluation
End-to-end platform for evaluating, monitoring, and improving AI products. Combines evals, logging, and prompt playground.
OpenTelemetry for LLM Observability
The vendor-neutral standard for distributed tracing. Use OTEL SDKs to instrument your agents and export to any backend.
OpenLLMetry — OpenTelemetry for LLMs
Open-source auto-instrumentation for LLM frameworks using OpenTelemetry. Automatically traces LangChain, OpenAI, and more.
Honeycomb Guide to Observability
Charity Majors' team explains the difference between monitoring and observability. Essential reading for understanding observability principles.
Observability for LLM-based Agents
LangChain's guide to building observable LLM applications. Covers tracing, evaluation, and production monitoring patterns.
Building Observable AI Systems
Anthropic's engineering blog with insights on building production AI systems including monitoring and evaluation best practices.
Recommended Starting Point
If you are just getting started with agent observability, begin with Langfuse (open-source, generous free tier) for tracing and cost tracking, read the Honeycomb guide to understand observability principles, and add OpenTelemetry instrumentation for vendor-neutral trace collection. You can always switch backends later without re-instrumenting your code.