The Complete Guide — 2025 Edition

Agent
Observability
Academy

You can't fix what you can't see. Learn how to trace, debug, and monitor AI agents in production — from structured logging to real-time dashboards and cost tracking.

13 Chapters · 7 Code Examples · 6 Anti-Patterns · 10 Resources · 6 Tool Comparisons
1

What is Agent Observability?

Why logging isn't enough for AI agents

Agent observability is the ability to understand what your AI agent is doing, why it made specific decisions, and how it performs — without deploying new code. It goes far beyond traditional logging.

Traditional software is deterministic: the same input produces the same output, and a stack trace tells you exactly where things broke. AI agents are non-deterministic: the same input can produce different outputs, failures are often subtle (wrong answer, not crash), and the root cause might be in the context the model saw, not in the code.

Why It's Different

Traditional Software Monitoring

  • Errors throw exceptions with stack traces
  • Same input always produces same output
  • Failures are binary: works or crashes
  • Response time is the main performance metric
  • Debugging = reading code + stack trace

Agent Observability

  • Failures are often subtle (wrong answer, not crash)
  • Same input can produce different outputs each time
  • Quality exists on a spectrum, not binary
  • Latency, tokens, cost, and correctness all matter
  • Debugging = understanding what the model saw

Key Insight

When an AI agent produces a bad result, the question isn't “what line of code failed?” — it's “what did the model see when it made that decision?” Agent observability gives you the tools to answer that question.

The Three Pillars

Traces

The full execution path of an agent request — from user input through LLM calls, tool executions, and sub-agent invocations to final output. Traces show you the 'what happened' of every request.

User query -> Plan (200ms) -> Search tool (800ms) -> LLM synthesis (1.2s) -> Response
Metrics

Quantitative measurements aggregated over time — latency distributions, token usage, error rates, costs. Metrics show you the 'how is it performing' across all requests.

p50 latency: 1.2s | p99: 8.4s | error rate: 2.3% | cost/req: $0.04
Logs

Structured event records with context — decisions made, tools selected, errors encountered. Logs show you the 'why did it do that' for individual events.

{"traceId":"abc-123","event":"tool.selected","tool":"search","reason":"query matched knowledge_base pattern","confidence":0.87}

Common Blind Spots Without Observability

Hallucination in the middle of a chain

The LLM hallucinates a fact in step 2 of a 5-step execution. Steps 3-5 build on this false premise. Without tracing, the final output looks wrong, but you can't tell which step introduced the error.

Silent tool failures

A tool returns an empty result instead of throwing an error. The agent continues with no data, producing a vague or generic response. Without observability, this looks like a model quality issue, not a tool failure.

Context window overflow

The agent's conversation history grows past the effective attention window. The model starts ignoring earlier instructions. Performance degrades gradually — no error, no crash, just silently worse outputs.

Cost spikes from retry loops

A tool validation failure triggers a retry loop. The agent makes 15 LLM calls for a single user request, consuming $2 in tokens instead of the expected $0.04. Without cost tracking, this goes unnoticed until the bill arrives.

Observability is not about collecting data. It's about being able to ask arbitrary questions about your system without deploying new code.

Charity Majors

CTO, Honeycomb

The hardest bugs in AI systems are the ones where the model confidently does the wrong thing. Without tracing, you'll never know why.

Harrison Chase

CEO, LangChain

If you can't see what your agent is doing at every step, you don't have a production system. You have a demo.

Jason Liu

Creator, Instructor (structured outputs)

2

Tracing Agent Execution

Following the chain from input to output

Tracing is the backbone of agent observability. A trace captures the entire execution path of a single request — every LLM call, tool execution, and decision point — as a tree of spans with precise timing and metadata.

Without tracing, a multi-step agent is a black box. You see the input and output, but not the 5-10 intermediate steps that produced the result. When something goes wrong, you're guessing which step failed.

Core Concepts

Root Span

The top-level span representing the entire agent execution. Every trace has exactly one root span. Its duration is the total request latency.

Child Span

A sub-operation within a parent span. LLM calls, tool executions, and retrieval steps are typically child spans of the root.

Span Attributes

Key-value metadata attached to a span: model name, token counts, tool name, success/failure, latency, and custom business context.

Span Events

Timestamped log entries within a span. Useful for recording specific moments like 'rate limited, retrying' or 'context window at 80% capacity'.

Anatomy of a Trace Tree

A trace tree shows the parent-child relationships between spans. Each span has a name, duration, and attributes. The tree structure reveals the execution flow and where time is spent.

Example: Agent trace tree (JSON)
// A typical agent trace tree structure
{
  "traceId": "abc-123-def-456",
  "rootSpan": {
    "name": "agent.handle_query",
    "duration": "3,200ms",
    "attributes": { "userId": "usr_42", "model": "gpt-4o" },
    "children": [
      {
        "name": "agent.plan",
        "duration": "450ms",
        "attributes": { "strategy": "decompose", "steps": 3 }
      },
      {
        "name": "agent.step.retrieve",
        "duration": "320ms",
        "attributes": { "tool": "vector_search", "results": 5 }
      },
      {
        "name": "agent.step.llm_call",
        "duration": "1,800ms",
        "attributes": {
          "model": "gpt-4o",
          "inputTokens": 4200,
          "outputTokens": 850,
          "cost": "$0.053"
        }
      },
      {
        "name": "agent.step.tool_call",
        "duration": "280ms",
        "attributes": {
          "tool": "calculator",
          "input": "revenue * 0.15",
          "success": true
        }
      },
      {
        "name": "agent.synthesize",
        "duration": "350ms",
        "attributes": { "outputTokens": 200 }
      }
    ]
  }
}

Reading a trace tree

This trace took 3,200ms total. The LLM call consumed 56% of the time (1,800ms). The planning step generated 3 steps. Vector search found 5 results. The calculator tool succeeded. Total cost was $0.053. If the final answer is wrong, you can pinpoint exactly which step introduced the error.

Trace Waterfall (visual representation)

agent.handle_query
3,200ms
agent.plan
450ms
agent.step.retrieve
320ms
agent.step.llm_call
1,800ms
agent.step.tool_call
280ms
agent.synthesize
350ms

The waterfall view instantly reveals that the LLM call dominates latency. Optimizing the LLM step (prompt caching, smaller context, cheaper model) would have the biggest impact.

Implementing Traces with OpenTelemetry

OpenTelemetry (OTEL) is the vendor-neutral standard for distributed tracing. It works with every observability backend — Jaeger, Grafana Tempo, Honeycomb, Langfuse, and more. Here's a production setup for Node.js agents.

OpenTelemetry setup for agent tracing
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { trace, SpanStatusCode, type Span } from "@opentelemetry/api";

// Initialize OpenTelemetry SDK
const sdk = new NodeSDK({
  resource: new Resource({
    "service.name": "my-agent",
    "service.version": "1.0.0",
    "deployment.environment": "production",
  }),
  traceExporter: new OTLPTraceExporter({
    url: "https://otel-collector.example.com/v1/traces",
  }),
});
sdk.start();

// Create a tracer for agent operations
const tracer = trace.getTracer("agent-tracer");

// Helper to wrap agent operations in spans
export async function withSpan<T>(
  name: string,
  attributes: Record<string, string | number | boolean>,
  fn: (span: Span) => Promise<T>,
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    span.setAttributes(attributes);
    try {
      const result = await fn(span);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : "Unknown error",
      });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}

// Usage in agent code
async function handleQuery(query: string, userId: string) {
  return withSpan("agent.query", { userId, queryLength: query.length }, async (rootSpan) => {
    // Plan step — automatically a child of rootSpan
    const plan = await withSpan("agent.plan", { query }, async (planSpan) => {
      const result = await planner.createPlan(query);
      planSpan.setAttribute("stepCount", result.steps.length);
      return result;
    });

    // Execute each step
    for (const [i, step] of plan.steps.entries()) {
      await withSpan(
        `agent.step.${step.type}`,
        { stepIndex: i, tool: step.tool ?? "none" },
        async (stepSpan) => {
          const result = await executeStep(step);
          stepSpan.setAttribute("success", result.success);
          stepSpan.setAttribute("outputTokens", result.tokens ?? 0);
          return result;
        },
      );
    }
  });
}

Tracing Best Practices

  • Name spans by operation, not implementation: Use "agent.plan" not "callGPT4ForPlanning". You want stable names even when you swap models.
  • Always record token counts and model name on LLM spans: This is essential for cost attribution and performance analysis.
  • Record tool inputs and outputs (truncated) on tool spans: When a tool returns bad data, you need to see what it received and returned.
  • Use semantic conventions: Follow OpenTelemetry semantic conventions for GenAI when they become stable. This ensures compatibility across tools.
3

Structured Logging

Logs that machines and humans can read

Structured logging means emitting logs as typed, machine-parseable JSON objects instead of freeform text strings. Every log entry has a consistent schema with fields for trace ID, event type, timestamps, and relevant metadata.

The difference is stark: console.log("Tool failed: search") is a dead end for debugging. {"event":"tool.error","tool":"search","error":"timeout","latencyMs":5000,"traceId":"abc"} can be queried, aggregated, alerted on, and correlated with other events.

Structured vs Unstructured

Unstructured (text strings)

[2025-07-15 14:32:01] Agent started
[2025-07-15 14:32:01] Input: What's my balance?
[2025-07-15 14:32:02] Calling GPT-4o...
[2025-07-15 14:32:04] Got response (took 2.1s)
[2025-07-15 14:32:04] Using 4521 tokens
[2025-07-15 14:32:04] Calling tool: get_account
[2025-07-15 14:32:05] Tool returned data
[2025-07-15 14:32:05] Done!

Can't filter, can't aggregate, can't correlate across requests. Finding "all requests slower than 5 seconds" means parsing text.

Structured (JSON objects)

{"traceId":"a1b","event":"agent.start","ts":1721...}
{"traceId":"a1b","event":"llm.call","model":"gpt-4o","latencyMs":2100,"tokens":4521,"cost":0.032}
{"traceId":"a1b","event":"tool.call","tool":"get_account","latencyMs":890,"success":true}
{"traceId":"a1b","event":"agent.complete","totalLatencyMs":3200,"totalCost":0.032}

Every field is queryable. Find slow requests: latencyMs > 5000. Cost by model: sum(cost) group by model.

Log Levels for AI Agents

Traditional log levels need reinterpretation for agents. A "warning" isn't just "something might be wrong" — it's "the agent is degrading in a way that affects user experience."

ERROR

When to use: Unrecoverable failures — LLM API errors, tool crashes, rate limit exhaustion, context window overflow

{"level":"error","traceId":"abc","event":"llm.error","error":"RateLimitExceeded","model":"gpt-4o","retries":3}
WARN

When to use: Degraded behavior — slow responses, high token usage, tool retries, fallback to cheaper model, approaching context limits

{"level":"warn","traceId":"abc","event":"context.high_usage","tokensUsed":95000,"limit":128000,"percentUsed":74}
INFO

When to use: Normal operations — request start/end, LLM call complete, tool execution success, agent step transitions

{"level":"info","traceId":"abc","event":"tool.executed","tool":"search","latencyMs":245,"success":true,"results":5}
DEBUG

When to use: Detailed diagnostics — full prompts, retrieved documents, model parameters, embedding scores (disabled in production unless sampling)

{"level":"debug","traceId":"abc","event":"rag.results","query":"refund policy","topScore":0.92,"chunks":["...truncated"]}

Implementation with Pino

Pino is one of the fastest structured logging libraries for Node.js, with very low overhead in production. Use child loggers to propagate trace context without manually passing IDs everywhere.

Structured logging with Pino
import pino from "pino";

// Create a base logger with default fields
const baseLogger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  formatters: {
    level: (label) => ({ level: label }),
  },
  base: {
    service: "my-agent",
    version: "1.0.0",
    environment: process.env.NODE_ENV,
  },
});

// Create a request-scoped logger with trace context
export function createRequestLogger(traceId: string, userId?: string) {
  return baseLogger.child({
    traceId,
    userId,
    requestStartedAt: Date.now(),
  });
}

// Usage in agent code
async function handleRequest(input: string) {
  const traceId = crypto.randomUUID();
  const log = createRequestLogger(traceId, "usr_42");

  log.info({
    event: "agent.request.start",
    inputLength: input.length,
    inputPreview: input.slice(0, 100),
  });

  const llmStart = performance.now();
  const response = await llm.generate({
    model: "gpt-4o",
    messages: [{ role: "user", content: input }],
  });

  log.info({
    event: "llm.call.complete",
    model: "gpt-4o",
    latencyMs: Math.round(performance.now() - llmStart),
    inputTokens: response.usage.prompt_tokens,
    outputTokens: response.usage.completion_tokens,
    totalTokens: response.usage.total_tokens,
    finishReason: response.choices[0].finish_reason,
  });

  if (response.usage.total_tokens > 10000) {
    log.warn({
      event: "llm.high_token_usage",
      totalTokens: response.usage.total_tokens,
      threshold: 10000,
    });
  }

  return response;
}

Correlation IDs with AsyncLocalStorage

The hardest part of logging is propagating trace context through deeply nested function calls. Node.js's AsyncLocalStorage solves this — any function in the async call stack can access the current request's trace context without explicit parameter passing.

Automatic context propagation
// Correlation: connecting logs across services
import { AsyncLocalStorage } from "node:async_hooks";

interface RequestContext {
  traceId: string;
  spanId: string;
  userId: string;
  feature: string;
}

const contextStore = new AsyncLocalStorage<RequestContext>();

// Middleware: set context for entire request lifecycle
export function withRequestContext(
  traceId: string,
  userId: string,
  feature: string,
  fn: () => Promise<void>,
) {
  const ctx: RequestContext = {
    traceId,
    spanId: crypto.randomUUID().slice(0, 8),
    userId,
    feature,
  };

  return contextStore.run(ctx, fn);
}

// Logger automatically includes request context
export function getLogger() {
  const ctx = contextStore.getStore();
  if (!ctx) return baseLogger;

  return baseLogger.child({
    traceId: ctx.traceId,
    spanId: ctx.spanId,
    userId: ctx.userId,
    feature: ctx.feature,
  });
}

// Any function in the call stack can get a correlated logger
async function searchDocuments(query: string) {
  const log = getLogger(); // Automatically has traceId, userId, etc.
  log.info({ event: "rag.search.start", query });

  const results = await vectorDB.search(query);
  log.info({
    event: "rag.search.complete",
    resultCount: results.length,
    topScore: results[0]?.score,
  });

  return results;
}

What to Log (and What Not To)

Always Log

  • Request start/end with latency
  • LLM calls: model, tokens, latency, finish reason
  • Tool calls: name, latency, success/failure
  • Errors: type, message, retry count
  • Cost: per-request calculated cost
  • Decisions: why a tool or model was selected

Never Log (or log with caution)

  • Full user messages (PII risk — truncate or hash)
  • Full LLM outputs (storage cost — log length only)
  • API keys or auth tokens
  • Full RAG document contents (log metadata only)
  • Full system prompts (version reference instead)
4

Metrics & KPIs for Agents

Latency, token usage, success rates, cost

Agent metrics go far beyond response time and error rate. You need to track latency distributions, token consumption, per-request cost, tool call accuracy, and response quality — all broken down by model, feature, and user segment.

The goal is to answer three questions at any time: Is the agent working correctly? How much is it costing? Where are the bottlenecks?

Latency

Time to First Token (TTFT)

How long until the user sees the first token of the response. Critical for perceived performance in streaming UIs.

Target: < 500ms for streaming, < 2s for non-streaming

Total Response Latency

End-to-end time from user input to complete response. Includes planning, tool calls, LLM generation, and post-processing.

Target: p50 < 2s, p95 < 5s, p99 < 15s

Tool Call Latency

Time for each tool execution. Slow tools dominate agent latency since LLM calls block on tool results.

Target: p95 < 1s per tool call
Token Usage

Input Tokens per Request

How much context the model receives. Tracks system prompt size, conversation history, RAG documents, and tool definitions. Directly correlates with cost.

Target: Monitor trend — should not grow unbounded

Output Tokens per Request

How verbose the model's response is. High output tokens may indicate the model is over-explaining or generating unnecessary content.

Target: Model-specific, typically 200-2000 per response

Token Efficiency Ratio

Useful output tokens / total input tokens consumed. A falling ratio means you're stuffing more context for the same quality of output.

Target: > 0.1 (output is at least 10% of the input tokens consumed)
Cost

Cost per Request

Total LLM API cost for a single user request, including all LLM calls, retries, and sub-agent invocations.

Target: Define per feature — e.g., simple Q&A < $0.01, complex analysis < $0.10

Cost per Conversation

Total cost across all messages in a conversation. Rises with conversation length due to growing context windows.

Target: Set per-conversation budget caps

Cost per User (daily/monthly)

Aggregated cost attributed to individual users. Identifies power users and potential abuse.

Target: Set per-user daily limits with automatic throttling
Quality & Reliability

Success Rate (verified)

Percentage of requests that produce a verified-correct response. Not just 'didn't throw an error' — use eval checks for actual correctness.

Target: > 95% for production systems

Tool Call Accuracy

Percentage of tool calls where the agent selected the correct tool with valid arguments. Wrong tool selection is a major failure mode.

Target: > 98% correct tool selection

Hallucination Rate

Percentage of responses flagged by automated eval checks as containing fabricated information. Requires running evals on sampled production traffic.

Target: < 2% for factual queries

Retry Rate

Percentage of requests that required at least one retry (LLM or tool). High retry rates indicate instability or poor prompt design.

Target: < 5% of requests require retries

Why Percentiles, Not Averages

An average latency of 1.5 seconds sounds fine. But if your p99 is 25 seconds, 1 in 100 users waits 25 seconds. With 10,000 daily requests, that's 100 terrible experiences per day. Percentiles (p50, p95, p99) reveal the full distribution. Use histogram metric types to compute percentiles efficiently.

Typical Agent Latency Budget (3-second target)

Query preprocessing
90ms
RAG retrieval
300ms
Context assembly
60ms
LLM generation
1,800ms
Tool execution
510ms
Post-processing
90ms
Network/overhead
150ms

LLM generation dominates at 60% of total latency. Optimize the prompt/context size first — a 50% reduction in input tokens can reduce LLM latency by 30-40%.

Implementation with OpenTelemetry Metrics

Agent metrics collection
import { metrics } from "@opentelemetry/api";

const meter = metrics.getMeter("agent-metrics");

// Latency histograms (use buckets that match your SLOs)
const requestLatency = meter.createHistogram("agent.request.latency_ms", {
  description: "End-to-end request latency",
  unit: "ms",
});

const ttft = meter.createHistogram("agent.ttft_ms", {
  description: "Time to first token",
  unit: "ms",
});

const toolLatency = meter.createHistogram("agent.tool.latency_ms", {
  description: "Individual tool call latency",
  unit: "ms",
});

// Token counters
const inputTokens = meter.createHistogram("agent.tokens.input", {
  description: "Input tokens per request",
});

const outputTokens = meter.createHistogram("agent.tokens.output", {
  description: "Output tokens per request",
});

// Cost gauge
const requestCost = meter.createHistogram("agent.cost_usd", {
  description: "Cost per request in USD",
});

// Quality counters
const requestCounter = meter.createCounter("agent.requests.total", {
  description: "Total requests by status",
});

const toolCallCounter = meter.createCounter("agent.tool_calls.total", {
  description: "Tool calls by tool and status",
});

// Recording metrics in agent code
async function handleRequest(input: string, context: RequestContext) {
  const start = performance.now();
  const labels = {
    model: context.model,
    feature: context.feature,
  };

  try {
    const result = await agent.execute(input, context);
    const latencyMs = performance.now() - start;

    requestLatency.record(latencyMs, { ...labels, status: "success" });
    inputTokens.record(result.usage.inputTokens, labels);
    outputTokens.record(result.usage.outputTokens, labels);
    requestCost.record(result.estimatedCost, labels);
    requestCounter.add(1, { ...labels, status: "success" });

    if (result.ttftMs) {
      ttft.record(result.ttftMs, labels);
    }

    for (const tool of result.toolCalls) {
      toolCallCounter.add(1, {
        tool: tool.name,
        success: String(tool.success),
      });
      toolLatency.record(tool.latencyMs, { tool: tool.name });
    }

    return result;
  } catch (error) {
    requestLatency.record(performance.now() - start, { ...labels, status: "error" });
    requestCounter.add(1, { ...labels, status: "error" });
    throw error;
  }
}
5

Debugging Agent Failures

Root cause analysis for non-deterministic systems

Debugging AI agents is fundamentally different from debugging traditional software. There are no stack traces for wrong answers, no breakpoints for hallucinations. The root cause often lies in what the model saw, not in the code that ran.

Non-determinism makes reproduction difficult — the same input can produce different outputs on each run. Without captured execution context, you cannot reliably investigate failures. This is why observability is a prerequisite for debugging, not an afterthought.

The Four Categories of Agent Failure

Every agent failure falls into one of these categories. Classifying the failure type is the first step to fixing it — each category has a different debugging approach and different fix.

Model Failures

The LLM itself produces incorrect output despite receiving good context. Includes hallucinations, reasoning errors, instruction non-compliance, and format violations.

How to Debug

  1. Compare the model's input (full context window) against its output
  2. Check if the answer contradicts information in the context
  3. Verify that instructions in the system prompt were followed
  4. Test with a different model to isolate model-specific issues
Example: Model claims the refund policy allows 90-day returns, but the RAG document clearly states 30 days. The context was correct; the model hallucinated.
Context Failures

The model produced a reasonable output given what it saw, but it saw the wrong information. Includes missing documents, irrelevant RAG results, stale data, and context overflow.

How to Debug

  1. Inspect the exact context window contents at the time of generation
  2. Verify RAG retrieval quality — were the right documents retrieved?
  3. Check token counts — was the context window overflowing?
  4. Look for contradictions between retrieved documents
Example: Agent gives outdated pricing. Trace shows the RAG system retrieved a 2023 pricing document instead of the current 2025 version. The model answered correctly based on what it saw.
Tool Failures

A tool returns incorrect data, times out, or fails silently. The agent continues with bad or missing data, producing a subtly wrong response.

How to Debug

  1. Check the tool span: what input did the tool receive? What did it return?
  2. Look for empty or null tool responses that weren't flagged as errors
  3. Verify tool response latency — did it time out?
  4. Compare tool output against ground truth data
Example: Agent says 'you have no open orders' because the database tool returned an empty array due to a permissions error it didn't surface. The trace shows tool.success=true but results=0.
Orchestration Failures

The agent's planning or routing logic made a poor decision. Wrong tool selected, unnecessary steps taken, infinite loops, or premature termination.

How to Debug

  1. Examine the plan span: what strategy did the agent choose and why?
  2. Count the number of steps — is it reasonable for this query?
  3. Look for repeated tool calls that suggest a retry loop
  4. Check if the agent terminated before completing all necessary steps
Example: Agent routes a complex billing question to the FAQ search tool instead of the account lookup tool, returning a generic answer instead of the user's specific billing data.

The 6-Step Debugging Workflow

1. Start from the trace

Pull up the trace for the failing request using its trace ID. The trace tree gives you the full execution path.

2. Identify the failing span

Walk the trace tree. Which span produced the first incorrect output? Was it a planning step, an LLM call, or a tool execution?

3. Inspect inputs and outputs

For the failing span, examine what went in (context, parameters) and what came out (response, tool result). The bug is in the gap between expected and actual.

4. Classify the failure

Is it a model failure (bad output given good context), context failure (bad context), tool failure (bad data), or orchestration failure (bad plan)?

5. Reproduce with replay

Use the stored execution context (input, messages, tool results) to replay the exact same request. If you can reproduce it, you can fix it.

6. Fix and add a regression eval

Apply the fix and add this case to your evaluation dataset. Future changes will be tested against this failure case.

Replay Debugging: Reproducing Non-Deterministic Failures

The hardest part of debugging AI agents is that you can't just "re-run it with the same input" — the model might give a different (correct) answer the second time. Replay debugging solves this by capturing and storing the exact execution context so you can feed it back to the model later.

Replay debugging: capture and replay execution context
// Store execution context for replay debugging
interface ExecutionSnapshot {
  traceId: string;
  input: string;
  systemPrompt: string;
  messages: Message[];
  toolResults: Record<string, unknown>;
  retrievedDocuments: Document[];
  modelConfig: { model: string; temperature: number };
  timestamp: number;
}

// Capture on every request (or sample in production)
async function captureSnapshot(
  traceId: string,
  context: AgentContext,
): Promise<void> {
  const snapshot: ExecutionSnapshot = {
    traceId,
    input: context.input,
    systemPrompt: context.systemPrompt,
    messages: context.messages,
    toolResults: context.toolResults,
    retrievedDocuments: context.retrievedDocs,
    modelConfig: {
      model: context.model,
      temperature: context.temperature,
    },
    timestamp: Date.now(),
  };

  // Store in debug database (TTL: 30 days)
  await debugStore.save(traceId, snapshot, { ttlDays: 30 });
}

// Replay a failing request with the exact same context
async function replayRequest(traceId: string): Promise<ReplayResult> {
  const snapshot = await debugStore.get(traceId);
  if (!snapshot) throw new Error(`No snapshot for trace ${traceId}`);

  // Reconstruct the exact context the model saw
  const replayResponse = await llm.generate({
    model: snapshot.modelConfig.model,
    temperature: snapshot.modelConfig.temperature,
    system: snapshot.systemPrompt,
    messages: snapshot.messages,
  });

  return {
    originalTraceId: traceId,
    replayTraceId: crypto.randomUUID(),
    originalOutput: snapshot.messages.at(-1)?.content,
    replayOutput: replayResponse.text,
    matched: replayResponse.text === snapshot.messages.at(-1)?.content,
  };
}

The Debugging Mindset Shift

In traditional software, you debug the code. In AI agents, you debug the context. The model is usually capable of producing the right answer — the question is whether it received the right information. Every debugging session should start with: "What did the model see when it made this decision?"

6

Observability Tools

LangSmith, Phoenix, Langfuse, Braintrust

The agent observability ecosystem has matured rapidly. From open-source platforms to managed services, there are now dedicated tools for every aspect of LLM monitoring. Here is a comprehensive comparison of the leading options.

The right tool depends on your constraints: team size, budget, existing infrastructure, and whether you need self-hosting. Most teams start with one managed platform and add OpenTelemetry as they scale.

LangSmith (LangChain)

Production monitoring and debugging platform from the LangChain team. Deep integration with LangChain and LangGraph, but works with any framework via the LangSmith SDK. Strong trace visualization with a playground for prompt iteration.

Key Features

  • Trace visualization with span-level detail
  • Prompt playground for iteration and testing
  • Dataset management for evaluations
  • Online evaluation runners
  • Human annotation queues
  • Comparison views for A/B testing prompts
Pricing: Free tier (5K traces/mo), Plus ($39/seat/mo), Enterprise (custom)
Best For: Teams already using LangChain/LangGraph, or needing strong eval integration.
Arize Phoenix (Arize AI)

Open-source observability platform built specifically for LLM applications. Strong focus on trace visualization, retrieval analysis, and evaluation. Runs locally or in the cloud. Great for debugging RAG systems.

Key Features

  • Open-source (Apache 2.0) — self-host anywhere
  • Trace and span visualization
  • Retrieval (RAG) quality analysis
  • Embedding drift detection
  • LLM evaluation framework
  • OpenTelemetry-native instrumentation
Pricing: Free (open-source self-hosted), Phoenix Cloud (free tier + paid plans)
Best For: Teams wanting open-source with RAG debugging, or needing on-premise deployment.
Langfuse

Open-source LLM observability and analytics platform. Framework-agnostic with SDKs for Python and TypeScript. Includes prompt management, cost tracking, and evaluation features. Can be self-hosted.

Key Features

  • Open-source (MIT) — full self-hosting support
  • Framework-agnostic SDK (works with any LLM provider)
  • Prompt management and versioning
  • Cost tracking with model pricing tables
  • Evaluation and scoring pipelines
  • Session and user-level analytics
Pricing: Free (self-hosted or cloud free tier), Pro ($59/mo), Team ($119/mo)
Best For: Teams wanting open-source with strong cost tracking and prompt management.
Braintrust

End-to-end platform combining evaluations, logging, and prompt playground. Strong focus on eval-driven development with built-in scoring functions. Also serves as an AI proxy for cost optimization.

Key Features

  • Eval framework with built-in scoring functions
  • Production logging and tracing
  • Prompt playground with side-by-side comparison
  • AI proxy with caching and fallbacks
  • Dataset management for golden test sets
  • CI/CD integration for automated eval runs
Pricing: Free tier, Pro ($25/seat/mo), Enterprise (custom)
Best For: Teams focused on eval-driven development and needing a unified eval + monitoring platform.
Helicone

Proxy-based observability that requires just one line of code to integrate. Routes your LLM API calls through Helicone's gateway, automatically capturing cost, latency, and usage metrics without SDK changes.

Key Features

  • One-line integration (URL swap)
  • Automatic cost and latency tracking
  • Request/response logging
  • Rate limiting and caching at the proxy layer
  • User-level analytics and cost attribution
  • Prompt threat detection
Pricing: Free tier (100K requests/mo), Growth ($100/mo), Enterprise (custom)
Best For: Teams wanting instant observability with minimal code changes.
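The "one line" in practice is a base-URL swap on an OpenAI-compatible client. A minimal sketch, assuming the gateway URL and header names from Helicone's documented pattern (`heliconeConfig` is a hypothetical helper; verify the exact values against current docs before use):

```typescript
// Hypothetical sketch: route OpenAI-compatible calls through a proxy gateway
// instead of api.openai.com. URL and header names follow Helicone's documented
// pattern but should be verified against current docs.
interface ProxyConfig {
  baseURL: string;
  headers: Record<string, string>;
}

function heliconeConfig(heliconeApiKey: string, userId?: string): ProxyConfig {
  return {
    baseURL: "https://oai.helicone.ai/v1", // the one-line change
    headers: {
      "Helicone-Auth": `Bearer ${heliconeApiKey}`,
      // Optional: attribute cost and usage to a specific user
      ...(userId ? { "Helicone-User-Id": userId } : {}),
    },
  };
}
```

Pass the resulting `baseURL` and `headers` to your existing OpenAI client constructor; no other code changes are needed.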
OpenTelemetry + Custom Stack (CNCF open standard)

Build your own observability stack using the OpenTelemetry standard. Use OTEL SDKs for instrumentation and export to any backend — Jaeger, Grafana Tempo, Honeycomb, Datadog, or custom storage. Maximum flexibility, but requires more setup.

Key Features

  • Vendor-neutral standard — no lock-in
  • Works with any observability backend
  • OpenLLMetry provides auto-instrumentation for LLM frameworks
  • Full control over what you trace and store
  • Integrates with existing infrastructure
  • Semantic conventions for GenAI (emerging standard)
Pricing: Free (open standard) — backend costs vary
Best For: Teams with existing observability infrastructure, or needing full control over data.
Quick Comparison
Dimension | LangSmith | Arize | Langfuse | Braintrust | Helicone | OpenTelemetry
Open Source | No | Yes (Apache 2.0) | Yes (MIT) | No | No | Yes (CNCF)
Self-Hostable | No | Yes | Yes | No | No | Yes
Tracing | Strong | Strong | Strong | Good | Basic | Excellent
Evals | Strong | Good | Good | Excellent | Basic | Manual
Cost Tracking | Good | Basic | Strong | Good | Excellent | Manual
Setup Effort | Low | Low | Low | Low | Minimal | High

How to Choose

  • Starting out? Pick Langfuse (open-source, generous free tier) or Helicone (one-line setup).
  • Using LangChain/LangGraph? LangSmith has the deepest integration and best trace UI for these frameworks.
  • Need evals + monitoring? Braintrust combines both in one platform, reducing tool sprawl.
  • Enterprise with existing infra? OpenTelemetry + your existing backend (Datadog, Grafana) avoids adding another vendor.
  • Privacy/compliance requirements? Langfuse or Phoenix can be self-hosted in your own infrastructure.
7

Building Dashboards

Real-time visibility into agent performance

A well-designed dashboard is the command center for your agent in production. It should answer three questions at a glance: Is it working? Is it correct? Is it affordable?

The difference between a useful dashboard and a vanity dashboard is actionability. Every panel should either tell you things are fine, or tell you exactly where to look when they are not. If a metric has never triggered an investigation, it does not belong on the dashboard.

Row 1: Health & Availability

The first row every engineer looks at. Answers: 'Is the agent working right now?' If any panel is red, investigate immediately.

Request Rate (RPM)

rate(agent.requests[5m])

A sudden drop means something is broken upstream. A spike means unexpected traffic or a retry storm.

Alert: < 50% of baseline for 5 minutes

Error Rate by Type

rate(agent.errors[5m]) by (error_type)

Not just 'errors happened' but which type: rate limits, timeouts, tool failures, or model errors. Each has a different fix.

Alert: > 5% error rate sustained for 10 minutes

Latency Distribution (p50/p95/p99)

histogram_quantile([0.5, 0.95, 0.99], agent.latency)

p50 tells you the typical experience. p99 tells you the worst experience. If p99 diverges from p50, you have outlier issues.

Alert: p99 > 15s for 10 minutes

Active Agent Sessions

count(active_sessions)

Track concurrency. If sessions spike, you may hit rate limits or exhaust your API quota.

Alert: Informational — no alert needed
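The panels above assume labeled counters behind them. Here is a minimal in-memory sketch of recording requests and errors by type; a production system would use a real metrics client (such as prom-client), and all names here are illustrative:

```typescript
// Minimal labeled counter, standing in for a real metrics client.
class Counter {
  private counts = new Map<string, number>();
  inc(labels: Record<string, string> = {}): void {
    const key = JSON.stringify(labels);
    this.counts.set(key, (this.counts.get(key) ?? 0) + 1);
  }
  get(labels: Record<string, string> = {}): number {
    return this.counts.get(JSON.stringify(labels)) ?? 0;
  }
}

const agentRequests = new Counter(); // feeds rate(agent.requests[5m])
const agentErrors = new Counter();   // feeds rate(agent.errors[5m]) by (error_type)

function recordOutcome(result: { ok: boolean; errorType?: string }): void {
  agentRequests.inc();
  if (!result.ok) {
    // Label by type so the dashboard can split rate limits, timeouts, and tool failures
    agentErrors.inc({ error_type: result.errorType ?? "unknown" });
  }
}
```

The label on the error counter is what makes the "Error Rate by Type" panel possible; an unlabeled error count can only tell you that something failed, never what.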
Row 2: Quality & Correctness

The hardest row to build but the most important. Answers: 'Are the agent's responses actually good?' Requires eval pipelines on sampled traffic.

Tool Call Success Rate by Tool

rate(agent.tool_calls{success=true}[5m]) by (tool_name)

If one specific tool is failing while others are fine, the problem is localized. Drill down by tool name.

Alert: Any tool drops below 90% success for 15 minutes

Eval-Flagged Hallucination Rate

rate(agent.eval{result=hallucination}[1h])

Automated evals on sampled production traffic detect hallucinations. A rising rate means the model or context quality is degrading.

Alert: > 3% hallucination rate over 1 hour

Completion Rate

rate(agent.completed[5m]) / rate(agent.started[5m])

What percentage of requests get a full response? Low completion means early terminations, timeouts, or context window overflows.

Alert: Below 95% completion rate

User Feedback Score

avg(agent.user_feedback) by (feature)

Thumbs up/down or star ratings from users. The ground truth signal for quality, but sparse and delayed.

Alert: Trend detection — alert on significant drops
Row 3: Cost & Efficiency

Answers: 'How much is the agent costing, and where is the money going?' Essential for budget planning and catching runaway costs before the bill arrives.

Cost per Hour by Model

sum(agent.cost_usd[1h]) by (model)

Identifies which models dominate your spend. If GPT-4o cost suddenly spikes, a prompt change may have increased token usage.

Alert: Hourly cost > 2x the 7-day average

Tokens per Request (Input vs Output)

avg(agent.tokens) by (type)

Rising input tokens = growing context. Rising output tokens = verbosity. Both directly increase cost.

Alert: Average input tokens grow > 20% week-over-week

Cost per User (Top 10)

topk(10, sum(agent.cost_usd[24h]) by (user_id))

Power users (or abuse) can dominate your spend. Set per-user budgets and catch outliers early.

Alert: Any user exceeds daily budget cap

Cost per Feature

sum(agent.cost_usd[24h]) by (feature)

Which features are expensive? A search feature using GPT-4o might be 10x costlier than one using GPT-4o-mini with no quality difference.

Alert: Feature cost exceeds allocated budget

Dashboard Design Principles

Top-to-bottom severity

Health at the top (first thing you see), quality in the middle, cost at the bottom. In an incident, you read top-down.

Every panel answers one question

If you can't articulate what question a panel answers, delete it. 'What is our current error rate by type?' is good. 'Various metrics' is not.

Alert thresholds visible on panels

Draw horizontal lines on charts at your SLO thresholds. This makes it instantly clear when a metric is approaching danger.

Time range consistency

All panels should use the same time range. A mismatch between a 5-minute error rate and a 24-hour cost total creates confusion.

Drill-down links on every panel

Clicking an error rate panel should take you to filtered traces for those errors. Dashboards are for detection; traces are for investigation.

No more than 3 dashboards per team

An overview dashboard, a debugging dashboard, and a cost dashboard. More than that and nobody looks at any of them.

Example: Agent Health Overview

Requests/min: 142 | Error Rate: 1.8% | p50 Latency: 1.2s | p99 Latency: 8.4s

Requests/min over the last 30 minutes: 14:00: 130 | 14:05: 145 | 14:10: 142 | 14:15: 155 | 14:20: 138 | 14:25: 148

A healthy dashboard at a glance: stable request rate, low error rate, acceptable p50, and a p99 that deserves investigation.

The One-Dashboard Rule

If your team has to look at more than one dashboard to answer "is the agent healthy?", you have too many dashboards. Start with a single overview that covers health, quality, and cost. Only split into separate dashboards when the single view becomes too crowded to be useful at a glance.

8

Production Monitoring

Alerts, anomaly detection, and incident response

Production monitoring for AI agents requires a fundamentally different approach than traditional services. Agents fail in subtle ways — wrong answers, slowly degrading quality, invisible cost creep — that standard uptime checks will never catch.

The foundation is SLO-based alerting: define what "good" looks like (Service Level Objectives), measure against it continuously, and alert based on how fast you are consuming your error budget — not on individual errors.

SLO-Based Alerting

Agent Response Quality SLO

Target: 99.5% of requests complete successfully in < 10 seconds

Error budget: 0.5% = ~3.6 hours of downtime per month

Window | Burn Rate | Action | Severity
1 hour | 14.4x | Page on-call | critical
6 hours | 6x | Slack alert | warning
24 hours | 3x | Ticket | info

A fast burn (14.4x) will exhaust the entire monthly error budget in about two days (720 hours / 14.4 ≈ 50 hours) if it continues, so it pages immediately. A slow burn (6x) exhausts the budget in about five days, which warrants a Slack alert rather than a page.
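The table above reduces to a small amount of arithmetic. A sketch using the 99.5% SLO from this example; how you fetch windowed error rates from your metrics store is up to you:

```typescript
// Burn-rate math behind the multiwindow alerting table.
const SLO_TARGET = 0.995;
const ERROR_BUDGET = 1 - SLO_TARGET; // 0.5% of requests may fail per month

type Severity = "critical" | "warning" | "info" | "ok";

function burnRate(observedErrorRate: number): number {
  return observedErrorRate / ERROR_BUDGET;
}

// Fast burn pages, slow burn warns, slower burn files a ticket.
function severityFor(rate1h: number, rate6h: number, rate24h: number): Severity {
  if (burnRate(rate1h) >= 14.4) return "critical"; // budget gone in ~2 days
  if (burnRate(rate6h) >= 6) return "warning";     // budget gone in ~5 days
  if (burnRate(rate24h) >= 3) return "info";       // budget gone in ~10 days
  return "ok";
}
```

For example, an 8% error rate over the last hour is a 16x burn against a 0.5% budget, which crosses the 14.4x critical threshold.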

Alert Severity Tiers

The most important rule in alerting: if everything is critical, nothing is critical. Tier your alerts by severity, and make sure each tier has a clear response expectation and routing.

Critical (response: < 5 minutes)

Action: Page on-call engineer immediately

Example Triggers

  • Error budget burn rate > 14.4x monthly budget (1-hour window)
  • Zero successful requests for 5+ minutes (total outage)
  • p99 latency exceeds 30 seconds for 10+ minutes
  • LLM API returning 500s for all requests
Warning (response: < 1 hour during business hours)

Action: Notify team via Slack, investigate during business hours

Example Triggers

  • Error budget burn rate > 6x monthly budget (6-hour window)
  • Hourly cost exceeds 2x the 7-day average
  • Hallucination rate above 3% (detected via evals)
  • Token usage trending upward 20%+ week-over-week
Info (response: review at next standup)

Action: Log for review, include in weekly reports

Example Triggers

  • New model version deployed successfully
  • Daily cost report summary
  • Eval scores trending above baseline
  • Rate limit approached but not exceeded

Common Anomaly Patterns in Agent Systems

These are the subtle failure modes that simple threshold alerts miss. Each requires a different detection strategy — often comparing current behavior against historical baselines rather than fixed thresholds.

Token Usage Drift

Average input tokens per request gradually increasing over days/weeks. Usually caused by growing conversation histories, expanding system prompts, or RAG returning more chunks.

Detection

Compare 7-day rolling average against 30-day baseline. Alert on > 20% increase.

Impact

Cost increases linearly. Quality may degrade as context window fills. Eventually hits model limits.
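The detection rule above is a two-line comparison once you have the daily averages. A sketch, with the data wiring left illustrative:

```typescript
// Detect token-usage drift: compare a 7-day rolling average against a
// 30-day baseline and flag increases above a threshold (20% per the text).
function average(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function tokenDriftDetected(
  last7dAvgTokens: number[],  // daily average input tokens, last 7 days
  last30dAvgTokens: number[], // daily average input tokens, last 30 days
  threshold = 0.2,
): boolean {
  const recent = average(last7dAvgTokens);
  const baseline = average(last30dAvgTokens);
  if (baseline <= 0) return false;
  return (recent - baseline) / baseline > threshold;
}
```

A baseline comparison catches gradual drift that a fixed threshold misses: 1,300 tokens per request is unremarkable in isolation but alarming against a 1,000-token baseline.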

Latency Bimodality

Latency distribution develops two peaks instead of one. Some requests complete in 1-2s while others take 10-15s. Often caused by tool timeouts or cache miss patterns.

Detection

Monitor the gap between p50 and p95. When p95 > 5x p50, investigate.

Impact

Inconsistent user experience. Slow requests may timeout at the client level.

Silent Quality Degradation

Success rate (non-error) stays high but eval-measured quality drops. The model is returning answers that don't throw errors but are subtly wrong or less helpful.

Detection

Run automated evals on 5-10% of production traffic. Track eval scores over time.

Impact

Users lose trust gradually. By the time you notice from feedback, many users have churned.

Cost Spike from Retry Storms

A tool or model intermittently fails, triggering retry logic. Each retry makes a full LLM call, multiplying cost. A single user request can generate 10-20 LLM calls.

Detection

Track retry count per request. Alert when any request exceeds 3 retries.

Impact

5-20x cost increase per affected request. Can exhaust rate limits, cascading to other users.
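Detecting a retry storm only requires counting retries per trace. A minimal sketch, with the retry limit of 3 taken from the text:

```typescript
// Count retries per trace and flag requests whose retries exceed the limit.
class RetryMonitor {
  private retries = new Map<string, number>();

  recordRetry(traceId: string): void {
    this.retries.set(traceId, (this.retries.get(traceId) ?? 0) + 1);
  }

  retryCount(traceId: string): number {
    return this.retries.get(traceId) ?? 0;
  }

  // A request beyond maxRetries is multiplying LLM cost with every attempt.
  isRetryStorm(traceId: string, maxRetries = 3): boolean {
    return this.retryCount(traceId) > maxRetries;
  }
}
```

Call `recordRetry` from your retry wrapper and alert on `isRetryStorm`; pairing this with a circuit breaker stops the cost multiplication at the source.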

Agent-Specific Incident Response

Every alert needs a runbook. For AI agents, runbooks must include checking the LLM provider status, examining recent prompt/model changes, and verifying context quality — not just checking server health.

Incident playbooks and automated health checks
// Agent-specific incident response playbook
interface IncidentPlaybook {
  symptom: string;
  severity: "critical" | "warning";
  steps: string[];
  escalation: string;
}

const playbooks: IncidentPlaybook[] = [
  {
    symptom: "Error rate exceeds SLO",
    severity: "critical",
    steps: [
      "Check LLM provider status page (OpenAI, Anthropic)",
      "Review error type breakdown — is it rate limits, timeouts, or 500s?",
      "If rate limits: enable fallback model routing",
      "If timeouts: check tool service health",
      "If 500s: check for recent deployment (prompt/model change)",
      "If no provider issue: check recent code/config changes",
    ],
    escalation: "If unresolved in 15 minutes, escalate to platform team",
  },
  {
    symptom: "Cost anomaly detected (2x+ hourly average)",
    severity: "warning",
    steps: [
      "Identify which model/feature is driving the spike",
      "Check for retry storms: look for requests with retries > 3",
      "Check for token usage spikes: was a system prompt changed?",
      "Check for traffic spikes: is it legitimate traffic or abuse?",
      "If abuse: enable per-user rate limiting",
      "If retry storm: fix the failing tool/model and reset",
    ],
    escalation: "If cost exceeds $500/hour, page finance + engineering lead",
  },
  {
    symptom: "Hallucination rate spike",
    severity: "warning",
    steps: [
      "Check if a new model version was deployed",
      "Verify RAG retrieval quality — are correct docs being returned?",
      "Check system prompt: was it recently modified?",
      "Review recent eval results for regression patterns",
      "If model change: rollback to previous model version",
      "If context issue: investigate RAG pipeline health",
    ],
    escalation: "If affects > 5% of requests, escalate to AI/ML team",
  },
];

// Automated incident detection
async function checkAgentHealth(
  metricsClient: MetricsClient,
): Promise<IncidentAlert[]> {
  const alerts: IncidentAlert[] = [];
  const now = Date.now();

  // Check error budget burn rate
  const errorRate1h = await metricsClient.query(
    "rate(agent.errors[1h]) / rate(agent.requests[1h])"
  );
  const monthlyBudget = 0.01; // 99% SLO = 1% error budget
  if (errorRate1h > monthlyBudget * 14.4) {
    alerts.push({
      type: "error_budget_fast_burn",
      severity: "critical",
      message: `Error budget burning at ${(errorRate1h / monthlyBudget).toFixed(1)}x monthly rate`,
      runbook: "https://wiki/runbooks/agent-error-budget",
      timestamp: now,
    });
  }

  // Check cost anomaly
  const hourlyCost = await metricsClient.query("sum(agent.cost_usd[1h])");
  const avgHourlyCost7d = await metricsClient.query(
    "avg_over_time(sum(agent.cost_usd[1h])[7d:])"
  );
  if (hourlyCost > avgHourlyCost7d * 2) {
    alerts.push({
      type: "cost_anomaly",
      severity: "warning",
      message: `Hourly cost $${hourlyCost.toFixed(2)} is ${(hourlyCost / avgHourlyCost7d).toFixed(1)}x the 7-day average`,
      runbook: "https://wiki/runbooks/agent-cost",
      timestamp: now,
    });
  }

  return alerts;
}

The Post-Incident Review Question

Traditional post-mortems ask "what code failed?" Agent post-incident reviews need additional questions: What context did the agent see? Was it a model failure or a context failure? Would better observability have caught this earlier? Always add the failing case to your eval dataset.

9

Cost Tracking & Optimization

Know where every token dollar goes

LLM costs are the most unpredictable line item in your infrastructure budget. Unlike traditional compute where costs scale linearly with traffic, agent costs can spike 10-50x based on prompt changes, retry loops, or context window growth — often without anyone noticing until the bill arrives.

The solution is per-request cost attribution: know exactly how much every request costs, attribute it to a user, feature, and model, and set alerts before spending spirals. You need to answer "where is every token dollar going?" at any time.

Typical Agent Cost Breakdown

LLM Input Tokens
35%
LLM Output Tokens
25%
Retries & Fallbacks
15%
Embedding Generation
10%
Tool API Calls
10%
Infrastructure
5%

LLM tokens (input + output) dominate at 60%. Output tokens are particularly expensive — 3-5x the cost of input tokens for most models. Retries add 15% on average; optimizing retry logic has outsized cost impact.

Model Pricing Reference (per 1M tokens)
Model | Input | Output | Best For
GPT-4o | $2.50 | $10.00 | Complex reasoning, coding
GPT-4o-mini | $0.15 | $0.60 | Simple tasks, classification
Claude Sonnet 4 | $3.00 | $15.00 | Complex analysis, coding
Claude Haiku 3.5 | $0.80 | $4.00 | Quick tasks, routing
Gemini 2.0 Flash | $0.10 | $0.40 | High-volume, simple

Output tokens cost 3-5x more than input tokens. A 1,000-token response from GPT-4o costs $0.01 — but that's $10K/month at 1M requests. Pricing as of early 2025; check provider sites for current rates.

Cost Optimization Strategies

These five strategies, applied together, can reduce agent costs by 60-80% without meaningful quality degradation. Start with model routing — it typically has the highest ROI.

Model Routing by Complexity (40-70% cost reduction)

Not every request needs GPT-4o. Route simple queries (FAQ, classification, extraction) to cheaper models and reserve expensive models for complex reasoning. Use a lightweight classifier to determine complexity.

Implementation Steps

  1. Build a complexity classifier (can be rule-based or a cheap LLM call)
  2. Route simple queries to GPT-4o-mini or Haiku ($0.15-0.80/MTok vs $2.50-3.00/MTok)
  3. Route complex queries to GPT-4o or Claude Sonnet
  4. Track quality metrics per route to ensure cheap models aren't degrading output
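Step 1's classifier can start as simple rules. A hedged sketch; the heuristics and model names are illustrative starting points, not a tuned policy:

```typescript
// Rule-based complexity classifier for model routing. Tune the word limit
// and keyword list against your own traffic before trusting it.
type Route = { model: string; reason: string };

function routeByComplexity(query: string): Route {
  const words = query.trim().split(/\s+/).length;
  const complexSignals = /\b(why|explain|compare|analyze|debug|refactor|prove)\b/i;
  if (words <= 20 && !complexSignals.test(query)) {
    return { model: "gpt-4o-mini", reason: "short query, no reasoning keywords" };
  }
  return { model: "gpt-4o", reason: "long query or reasoning keywords present" };
}
```

Log the `reason` alongside each request so you can audit misroutes, and graduate to a cheap LLM classifier once the rules start accumulating exceptions.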
Prompt Caching (50-90% reduction for repeated prefixes)

If your system prompt and tool definitions are identical across requests, prompt caching lets you pay for those tokens once instead of every request. Both Anthropic and OpenAI offer native prompt caching.

Implementation Steps

  1. Structure prompts with static prefix (system prompt, tools) and dynamic suffix (user input)
  2. Enable prompt caching on your LLM provider (automatic for OpenAI, explicit for Anthropic)
  3. Monitor cache hit rate — should be > 80% for repetitive workloads
  4. Keep the static prefix stable — changing it invalidates the cache
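As a sketch of the prompt structure this requires, here is a request body with a cached static prefix. The `cache_control` field follows Anthropic's documented Messages API pattern, but the helper and values are illustrative; check current docs for cache lifetime and minimum cacheable prompt length:

```typescript
// Static prefix (system prompt + tools) marked for caching; dynamic user input last.
function buildCachedRequest(systemPrompt: string, tools: object[], userInput: string) {
  return {
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    system: [
      {
        type: "text",
        text: systemPrompt,
        // Cache the static prefix; it is reused across requests as long as
        // the prefix stays byte-identical.
        cache_control: { type: "ephemeral" },
      },
    ],
    tools,
    messages: [{ role: "user", content: userInput }],
  };
}
```

Note the ordering: anything after the cache breakpoint (the user message) can vary freely without invalidating the cached prefix.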
Context Window Optimization (20-50% cost reduction)

Every token in the context window costs money. Reduce input tokens by compressing conversation history, limiting RAG results, and trimming tool definitions to only what's needed for each request.

Implementation Steps

  1. Implement conversation compaction at 60-80% of context window capacity
  2. Limit RAG retrieval to top-3 chunks with minimum relevance threshold of 0.7
  3. Use dynamic tool selection to include only relevant tool definitions
  4. Compress tool outputs — return structured summaries instead of raw API responses
Request-Level Budget Caps (prevents runaway costs entirely)

Set maximum token and cost budgets per request. If the agent exceeds the budget (e.g., from a retry loop), terminate gracefully instead of continuing to burn tokens.

Implementation Steps

  1. Set per-request token limits (e.g., max 50K input tokens, max 4K output tokens)
  2. Set per-request cost caps (e.g., $0.50 max per request)
  3. Implement circuit breakers: stop retrying after 3 failures
  4. Alert when any request exceeds 5x the average cost
Output Token Management (15-30% cost reduction)

Output tokens cost 3-5x more than input tokens for most models. Instruct the model to be concise, use max_tokens limits, and avoid unnecessary reasoning in the response.

Implementation Steps

  1. Set max_tokens to a reasonable limit for each use case
  2. Include 'be concise' instructions in the system prompt
  3. For internal agent steps (planning, tool selection), use shorter max_tokens
  4. Only allow verbose output for user-facing final responses

Per-Request Budget Enforcement

Wrap your LLM client with a cost tracker that calculates cost on every call and enforces budget limits. This catches runaway requests before they burn through your budget.

Cost tracking with per-request budget enforcement
// Comprehensive cost tracking and budget management
const MODEL_COSTS_PER_MTOK: Record<string, { input: number; output: number }> = {
  "gpt-4o": { input: 2.50, output: 10.00 },
  "gpt-4o-mini": { input: 0.15, output: 0.60 },
  "claude-sonnet-4-20250514": { input: 3.00, output: 15.00 },
  "claude-haiku-3.5": { input: 0.80, output: 4.00 },
};

interface CostTracker {
  traceId: string;
  userId: string;
  feature: string;
  totalCostUsd: number;
  breakdown: CostEntry[];
}

interface CostEntry {
  model: string;
  inputTokens: number;
  outputTokens: number;
  costUsd: number;
  step: string;
}

function calculateCost(
  model: string,
  inputTokens: number,
  outputTokens: number,
): number {
  const pricing = MODEL_COSTS_PER_MTOK[model];
  if (!pricing) throw new Error(`Unknown model: ${model}`);

  return (
    (inputTokens / 1_000_000) * pricing.input +
    (outputTokens / 1_000_000) * pricing.output
  );
}

async function runAgentWithBudget(
  input: string,
  context: RequestContext,
  budgetUsd: number = 0.50,
): Promise<AgentResult> {
  const tracker: CostTracker = {
    traceId: context.traceId,
    userId: context.userId,
    feature: context.feature,
    totalCostUsd: 0,
    breakdown: [],
  };

  const wrappedLlm = {
    async generate(params: LlmParams): Promise<LlmResponse> {
      const response = await llm.generate(params);
      const cost = calculateCost(
        params.model,
        response.usage.inputTokens,
        response.usage.outputTokens,
      );

      tracker.totalCostUsd += cost;
      tracker.breakdown.push({
        model: params.model,
        inputTokens: response.usage.inputTokens,
        outputTokens: response.usage.outputTokens,
        costUsd: cost,
        step: params.step ?? "unknown",
      });

      // Budget enforcement
      if (tracker.totalCostUsd > budgetUsd) {
        log.warn({
          event: "budget.exceeded",
          traceId: context.traceId,
          spent: tracker.totalCostUsd,
          budget: budgetUsd,
        });
        throw new BudgetExceededError(tracker);
      }

      return response;
    },
  };

  try {
    const result = await agent.execute(input, { ...context, llm: wrappedLlm });
    // Record cost metrics
    await metrics.recordCost(tracker);
    return result;
  } catch (error) {
    await metrics.recordCost(tracker);
    throw error;
  }
}

The Cost Observability Stack

At minimum, you need three things: (1) Per-request cost calculation using model pricing tables and actual token counts. (2) Cost attribution by user, feature, and model in your metrics system. (3) Budget alerts at the request level ($0.50 max), daily level (per-user caps), and monthly level (feature budgets). Without all three, you are flying blind on the fastest-growing line item in your infrastructure budget.

10

Interactive Examples

See observability patterns in action with live code

See agent observability in action. Each example contrasts a bad pattern with its observable fix.

Logging: Agent Logging

Capturing agent execution details

Unstructured console.log
// BAD: Unstructured logging that's impossible to query
async function runAgent(input: string) {
  console.log("Starting agent...");
  console.log("Input: " + input);

  const response = await llm.generate({ messages: [{ role: "user", content: input }] });
  console.log("Got response: " + response.text);

  if (response.toolCalls.length > 0) {
    console.log("Tool calls: " + JSON.stringify(response.toolCalls));
    for (const call of response.toolCalls) {
      const result = await executeTool(call);
      console.log("Tool result: " + JSON.stringify(result));
    }
  }

  console.log("Done!");
  return response;
}

Why this fails

Unstructured console.log produces flat text that can't be queried, aggregated, or correlated. When something fails in production at 3 AM, you're grepping through megabytes of text with no way to filter by request, trace, or severity.
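For contrast, here is a dependency-free sketch of the structured alternative. It stands in for a library like Pino; the field names and helper are illustrative, but the shape (JSON entries, one trace ID shared by every line of a request) is the point:

```typescript
import { randomUUID } from "node:crypto";

// GOOD (sketch): every log entry is a JSON object carrying the same traceId,
// so all lines for one request can be filtered and correlated.
type LogFn = (event: string, data?: Record<string, unknown>) => void;

function makeLogger(base: Record<string, unknown>): { info: LogFn; entries: object[] } {
  const entries: object[] = [];
  const info: LogFn = (event, data = {}) => {
    const entry = { ts: new Date().toISOString(), event, ...base, ...data };
    entries.push(entry);
    console.log(JSON.stringify(entry)); // one JSON object per line
  };
  return { info, entries };
}

async function runAgentStructured(input: string): Promise<void> {
  // Child-logger-style context: the traceId rides along on every entry.
  const log = makeLogger({ traceId: randomUUID() });
  log.info("agent.start", { inputLength: input.length });
  // ... llm.generate and tool calls go here, each logged with the same traceId ...
  log.info("agent.complete");
}
```

With this shape, "show me everything that happened in request X" is a single filter on `traceId` instead of a grep through interleaved text.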

All Examples Quick Reference

Logging

Agent Logging

Capturing agent execution details

Tracing

Distributed Tracing

Tracing multi-step agent executions

Metrics

Agent Metrics Collection

Tracking the right KPIs for agent performance

Debugging

Error Context Capture

Capturing enough context to debug failures

Cost

Cost Attribution

Tracking token costs per feature and user

Dashboards

Dashboard Design

Building actionable observability dashboards

Monitoring

Intelligent Alerting

Alerts that wake you up for the right reasons

11

Anti-Patterns & Failure Modes

Log blindness, metric vanity, and how to avoid them

Knowing what not to do is as important as knowing what to do. These are the most common observability anti-patterns in AI agent systems — each one has been observed in production systems and leads to predictable failure modes.

Critical: Log Blindness

Agent runs in production with no structured logging, making it impossible to understand what happened after the fact.

Cause

Using console.log with unstructured strings, or not logging at all. No correlation IDs between related log entries. Logging the wrong things (raw prompts instead of metadata).

Symptom

When a user reports a bad response, the team spends hours trying to reproduce it. Debugging requires reading raw log files with grep. No way to answer 'what did the agent see when it made this decision?'

Fix

Implement structured logging from day one. Every log entry should be JSON with a traceId, event type, timestamp, and relevant metadata. Use a logging library like Pino or Winston with child loggers for context propagation.

Critical: Trace Fragmentation

Multi-step agent executions have no parent-child relationships, making it impossible to reconstruct the full execution path.

Cause

Each LLM call and tool execution is logged independently without a shared trace ID. When the agent calls a tool that calls another service, the trace breaks.

Symptom

You can see individual LLM calls but can't connect them into a full agent execution. Debugging a 5-step agent requires manually correlating timestamps across different log streams.

Fix

Use distributed tracing (OpenTelemetry) with span hierarchies. Every agent execution gets a root span, and each LLM call, tool execution, and sub-agent invocation gets a child span with the parent's trace context propagated.
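To make the fix concrete, here is a self-contained toy of the span model that OpenTelemetry formalizes: every unit of work records the shared trace ID and its parent span ID, so the full execution path can be reconstructed. In production, use the real OTEL SDK rather than this sketch:

```typescript
import { randomUUID } from "node:crypto";

// Toy span model: one traceId per request, parentSpanId links children to parents.
interface Span {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  name: string;
  startMs: number;
  endMs?: number;
}

class Trace {
  readonly spans: Span[] = [];
  readonly traceId = randomUUID();

  startSpan(name: string, parent?: Span): Span {
    const span: Span = {
      traceId: this.traceId,
      spanId: randomUUID(),
      parentSpanId: parent?.spanId,
      name,
      startMs: Date.now(),
    };
    this.spans.push(span);
    return span;
  }

  end(span: Span): void {
    span.endMs = Date.now();
  }
}

// Usage: root span for the agent run, child spans for each step.
const t = new Trace();
const root = t.startSpan("agent.run");
const llm = t.startSpan("llm.generate", root);
t.end(llm);
const tool = t.startSpan("tool.search", root);
t.end(tool);
t.end(root);
```

Because every span carries the same `traceId` and a `parentSpanId`, a five-step agent run is a tree you can walk, not five disconnected log lines to correlate by timestamp.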

High: Metric Vanity

Dashboards show impressive numbers (total requests, average latency, success rate) that never reveal actual problems.

Cause

Tracking averages instead of percentiles. Counting 'success' as 'didn't throw an error' rather than 'produced a correct response'. No breakdown by feature, model, or user segment.

Symptom

Dashboard shows 99% success rate and 500ms average latency while users complain about slow, wrong answers. P99 latency is actually 15 seconds. 'Successful' responses include hallucinated answers.

Fix

Replace averages with percentile distributions (p50, p95, p99). Define 'success' as verified-correct, not just non-error. Break down every metric by model, feature, user segment, and error type. If a metric can't trigger an action, remove it from the dashboard.

High: Cost Ignorance

No per-request or per-feature cost tracking. The team discovers cost problems when the monthly invoice arrives.

Cause

Token usage and model pricing aren't tracked at the request level. No cost attribution to features, users, or conversations. No budget alerts or spending limits.

Symptom

Monthly LLM bill doubles with no explanation. One verbose system prompt or retry loop silently consumes thousands of dollars. The team can't answer 'which feature costs the most?' or 'which users are most expensive?'

Fix

Calculate cost per request using model pricing tables. Attribute costs to features, users, and conversation IDs. Set per-request cost alerts (e.g., flag anything over $0.50). Implement daily budget caps per feature and user.

High: Alert Fatigue

Too many low-quality alerts cause the on-call team to mute notifications, missing critical incidents.

Cause

Alerting on every individual error instead of error rates. No severity tiers. No distinction between transient failures and systematic problems. Static thresholds instead of anomaly detection.

Symptom

200+ alerts per day, all with the same severity. On-call engineer mutes Slack channel after the first shift. A real production incident at 2 AM goes unnoticed for hours because nobody is watching.

Fix

Use SLO-based alerting with error budget burn rates. Tier alerts: info goes to logs, warnings go to Slack, only critical issues page on-call. Every alert must have a runbook. If an alert fires more than 5 times without action, fix it or delete it.

Medium: Replay Impossibility

When an agent produces a bad result, it's impossible to reproduce because the exact inputs, context, and state aren't captured.

Cause

Not storing the full message history, tool results, and retrieved documents that the agent saw at execution time. Relying on 'just re-run it' in a non-deterministic system.

Symptom

User reports a bad answer. Developer tries to reproduce it with the same input but gets a different (correct) answer. The bug can't be investigated because the original context is gone forever.

Fix

Store the full execution context (input, system prompt, messages, tool results, retrieved documents, model parameters) for every request, keyed by trace ID. Build a replay tool that can feed this exact context back to the model for deterministic debugging.

12

Best Practices Checklist

Production-ready observability guidelines

Production-ready observability guidelines distilled from SRE practices, LLM platform teams, and the observability community. These are the standards every agent system should meet before going to production.

Tracing & Logging

Assign a trace ID to every request from the start

Generate a UUID at the API gateway and propagate it through every LLM call, tool execution, and sub-agent invocation. This is the single most important observability practice.

Use structured JSON logs, never console.log

Every log entry should be a JSON object with traceId, timestamp, event type, and relevant metadata. Use libraries like Pino (Node.js) for zero-overhead structured logging.

Log decisions, not just actions

Don't just log 'called tool X'. Log why: 'selected tool X because query matched pattern Y with confidence 0.87'. Decision context is the most valuable debugging information.

Implement trace sampling in high-volume production

At scale, tracing every request is expensive. Sample 10-20% of normal traffic, but always trace 100% of errors and slow requests (tail-based sampling).
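The sampling decision itself is small. A sketch with illustrative thresholds (10% of normal traffic, 10s slow cutoff); a real tail-based sampler would make this decision after the trace completes, in the collector:

```typescript
// Tail-based sampling decision: always keep errors and slow requests,
// sample a fixed fraction of everything else.
function shouldKeepTrace(
  opts: { isError: boolean; latencyMs: number },
  sampleRate = 0.1,
  slowThresholdMs = 10_000,
  rand: () => number = Math.random, // injectable for testing
): boolean {
  if (opts.isError) return true;                      // 100% of errors
  if (opts.latencyMs >= slowThresholdMs) return true; // 100% of slow requests
  return rand() < sampleRate;                         // sampled normal traffic
}
```

Making the random source injectable keeps the policy deterministic under test, which matters once your sampling rules grow beyond three lines.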

Metrics & Dashboards

Track percentiles, not averages

P50, P95, and P99 latencies reveal what users actually experience. An average of 500ms hides the fact that 1% of requests take 30 seconds.
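A worked example of why the average hides the tail, using the nearest-rank percentile method (the sample data is illustrative):

```typescript
// Nearest-rank percentile over raw latency samples.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

// 98 fast requests plus two 30-second outliers:
const latencies: number[] = [...Array(98).fill(500), 30_000, 30_000];
const avg = latencies.reduce((a, b) => a + b, 0) / latencies.length; // 1090ms, looks fine
// percentile(latencies, 50) is 500ms; percentile(latencies, 99) is 30,000ms.
```

The average says 1.1 seconds; the p99 says two in every hundred users waited half a minute. Only the percentile view surfaces that.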

Define 'success' as verified-correct, not non-error

An LLM response that doesn't throw an error but hallucinates an answer is not a success. Use eval-based quality checks to track actual correctness.

Break down every metric by model, feature, and user segment

Aggregate metrics hide problems. A 95% success rate overall might mean 99% for simple queries and 70% for complex reasoning tasks.

Build dashboards organized by concern

Group panels into health (latency, errors, throughput), quality (correctness, hallucination rate), and cost (tokens, spend). Every panel should answer a specific operational question.

Alerting & Incident Response

Use SLO-based alerting with error budget burn rates

Alert based on how fast you're consuming your error budget, not on individual errors. A fast burn (14.4x the budget rate over a 1-hour window) pages immediately; a slow burn (6x over a 6-hour window) sends a Slack notification.

Every alert must have a runbook

An alert without a runbook is just noise. Document what the alert means, likely causes, and step-by-step investigation and remediation procedures.

Tier alert severity and routing

Info goes to logs, warnings go to Slack, only critical issues page on-call. If everything is critical, nothing is.

Conduct post-incident reviews for AI-specific failures

Traditional post-mortems need adaptation for AI systems. Include: what context did the agent see? Was it a model failure or a context failure? Would better observability have caught this earlier?

Debugging & Root Cause Analysis

Classify every failure: model, context, tool, or orchestration

The fix depends on the failure type. A hallucination (model failure) needs a different fix than missing RAG documents (context failure) or a tool timeout (tool failure). Trace-based debugging makes classification possible.

Store full execution context for replay debugging

Capture the input, system prompt, messages, tool results, and model parameters for every request (or a sample). This enables reproducing non-deterministic failures by feeding the exact same context back to the model.

Build a debugging workflow that starts from the trace

Every investigation should start with the trace ID. Pull up the trace tree, identify the failing span, inspect its inputs and outputs, classify the failure, and then fix. Never start by guessing.

Add failing cases to your eval dataset automatically

When you debug and fix a failure, add the input/expected output pair to your evaluation dataset. This prevents regression and builds a growing safety net of test cases from real production issues.
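A minimal sketch of such a growing dataset. The shape of an eval case here is an assumption, not a fixed schema — most eval frameworks accept some variant of an input/expected pair:

```typescript
// Sketch: every debugged production failure becomes a permanent
// regression test case, keyed back to the original trace.
interface EvalCase {
  traceId: string;        // links the case back to the incident
  input: string;
  expectedOutput: string;
  addedAt: string;
}

class EvalDataset {
  private cases: EvalCase[] = [];

  addFromIncident(traceId: string, input: string, expectedOutput: string): void {
    this.cases.push({
      traceId,
      input,
      expectedOutput,
      addedAt: new Date().toISOString(),
    });
  }

  all(): EvalCase[] {
    return [...this.cases];
  }
}
```

The important property is the `traceId` link: when the case later fails in CI, you can pull up the original incident's trace and compare what the model saw then versus now.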

Cost & Performance Optimization

Track cost per request, per feature, and per user

Calculate cost at the request level using model pricing tables, then attribute spend to features and users. This is the only way to answer 'where is the money going?'
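The per-request arithmetic is simple. The model names and per-million-token prices below are placeholders, not real rates — look up current pricing for the models you actually use:

```typescript
// Sketch: per-request cost from a pricing table, plus per-feature rollup.
// Prices are illustrative placeholders (USD per million tokens).
const PRICES_PER_MTOK: Record<string, { input: number; output: number }> = {
  "small-model": { input: 0.15, output: 0.6 },
  "large-model": { input: 2.5, output: 10.0 },
};

function requestCostUsd(model: string, inputTokens: number, outputTokens: number): number {
  const p = PRICES_PER_MTOK[model];
  if (!p) throw new Error(`no pricing for model: ${model}`);
  return (inputTokens * p.input + outputTokens * p.output) / 1_000_000;
}

// Rollup that answers "where is the money going?" by feature.
function costByFeature(
  requests: { feature: string; model: string; inTok: number; outTok: number }[]
): Map<string, number> {
  const totals = new Map<string, number>();
  for (const r of requests) {
    const cost = requestCostUsd(r.model, r.inTok, r.outTok);
    totals.set(r.feature, (totals.get(r.feature) ?? 0) + cost);
  }
  return totals;
}
```

The same rollup keyed by user ID or by model gives you the other two breakdowns the checklist calls for.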

Set budget alerts and per-request cost caps

Alert when a single request exceeds $0.50 or when hourly spend exceeds 2x the 7-day average. Implement circuit breakers for runaway costs.
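A sketch of both guardrails in one small class, using the illustrative caps from the checklist; the hourly cap would be derived from your own 7-day average in practice:

```typescript
// Sketch: per-request cost cap plus an hourly-spend circuit breaker.
// Caps are illustrative; derive the hourly cap from your 7-day average.
class CostGuard {
  private hourlySpend = 0;

  constructor(
    private perRequestCapUsd = 0.5,
    private hourlyCapUsd = 100 // e.g. 2x the 7-day hourly average
  ) {}

  // Record a completed request's cost and report any violation.
  record(costUsd: number): "ok" | "request_over_cap" | "circuit_open" {
    this.hourlySpend += costUsd;
    if (this.hourlySpend >= this.hourlyCapUsd) return "circuit_open";
    if (costUsd > this.perRequestCapUsd) return "request_over_cap";
    return "ok";
  }

  // Called by a scheduler at the top of each hour.
  resetHour(): void {
    this.hourlySpend = 0;
  }
}
```

On `"request_over_cap"` you would alert; on `"circuit_open"` you would reject or degrade new requests until the window resets — that is what stops a runaway agent loop from burning budget all night.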

Monitor token efficiency over time

Track the ratio of useful output tokens to total tokens consumed. A rising ratio means your context engineering is improving. A falling ratio means you're wasting tokens on noise.

Implement model routing based on complexity

Not every request needs GPT-4o. Route simple queries to cheaper models and save expensive models for complex reasoning. Track cost savings from routing decisions.
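A deliberately crude routing heuristic to illustrate the idea — real routers typically use a trained classifier or a cheap LLM call to score complexity, and the model names here are placeholders:

```typescript
// Sketch: route by a cheap complexity heuristic (length + reasoning cues).
// Crude on purpose; a production router would use a classifier.
function routeModel(query: string): "cheap-model" | "expensive-model" {
  const reasoningCues = ["why", "compare", "analyze", "step by step", "prove"];
  const lower = query.toLowerCase();
  const looksComplex =
    query.length > 400 || reasoningCues.some((cue) => lower.includes(cue));
  return looksComplex ? "expensive-model" : "cheap-model";
}
```

To close the loop with the checklist, log the routing decision (and the cost of the model actually used) on each trace, so the savings from routing show up on your cost dashboard rather than remaining a guess.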

The Guiding Principle

At any point in time, you should be able to answer three questions about your agent: Is it working correctly? How much is it costing? Where are the bottlenecks? If you cannot answer all three from your observability data within 60 seconds, your observability stack has gaps.

— Adapted from SRE best practices applied to AI systems

13

Resources & Further Reading

Tools, docs, repos, and guides

Essential tools, documentation, repositories, and guides for building observable AI agent systems. From open-source platforms to industry best practices.

tool · Langfuse

Langfuse — Open Source LLM Observability

Open-source LLM observability platform with tracing, prompt management, evaluations, and cost tracking. Self-hostable.

tool · LangChain

LangSmith — LangChain Observability Platform

Production monitoring and debugging platform for LLM applications. Deep integration with LangChain/LangGraph but works with any framework.

tool · Arize AI

Arize Phoenix — ML & LLM Observability

Open-source observability for LLMs with trace visualization, retrieval analysis, and evaluation tools. Built for debugging RAG and agent systems.

tool · Helicone

Helicone — LLM Gateway & Observability

Proxy-based LLM observability that requires just one line of code. Automatic cost tracking, latency monitoring, and request logging.

tool · Braintrust

Braintrust — AI Product Evaluation

End-to-end platform for evaluating, monitoring, and improving AI products. Combines evals, logging, and prompt playground.

docs · OpenTelemetry

OpenTelemetry for LLM Observability

The vendor-neutral standard for distributed tracing. Use OTEL SDKs to instrument your agents and export to any backend.

repo · Traceloop

OpenLLMetry — OpenTelemetry for LLMs

Open-source auto-instrumentation for LLM frameworks using OpenTelemetry. Automatically traces LangChain, OpenAI, and more.

guide · Honeycomb

Honeycomb Guide to Observability

Charity Majors' team explains the difference between monitoring and observability. Essential reading for understanding observability principles.

blog · LangChain

Observability for LLM-based Agents

LangChain's guide to building observable LLM applications. Covers tracing, evaluation, and production monitoring patterns.

blog · Anthropic

Building Observable AI Systems

Anthropic's engineering blog, with insights on building production AI systems, including monitoring and evaluation best practices.

Recommended Starting Point

If you are just getting started with agent observability, begin with Langfuse (open-source, generous free tier) for tracing and cost tracking, read the Honeycomb guide to understand observability principles, and add OpenTelemetry instrumentation for vendor-neutral trace collection. You can always switch backends later without re-instrumenting your code.