Agent Observability
You can't fix what you can't see. Learn how to trace, debug, and monitor AI agents in production — from structured logging to real-time dashboards and cost tracking.
What is Agent Observability?
Why logging isn't enough for AI agents
Agent observability is the ability to understand what your AI agent is doing, why it made specific decisions, and how it performs — without deploying new code. It goes far beyond traditional logging.
Traditional software is deterministic: the same input produces the same output, and a stack trace tells you exactly where things broke. AI agents are non-deterministic: the same input can produce different outputs, failures are often subtle (wrong answer, not crash), and the root cause might be in the context the model saw, not in the code.
Traditional Software Monitoring
- Errors throw exceptions with stack traces
- Same input always produces same output
- Failures are binary: works or crashes
- Response time is the main performance metric
- Debugging = reading code + stack trace
Agent Observability
- Failures are often subtle (wrong answer, not crash)
- Same input can produce different outputs each time
- Quality exists on a spectrum, not binary
- Latency, tokens, cost, and correctness all matter
- Debugging = understanding what the model saw
Key Insight
When an AI agent produces a bad result, the question isn't “what line of code failed?” — it's “what did the model see when it made that decision?” Agent observability gives you the tools to answer that question.
The Three Pillars
Traces
The full execution path of an agent request — from user input through LLM calls, tool executions, and sub-agent invocations to final output. Traces show you the 'what happened' of every request.
Metrics
Quantitative measurements aggregated over time — latency distributions, token usage, error rates, costs. Metrics show you the 'how is it performing' across all requests.
Logs
Structured event records with context — decisions made, tools selected, errors encountered. Logs show you the 'why did it do that' for individual events.
Common Blind Spots Without Observability
Hallucination in the middle of a chain
The LLM hallucinates a fact in step 2 of a 5-step execution. Steps 3-5 build on this false premise. Without tracing, the final output looks wrong, but you can't tell which step introduced the error.
Silent tool failures
A tool returns an empty result instead of throwing an error. The agent continues with no data, producing a vague or generic response. Without observability, this looks like a model quality issue, not a tool failure.
Context window overflow
The agent's conversation history grows past the effective attention window. The model starts ignoring earlier instructions. Performance degrades gradually — no error, no crash, just silently worse outputs.
Cost spikes from retry loops
A tool validation failure triggers a retry loop. The agent makes 15 LLM calls for a single user request, consuming $2 in tokens instead of the expected $0.04. Without cost tracking, this goes unnoticed until the bill arrives.
“Observability is not about collecting data. It's about being able to ask arbitrary questions about your system without deploying new code.”
Charity Majors
CTO, Honeycomb
“The hardest bugs in AI systems are the ones where the model confidently does the wrong thing. Without tracing, you'll never know why.”
Harrison Chase
CEO, LangChain
“If you can't see what your agent is doing at every step, you don't have a production system. You have a demo.”
Jason Liu
Creator, Instructor (structured outputs)
Tracing Agent Execution
Following the chain from input to output
Tracing is the backbone of agent observability. A trace captures the entire execution path of a single request — every LLM call, tool execution, and decision point — as a tree of spans with precise timing and metadata.
Without tracing, a multi-step agent is a black box. You see the input and output, but not the 5-10 intermediate steps that produced the result. When something goes wrong, you're guessing which step failed.
Core Concepts
Root Span
The top-level span representing the entire agent execution. Every trace has exactly one root span. Its duration is the total request latency.
Child Span
A sub-operation within a parent span. LLM calls, tool executions, and retrieval steps are typically child spans of the root.
Span Attributes
Key-value metadata attached to a span: model name, token counts, tool name, success/failure, latency, and custom business context.
Span Events
Timestamped log entries within a span. Useful for recording specific moments like 'rate limited, retrying' or 'context window at 80% capacity'.
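A minimal sketch of how these four concepts map onto the OpenTelemetry API; the span names and attribute keys are illustrative rather than a required schema.
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("agent-tracer");

// Root span: one per request; its duration is the total request latency.
async function handleQueryTraced(query: string) {
  return tracer.startActiveSpan("agent.handle_query", async (rootSpan) => {
    rootSpan.setAttribute("queryLength", query.length); // attribute: key-value metadata

    // Child span: a sub-operation automatically nested under the root.
    await tracer.startActiveSpan("agent.step.llm_call", async (childSpan) => {
      childSpan.setAttribute("model", "gpt-4o");
      // Event: a timestamped moment recorded inside the span's lifetime.
      childSpan.addEvent("rate_limited_retrying", { attempt: 1 });
      childSpan.end();
    });

    rootSpan.end();
  });
}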
Anatomy of a Trace Tree
A trace tree shows the parent-child relationships between spans. Each span has a name, duration, and attributes. The tree structure reveals the execution flow and where time is spent.
// A typical agent trace tree structure
{
"traceId": "abc-123-def-456",
"rootSpan": {
"name": "agent.handle_query",
"duration": "3,200ms",
"attributes": { "userId": "usr_42", "model": "gpt-4o" },
"children": [
{
"name": "agent.plan",
"duration": "450ms",
"attributes": { "strategy": "decompose", "steps": 3 }
},
{
"name": "agent.step.retrieve",
"duration": "320ms",
"attributes": { "tool": "vector_search", "results": 5 }
},
{
"name": "agent.step.llm_call",
"duration": "1,800ms",
"attributes": {
"model": "gpt-4o",
"inputTokens": 4200,
"outputTokens": 850,
"cost": "$0.053"
}
},
{
"name": "agent.step.tool_call",
"duration": "280ms",
"attributes": {
"tool": "calculator",
"input": "revenue * 0.15",
"success": true
}
},
{
"name": "agent.synthesize",
"duration": "350ms",
"attributes": { "outputTokens": 200 }
}
]
}
}
Reading a trace tree
This trace took 3,200ms total. The LLM call consumed 56% of the time (1,800ms). The planning step generated 3 steps. Vector search found 5 results. The calculator tool succeeded. Total cost was $0.053. If the final answer is wrong, you can pinpoint exactly which step introduced the error.
Trace Waterfall (visual representation)
The waterfall view instantly reveals that the LLM call dominates latency. Optimizing the LLM step (prompt caching, smaller context, cheaper model) would have the biggest impact.
Implementing Traces with OpenTelemetry
OpenTelemetry (OTEL) is the vendor-neutral standard for distributed tracing. It works with every observability backend — Jaeger, Grafana Tempo, Honeycomb, Langfuse, and more. Here's a production setup for Node.js agents.
import { NodeSDK } from "@opentelemetry/sdk-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
import { Resource } from "@opentelemetry/resources";
import { trace, SpanStatusCode, type Span } from "@opentelemetry/api";
// Initialize OpenTelemetry SDK
const sdk = new NodeSDK({
resource: new Resource({
"service.name": "my-agent",
"service.version": "1.0.0",
"deployment.environment": "production",
}),
traceExporter: new OTLPTraceExporter({
url: "https://otel-collector.example.com/v1/traces",
}),
});
sdk.start();
// Create a tracer for agent operations
const tracer = trace.getTracer("agent-tracer");
// Helper to wrap agent operations in spans
export async function withSpan<T>(
name: string,
attributes: Record<string, string | number | boolean>,
fn: (span: Span) => Promise<T>,
): Promise<T> {
return tracer.startActiveSpan(name, async (span) => {
span.setAttributes(attributes);
try {
const result = await fn(span);
span.setStatus({ code: SpanStatusCode.OK });
return result;
} catch (error) {
span.setStatus({
code: SpanStatusCode.ERROR,
message: error instanceof Error ? error.message : "Unknown error",
});
span.recordException(error as Error);
throw error;
} finally {
span.end();
}
});
}
// Usage in agent code
async function handleQuery(query: string, userId: string) {
return withSpan("agent.query", { userId, queryLength: query.length }, async (rootSpan) => {
// Plan step — automatically a child of rootSpan
const plan = await withSpan("agent.plan", { query }, async (planSpan) => {
const result = await planner.createPlan(query);
planSpan.setAttribute("stepCount", result.steps.length);
return result;
});
// Execute each step
for (const [i, step] of plan.steps.entries()) {
await withSpan(
`agent.step.${step.type}`,
{ stepIndex: i, tool: step.tool ?? "none" },
async (stepSpan) => {
const result = await executeStep(step);
stepSpan.setAttribute("success", result.success);
stepSpan.setAttribute("outputTokens", result.tokens ?? 0);
return result;
},
);
}
});
}
Tracing Best Practices
- Name spans by operation, not implementation: Use "agent.plan" not "callGPT4ForPlanning". You want stable names even when you swap models.
- Always record token counts and model name on LLM spans: This is essential for cost attribution and performance analysis.
- Record tool inputs and outputs (truncated) on tool spans: When a tool returns bad data, you need to see what it received and returned.
- Use semantic conventions: Follow the OpenTelemetry semantic conventions for GenAI as they stabilize (a sketch of the current draft attribute names follows this list). This ensures compatibility across tools.
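A sketch of those draft attribute names on an LLM span. The gen_ai.* keys follow the incubating OpenTelemetry GenAI conventions and may change before they stabilize; llm and Message are the same placeholders used in the surrounding examples.
import { trace } from "@opentelemetry/api";

const tracer = trace.getTracer("agent-tracer");

async function tracedLlmCall(messages: Message[]) {
  return tracer.startActiveSpan("agent.step.llm_call", async (span) => {
    const response = await llm.generate({ model: "gpt-4o", messages });
    // Draft GenAI semantic-convention attributes (subject to change upstream).
    span.setAttributes({
      "gen_ai.operation.name": "chat",
      "gen_ai.request.model": "gpt-4o",
      "gen_ai.usage.input_tokens": response.usage.prompt_tokens,
      "gen_ai.usage.output_tokens": response.usage.completion_tokens,
    });
    span.end();
    return response;
  });
}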
Structured Logging
Logs that machines and humans can read
Structured logging means emitting logs as typed, machine-parseable JSON objects instead of freeform text strings. Every log entry has a consistent schema with fields for trace ID, event type, timestamps, and relevant metadata.
The difference is stark: console.log("Tool failed: search") is a dead end for debugging. {"event":"tool.error","tool":"search","error":"timeout","latencyMs":5000,"traceId":"abc"} can be queried, aggregated, alerted on, and correlated with other events.
Unstructured (text strings)
Can't filter, can't aggregate, can't correlate across requests. Finding "all requests slower than 5 seconds" means parsing text.
Structured (JSON objects)
Every field is queryable. Find slow requests: latencyMs > 5000. Cost by model: sum(cost) group by model.
Log Levels for AI Agents
Traditional log levels need reinterpretation for agents. A "warning" isn't just "something might be wrong" — it's "the agent is degrading in a way that affects user experience."
ERROR
When to use: Unrecoverable failures — LLM API errors, tool crashes, rate limit exhaustion, context window overflow
WARN
When to use: Degraded behavior — slow responses, high token usage, tool retries, fallback to cheaper model, approaching context limits
INFO
When to use: Normal operations — request start/end, LLM call complete, tool execution success, agent step transitions
DEBUG
When to use: Detailed diagnostics — full prompts, retrieved documents, model parameters, embedding scores (disabled in production unless sampling; see the sampling sketch below)
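A minimal sketch of request-level DEBUG sampling with Pino; the 5% rate and field names are illustrative choices, not a standard.
import pino from "pino";

const baseLogger = pino({ level: "info" });
const DEBUG_SAMPLE_RATE = 0.05; // illustrative: enable DEBUG for roughly 5% of requests

// Returns a request-scoped logger; sampled requests also emit debug-level entries.
export function createSampledLogger(traceId: string) {
  const debugSampled = Math.random() < DEBUG_SAMPLE_RATE;
  return baseLogger.child(
    { traceId, debugSampled },
    { level: debugSampled ? "debug" : "info" },
  );
}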
Implementation with Pino
Pino is one of the fastest structured logging libraries for Node.js, with minimal overhead in production. Use child loggers to propagate trace context without manually passing IDs everywhere.
import pino from "pino";
// Create a base logger with default fields
const baseLogger = pino({
level: process.env.LOG_LEVEL ?? "info",
formatters: {
level: (label) => ({ level: label }),
},
base: {
service: "my-agent",
version: "1.0.0",
environment: process.env.NODE_ENV,
},
});
// Create a request-scoped logger with trace context
export function createRequestLogger(traceId: string, userId?: string) {
return baseLogger.child({
traceId,
userId,
requestStartedAt: Date.now(),
});
}
// Usage in agent code
async function handleRequest(input: string) {
const traceId = crypto.randomUUID();
const log = createRequestLogger(traceId, "usr_42");
log.info({
event: "agent.request.start",
inputLength: input.length,
inputPreview: input.slice(0, 100),
});
const llmStart = performance.now();
const response = await llm.generate({
model: "gpt-4o",
messages: [{ role: "user", content: input }],
});
log.info({
event: "llm.call.complete",
model: "gpt-4o",
latencyMs: Math.round(performance.now() - llmStart),
inputTokens: response.usage.prompt_tokens,
outputTokens: response.usage.completion_tokens,
totalTokens: response.usage.total_tokens,
finishReason: response.choices[0].finish_reason,
});
if (response.usage.total_tokens > 10000) {
log.warn({
event: "llm.high_token_usage",
totalTokens: response.usage.total_tokens,
threshold: 10000,
});
}
return response;
}
Correlation IDs with AsyncLocalStorage
The hardest part of logging is propagating trace context through deeply nested function calls. Node.js's AsyncLocalStorage solves this — any function in the async call stack can access the current request's trace context without explicit parameter passing.
// Correlation: connecting logs across services
import { AsyncLocalStorage } from "node:async_hooks";
interface RequestContext {
traceId: string;
spanId: string;
userId: string;
feature: string;
}
const contextStore = new AsyncLocalStorage<RequestContext>();
// Middleware: set context for entire request lifecycle
export function withRequestContext(
traceId: string,
userId: string,
feature: string,
fn: () => Promise<void>,
) {
const ctx: RequestContext = {
traceId,
spanId: crypto.randomUUID().slice(0, 8),
userId,
feature,
};
return contextStore.run(ctx, fn);
}
// Logger automatically includes request context
export function getLogger() {
const ctx = contextStore.getStore();
if (!ctx) return baseLogger;
return baseLogger.child({
traceId: ctx.traceId,
spanId: ctx.spanId,
userId: ctx.userId,
feature: ctx.feature,
});
}
// Any function in the call stack can get a correlated logger
async function searchDocuments(query: string) {
const log = getLogger(); // Automatically has traceId, userId, etc.
log.info({ event: "rag.search.start", query });
const results = await vectorDB.search(query);
log.info({
event: "rag.search.complete",
resultCount: results.length,
topScore: results[0]?.score,
});
return results;
}
What to Log (and What Not To)
Always Log
- Request start/end with latency
- LLM calls: model, tokens, latency, finish reason
- Tool calls: name, latency, success/failure
- Errors: type, message, retry count
- Cost: per-request calculated cost
- Decisions: why a tool or model was selected
Never Log (or log with caution)
- Full user messages (PII risk — truncate or hash; see the redaction sketch after this list)
- Full LLM outputs (storage cost — log length only)
- API keys or auth tokens
- Full RAG document contents (log metadata only)
- Full system prompts (version reference instead)
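A minimal redaction helper for the risky fields above; the preview length and truncated hash are illustrative choices, not a standard.
import { createHash } from "node:crypto";

// Truncate long user content and keep a stable hash so identical inputs
// can still be correlated across log entries without storing the raw text.
export function redactForLogging(text: string, maxChars = 100) {
  return {
    preview: text.slice(0, maxChars),
    length: text.length,
    sha256: createHash("sha256").update(text).digest("hex").slice(0, 16),
  };
}

// Usage: log.info({ event: "agent.request.start", input: redactForLogging(userMessage) });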
Metrics & KPIs for Agents
Latency, token usage, success rates, cost
Agent metrics go far beyond response time and error rate. You need to track latency distributions, token consumption, per-request cost, tool call accuracy, and response quality — all broken down by model, feature, and user segment.
The goal is to answer three questions at any time: Is the agent working correctly? How much is it costing? Where are the bottlenecks?
Time to First Token (TTFT)
How long until the user sees the first token of the response. Critical for perceived performance in streaming UIs.
Total Response Latency
End-to-end time from user input to complete response. Includes planning, tool calls, LLM generation, and post-processing.
Tool Call Latency
Time for each tool execution. Slow tools dominate agent latency since LLM calls block on tool results.
Input Tokens per Request
How much context the model receives. Tracks system prompt size, conversation history, RAG documents, and tool definitions. Directly correlates with cost.
Output Tokens per Request
How verbose the model's response is. High output tokens may indicate the model is over-explaining or generating unnecessary content.
Token Efficiency Ratio
Useful output tokens / total input tokens consumed. A falling ratio means you're stuffing more context for the same quality of output.
Cost per Request
Total LLM API cost for a single user request, including all LLM calls, retries, and sub-agent invocations.
Cost per Conversation
Total cost across all messages in a conversation. Rises with conversation length due to growing context windows.
Cost per User (daily/monthly)
Aggregated cost attributed to individual users. Identifies power users and potential abuse.
Success Rate (verified)
Percentage of requests that produce a verified-correct response. Not just 'didn't throw an error' — use eval checks for actual correctness.
Tool Call Accuracy
Percentage of tool calls where the agent selected the correct tool with valid arguments. Wrong tool selection is a major failure mode.
Hallucination Rate
Percentage of responses flagged by automated eval checks as containing fabricated information. Requires running evals on sampled production traffic.
Retry Rate
Percentage of requests that required at least one retry (LLM or tool). High retry rates indicate instability or poor prompt design.
Why Percentiles, Not Averages
An average latency of 1.5 seconds sounds fine. But if your p99 is 25 seconds, 1 in 100 users waits 25 seconds. With 10,000 daily requests, that's 100 terrible experiences per day. Percentiles (p50, p95, p99) reveal the full distribution. Use histogram metric types to compute percentiles efficiently.
Typical Agent Latency Budget (3-second target)
LLM generation dominates at 60% of total latency. Optimize the prompt/context size first — a 50% reduction in input tokens can reduce LLM latency by 30-40%.
Implementation with OpenTelemetry Metrics
import { metrics } from "@opentelemetry/api";
const meter = metrics.getMeter("agent-metrics");
// Latency histograms (use buckets that match your SLOs)
const requestLatency = meter.createHistogram("agent.request.latency_ms", {
description: "End-to-end request latency",
unit: "ms",
});
const ttft = meter.createHistogram("agent.ttft_ms", {
description: "Time to first token",
unit: "ms",
});
const toolLatency = meter.createHistogram("agent.tool.latency_ms", {
description: "Individual tool call latency",
unit: "ms",
});
// Token counters
const inputTokens = meter.createHistogram("agent.tokens.input", {
description: "Input tokens per request",
});
const outputTokens = meter.createHistogram("agent.tokens.output", {
description: "Output tokens per request",
});
// Cost histogram
const requestCost = meter.createHistogram("agent.cost_usd", {
description: "Cost per request in USD",
});
// Quality counters
const requestCounter = meter.createCounter("agent.requests.total", {
description: "Total requests by status",
});
const toolCallCounter = meter.createCounter("agent.tool_calls.total", {
description: "Tool calls by tool and status",
});
// Recording metrics in agent code
async function handleRequest(input: string, context: RequestContext) {
const start = performance.now();
const labels = {
model: context.model,
feature: context.feature,
};
try {
const result = await agent.execute(input, context);
const latencyMs = performance.now() - start;
requestLatency.record(latencyMs, { ...labels, status: "success" });
inputTokens.record(result.usage.inputTokens, labels);
outputTokens.record(result.usage.outputTokens, labels);
requestCost.record(result.estimatedCost, labels);
requestCounter.add(1, { ...labels, status: "success" });
if (result.ttftMs) {
ttft.record(result.ttftMs, labels);
}
for (const tool of result.toolCalls) {
toolCallCounter.add(1, {
tool: tool.name,
success: String(tool.success),
});
toolLatency.record(tool.latencyMs, { tool: tool.name });
}
return result;
} catch (error) {
requestLatency.record(performance.now() - start, { ...labels, status: "error" });
requestCounter.add(1, { ...labels, status: "error" });
throw error;
}
}
Debugging Agent Failures
Root cause analysis for non-deterministic systems
Debugging AI agents is fundamentally different from debugging traditional software. There are no stack traces for wrong answers, no breakpoints for hallucinations. The root cause often lies in what the model saw, not in the code that ran.
Non-determinism makes reproduction difficult — the same input can produce different outputs on each run. Without captured execution context, you cannot reliably investigate failures. This is why observability is a prerequisite for debugging, not an afterthought.
The Four Categories of Agent Failure
Every agent failure falls into one of these categories. Classifying the failure type is the first step to fixing it — each category has a different debugging approach and different fix.
Model Failure
The LLM itself produces incorrect output despite receiving good context. Includes hallucinations, reasoning errors, instruction non-compliance, and format violations.
How to Debug
1. Compare the model's input (full context window) against its output
2. Check if the answer contradicts information in the context
3. Verify that instructions in the system prompt were followed
4. Test with a different model to isolate model-specific issues
Context Failure
The model produced a reasonable output given what it saw, but it saw the wrong information. Includes missing documents, irrelevant RAG results, stale data, and context overflow.
How to Debug
1. Inspect the exact context window contents at the time of generation
2. Verify RAG retrieval quality — were the right documents retrieved?
3. Check token counts — was the context window overflowing?
4. Look for contradictions between retrieved documents
Tool Failure
A tool returns incorrect data, times out, or fails silently. The agent continues with bad or missing data, producing a subtly wrong response.
How to Debug
1. Check the tool span: what input did the tool receive? What did it return?
2. Look for empty or null tool responses that weren't flagged as errors
3. Verify tool response latency — did it time out?
4. Compare tool output against ground truth data
Orchestration Failure
The agent's planning or routing logic made a poor decision. Wrong tool selected, unnecessary steps taken, infinite loops, or premature termination.
How to Debug
1. Examine the plan span: what strategy did the agent choose and why?
2. Count the number of steps — is it reasonable for this query?
3. Look for repeated tool calls that suggest a retry loop
4. Check if the agent terminated before completing all necessary steps
The 6-Step Debugging Workflow
Start from the trace
Pull up the trace for the failing request using its trace ID. The trace tree gives you the full execution path.
Identify the failing span
Walk the trace tree. Which span produced the first incorrect output? Was it a planning step, an LLM call, or a tool execution?
Inspect inputs and outputs
For the failing span, examine what went in (context, parameters) and what came out (response, tool result). The bug is in the gap between expected and actual.
Classify the failure
Is it a model failure (bad output given good context), context failure (bad context), tool failure (bad data), or orchestration failure (bad plan)?
Reproduce with replay
Use the stored execution context (input, messages, tool results) to replay the exact same request. If you can reproduce it, you can fix it.
Fix and add a regression eval
Apply the fix and add this case to your evaluation dataset. Future changes will be tested against this failure case.
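A minimal sketch of step 6, appending the debugged case to a JSONL eval dataset; the file path and record shape are illustrative.
import { appendFile } from "node:fs/promises";

interface RegressionCase {
  traceId: string;          // links back to the original failing trace
  input: string;            // the user input that triggered the failure
  expectedBehavior: string; // what a correct response must contain or do
  failureCategory: "model" | "context" | "tool" | "orchestration";
  addedAt: string;
}

export async function addRegressionCase(c: RegressionCase) {
  await appendFile("evals/regressions.jsonl", JSON.stringify(c) + "\n");
}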
Replay Debugging: Reproducing Non-Deterministic Failures
The hardest part of debugging AI agents is that you can't just "re-run it with the same input" — the model might give a different (correct) answer the second time. Replay debugging solves this by capturing and storing the exact execution context so you can feed it back to the model later.
// Store execution context for replay debugging
interface ExecutionSnapshot {
traceId: string;
input: string;
systemPrompt: string;
messages: Message[];
toolResults: Record<string, unknown>;
retrievedDocuments: Document[];
modelConfig: { model: string; temperature: number };
timestamp: number;
}
// Capture on every request (or sample in production)
async function captureSnapshot(
traceId: string,
context: AgentContext,
): Promise<void> {
const snapshot: ExecutionSnapshot = {
traceId,
input: context.input,
systemPrompt: context.systemPrompt,
messages: context.messages,
toolResults: context.toolResults,
retrievedDocuments: context.retrievedDocs,
modelConfig: {
model: context.model,
temperature: context.temperature,
},
timestamp: Date.now(),
};
// Store in debug database (TTL: 30 days)
await debugStore.save(traceId, snapshot, { ttlDays: 30 });
}
// Replay a failing request with the exact same context
async function replayRequest(traceId: string): Promise<ReplayResult> {
const snapshot = await debugStore.get(traceId);
if (!snapshot) throw new Error(`No snapshot for trace ${traceId}`);
// Reconstruct the exact context the model saw
const replayResponse = await llm.generate({
model: snapshot.modelConfig.model,
temperature: snapshot.modelConfig.temperature,
system: snapshot.systemPrompt,
messages: snapshot.messages,
});
return {
originalTraceId: traceId,
replayTraceId: crypto.randomUUID(),
originalOutput: snapshot.messages.at(-1)?.content,
replayOutput: replayResponse.text,
matched: replayResponse.text === snapshot.messages.at(-1)?.content,
};
}
The Debugging Mindset Shift
In traditional software, you debug the code. In AI agents, you debug the context. The model is usually capable of producing the right answer — the question is whether it received the right information. Every debugging session should start with: "What did the model see when it made this decision?"
Observability Tools
LangSmith, Phoenix, Langfuse, Braintrust
The agent observability ecosystem has matured rapidly. From open-source platforms to managed services, there are now dedicated tools for every aspect of LLM monitoring. Here is a comprehensive comparison of the leading options.
The right tool depends on your constraints: team size, budget, existing infrastructure, and whether you need self-hosting. Most teams start with one managed platform and add OpenTelemetry as they scale.
LangSmith
Production monitoring and debugging platform from the LangChain team. Deep integration with LangChain and LangGraph, but works with any framework via the LangSmith SDK. Strong trace visualization with a playground for prompt iteration.
Key Features
- Trace visualization with span-level detail
- Prompt playground for iteration and testing
- Dataset management for evaluations
- Online evaluation runners
- Human annotation queues
- Comparison views for A/B testing prompts
Arize Phoenix
Open-source observability platform built specifically for LLM applications. Strong focus on trace visualization, retrieval analysis, and evaluation. Runs locally or in the cloud. Great for debugging RAG systems.
Key Features
- Open-source (Apache 2.0) — self-host anywhere
- Trace and span visualization
- Retrieval (RAG) quality analysis
- Embedding drift detection
- LLM evaluation framework
- OpenTelemetry-native instrumentation
Langfuse
Open-source LLM observability and analytics platform. Framework-agnostic with SDKs for Python and TypeScript. Includes prompt management, cost tracking, and evaluation features. Can be self-hosted.
Key Features
- Open-source (MIT) — full self-hosting support
- Framework-agnostic SDK (works with any LLM provider)
- Prompt management and versioning
- Cost tracking with model pricing tables
- Evaluation and scoring pipelines
- Session and user-level analytics
Braintrust
End-to-end platform combining evaluations, logging, and prompt playground. Strong focus on eval-driven development with built-in scoring functions. Also serves as an AI proxy for cost optimization.
Key Features
- Eval framework with built-in scoring functions
- Production logging and tracing
- Prompt playground with side-by-side comparison
- AI proxy with caching and fallbacks
- Dataset management for golden test sets
- CI/CD integration for automated eval runs
Helicone
Proxy-based observability that requires just one line of code to integrate. Routes your LLM API calls through Helicone's gateway, automatically capturing cost, latency, and usage metrics without SDK changes.
Key Features
- One-line integration (URL swap)
- Automatic cost and latency tracking
- Request/response logging
- Rate limiting and caching at the proxy layer
- User-level analytics and cost attribution
- Prompt threat detection
OpenTelemetry (DIY)
Build your own observability stack using the OpenTelemetry standard. Use OTEL SDKs for instrumentation and export to any backend — Jaeger, Grafana Tempo, Honeycomb, Datadog, or custom storage. Maximum flexibility, but requires more setup.
Key Features
- Vendor-neutral standard — no lock-in
- Works with any observability backend
- OpenLLMetry provides auto-instrumentation for LLM frameworks
- Full control over what you trace and store
- Integrates with existing infrastructure
- Semantic conventions for GenAI (emerging standard)
| Dimension | LangSmith | Arize Phoenix | Langfuse | Braintrust | Helicone | OpenTelemetry |
|---|---|---|---|---|---|---|
| Open Source | No | Yes (Apache 2.0) | Yes (MIT) | No | No | Yes (CNCF) |
| Self-Hostable | No | Yes | Yes | No | No | Yes |
| Tracing | Strong | Strong | Strong | Good | Basic | Excellent |
| Evals | Strong | Good | Good | Excellent | Basic | Manual |
| Cost Tracking | Good | Basic | Strong | Good | Excellent | Manual |
| Setup Effort | Low | Low | Low | Low | Minimal | High |
How to Choose
- Starting out? Pick Langfuse (open-source, generous free tier) or Helicone (one-line setup).
- Using LangChain/LangGraph? LangSmith has the deepest integration and best trace UI for these frameworks.
- Need evals + monitoring? Braintrust combines both in one platform, reducing tool sprawl.
- Enterprise with existing infra? OpenTelemetry + your existing backend (Datadog, Grafana) avoids adding another vendor.
- Privacy/compliance requirements? Langfuse or Phoenix can be self-hosted in your own infrastructure.
Building Dashboards
Real-time visibility into agent performance
A well-designed dashboard is the command center for your agent in production. It should answer three questions at a glance: Is it working? Is it correct? Is it affordable?
The difference between a useful dashboard and a vanity dashboard is actionability. Every panel should either tell you things are fine, or tell you exactly where to look when they are not. If a metric has never triggered an investigation, it does not belong on the dashboard.
Health Row
The first row every engineer looks at. Answers: 'Is the agent working right now?' If any panel is red, investigate immediately. Example queries for these panels are sketched after the list below.
Request Rate (RPM)
A sudden drop means something is broken upstream. A spike means unexpected traffic or a retry storm.
Error Rate by Type
Not just 'errors happened' but which type: rate limits, timeouts, tool failures, or model errors. Each has a different fix.
Latency Distribution (p50/p95/p99)
p50 tells you the typical experience. p99 tells you the worst experience. If p99 diverges from p50, you have outlier issues.
Active Agent Sessions
Track concurrency. If sessions spike, you may hit rate limits or exhaust your API quota.
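A sketch of what these panels might query, assuming the OpenTelemetry metrics defined earlier are exported to a Prometheus-style backend (dots become underscores, histograms gain a _bucket suffix); the queries and the error_type label are illustrative, not canonical.
// Hypothetical health-row panel definitions; adapt the query syntax to your backend.
const healthRowPanels = [
  {
    title: "Request Rate (RPM)",
    query: 'sum(rate(agent_requests_total[5m])) * 60',
  },
  {
    title: "Error Rate by Type",
    // assumes an error_type label is recorded alongside status on error counters
    query: 'sum by (error_type) (rate(agent_requests_total{status="error"}[5m]))',
  },
  {
    title: "Latency p99",
    query: 'histogram_quantile(0.99, sum by (le) (rate(agent_request_latency_ms_bucket[5m])))',
  },
];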
Quality Row
The hardest row to build but the most important. Answers: 'Are the agent's responses actually good?' Requires eval pipelines on sampled traffic.
Tool Call Success Rate by Tool
If one specific tool is failing while others are fine, the problem is localized. Drill down by tool name.
Eval-Flagged Hallucination Rate
Automated evals on sampled production traffic detect hallucinations. A rising rate means the model or context quality is degrading.
Completion Rate
What percentage of requests get a full response? Low completion means early terminations, timeouts, or context window overflows.
User Feedback Score
Thumbs up/down or star ratings from users. The ground truth signal for quality, but sparse and delayed.
Cost Row
Answers: 'How much is the agent costing, and where is the money going?' Essential for budget planning and catching runaway costs before the bill arrives.
Cost per Hour by Model
Identifies which models dominate your spend. If GPT-4o cost suddenly spikes, a prompt change may have increased token usage.
Tokens per Request (Input vs Output)
Rising input tokens = growing context. Rising output tokens = verbosity. Both directly increase cost.
Cost per User (Top 10)
Power users (or abuse) can dominate your spend. Set per-user budgets and catch outliers early.
Cost per Feature
Which features are expensive? A search feature using GPT-4o might be 10x costlier than one using GPT-4o-mini with no quality difference.
Dashboard Design Principles
Top-to-bottom severity
Health at the top (first thing you see), quality in the middle, cost at the bottom. In an incident, you read top-down.
Every panel answers one question
If you can't articulate what question a panel answers, delete it. 'What is our current error rate by type?' is good. 'Various metrics' is not.
Alert thresholds visible on panels
Draw horizontal lines on charts at your SLO thresholds. This makes it instantly clear when a metric is approaching danger.
Time range consistency
All panels should use the same time range. A mismatch between a 5-minute error rate and a 24-hour cost total creates confusion.
Drill-down links on every panel
Clicking an error rate panel should take you to filtered traces for those errors. Dashboards are for detection; traces are for investigation.
No more than 3 dashboards per team
An overview dashboard, a debugging dashboard, and a cost dashboard. More than that and nobody looks at any of them.
Example: Agent Health Overview
Requests/min: 142
Error Rate: 1.8%
p50 Latency: 1.2s
p99 Latency: 8.4s
A healthy dashboard at a glance: stable request rate, low error rate, acceptable p50, and a p99 that deserves investigation (shown in yellow).
The One-Dashboard Rule
If your team has to look at more than one dashboard to answer "is the agent healthy?", you have too many dashboards. Start with a single overview that covers health, quality, and cost. Only split into separate dashboards when the single view becomes too crowded to be useful at a glance.
Production Monitoring
Alerts, anomaly detection, and incident response
Production monitoring for AI agents requires a fundamentally different approach than traditional services. Agents fail in subtle ways — wrong answers, slowly degrading quality, invisible cost creep — that standard uptime checks will never catch.
The foundation is SLO-based alerting: define what "good" looks like (Service Level Objectives), measure against it continuously, and alert based on how fast you are consuming your error budget — not on individual errors.
Agent Response Quality SLO
Target: 99.5% of requests complete successfully in < 10 seconds
Error budget: 0.5% = ~3.6 hours of downtime per month
| Window | Burn Rate | Action | Severity |
|---|---|---|---|
| 1 hour | 14.4x | Page on-call | critical |
| 6 hours | 6x | Slack alert | warning |
| 24 hours | 3x | Ticket | info |
A fast burn (14.4x over a 1-hour window) consumes about 2% of the monthly error budget per hour and would exhaust it in roughly two days if it continued, so it pages immediately. A slow burn (6x over 6 hours) consumes about 5% of the budget and leaves days of headroom, warranting a Slack alert rather than a page.
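A small sketch of the burn-rate arithmetic behind this table, using the 0.5% budget defined above; the values in the comments are a worked example.
// Burn rate = observed error rate / error budget rate.
// At a burn rate of 1, the monthly error budget lasts exactly one month.
const ERROR_BUDGET = 0.005; // 99.5% SLO leaves a 0.5% error budget

function burnRate(observedErrorRate: number, budget = ERROR_BUDGET): number {
  return observedErrorRate / budget;
}

// Worked example: 7.2% of requests failing over the last hour
const rate = burnRate(0.072);                   // 14.4
const budgetConsumedPerHour = rate / (30 * 24); // 0.02, i.e. 2% of the monthly budget per hour
const hoursToExhaustion = (30 * 24) / rate;     // 50 hours if the burn continues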
Alert Severity Tiers
The most important rule in alerting: if everything is critical, nothing is critical. Tier your alerts by severity, and make sure each tier has a clear response expectation and routing.
Critical
Action: Page on-call engineer immediately
Example Triggers
- Error budget burn rate above 14.4x (1-hour window)
- Zero successful requests for 5+ minutes (total outage)
- p99 latency exceeds 30 seconds for 10+ minutes
- LLM API returning 500s for all requests
Warning
Action: Notify team via Slack, investigate during business hours
Example Triggers
- Error budget burn rate above 6x (6-hour window)
- Hourly cost exceeds 2x the 7-day average
- Hallucination rate above 3% (detected via evals)
- Token usage trending upward 20%+ week-over-week
Info
Action: Log for review, include in weekly reports
Example Triggers
- New model version deployed successfully
- Daily cost report summary
- Eval scores trending above baseline
- Rate limit approached but not exceeded
Common Anomaly Patterns in Agent Systems
These are the subtle failure modes that simple threshold alerts miss. Each requires a different detection strategy — often comparing current behavior against historical baselines rather than fixed thresholds.
Token Usage Drift
Average input tokens per request gradually increasing over days/weeks. Usually caused by growing conversation histories, expanding system prompts, or RAG returning more chunks.
Detection
Compare 7-day rolling average against 30-day baseline. Alert on > 20% increase (a detection sketch follows these patterns).
Impact
Cost increases linearly. Quality may degrade as context window fills. Eventually hits model limits.
Latency Bimodality
Latency distribution develops two peaks instead of one. Some requests complete in 1-2s while others take 10-15s. Often caused by tool timeouts or cache miss patterns.
Detection
Monitor the gap between p50 and p95. When p95 > 5x p50, investigate.
Impact
Inconsistent user experience. Slow requests may timeout at the client level.
Silent Quality Degradation
Success rate (non-error) stays high but eval-measured quality drops. The model is returning answers that don't throw errors but are subtly wrong or less helpful.
Detection
Run automated evals on 5-10% of production traffic. Track eval scores over time.
Impact
Users lose trust gradually. By the time you notice from feedback, many users have churned.
Cost Spike from Retry Storms
A tool or model intermittently fails, triggering retry logic. Each retry makes a full LLM call, multiplying cost. A single user request can generate 10-20 LLM calls.
Detection
Track retry count per request. Alert when any request exceeds 3 retries.
Impact
5-20x cost increase per affected request. Can exhaust rate limits, cascading to other users.
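A sketch of two of the detection heuristics above, token-usage drift and latency bimodality. The MetricQuery accessor is a hypothetical stand-in for whatever metrics backend you query; the thresholds mirror the patterns described, not a standard.
interface AnomalyCheck {
  name: string;
  triggered: boolean;
  detail: string;
}

// Hypothetical metrics accessor: returns an aggregate over a lookback window in days.
type MetricQuery = (metric: string, agg: "avg" | "p50" | "p95", windowDays: number) => Promise<number>;

export async function detectDrift(query: MetricQuery): Promise<AnomalyCheck[]> {
  const checks: AnomalyCheck[] = [];

  // Token usage drift: 7-day rolling average vs 30-day baseline, alert on > 20% increase.
  const tokens7d = await query("agent.tokens.input", "avg", 7);
  const tokens30d = await query("agent.tokens.input", "avg", 30);
  checks.push({
    name: "token_usage_drift",
    triggered: tokens7d > tokens30d * 1.2,
    detail: `7d avg ${tokens7d.toFixed(0)} vs 30d baseline ${tokens30d.toFixed(0)}`,
  });

  // Latency bimodality: investigate when p95 exceeds 5x p50.
  const p50 = await query("agent.request.latency_ms", "p50", 1);
  const p95 = await query("agent.request.latency_ms", "p95", 1);
  checks.push({
    name: "latency_bimodality",
    triggered: p95 > p50 * 5,
    detail: `p95 ${p95.toFixed(0)}ms vs p50 ${p50.toFixed(0)}ms`,
  });

  return checks;
}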
Agent-Specific Incident Response
Every alert needs a runbook. For AI agents, runbooks must include checking the LLM provider status, examining recent prompt/model changes, and verifying context quality — not just checking server health.
// Agent-specific incident response playbook
interface IncidentPlaybook {
symptom: string;
severity: "critical" | "warning";
steps: string[];
escalation: string;
}
const playbooks: IncidentPlaybook[] = [
{
symptom: "Error rate exceeds SLO",
severity: "critical",
steps: [
"Check LLM provider status page (OpenAI, Anthropic)",
"Review error type breakdown — is it rate limits, timeouts, or 500s?",
"If rate limits: enable fallback model routing",
"If timeouts: check tool service health",
"If 500s: check for recent deployment (prompt/model change)",
"If no provider issue: check recent code/config changes",
],
escalation: "If unresolved in 15 minutes, escalate to platform team",
},
{
symptom: "Cost anomaly detected (2x+ hourly average)",
severity: "warning",
steps: [
"Identify which model/feature is driving the spike",
"Check for retry storms: look for requests with retries > 3",
"Check for token usage spikes: was a system prompt changed?",
"Check for traffic spikes: is it legitimate traffic or abuse?",
"If abuse: enable per-user rate limiting",
"If retry storm: fix the failing tool/model and reset",
],
escalation: "If cost exceeds $500/hour, page finance + engineering lead",
},
{
symptom: "Hallucination rate spike",
severity: "warning",
steps: [
"Check if a new model version was deployed",
"Verify RAG retrieval quality — are correct docs being returned?",
"Check system prompt: was it recently modified?",
"Review recent eval results for regression patterns",
"If model change: rollback to previous model version",
"If context issue: investigate RAG pipeline health",
],
escalation: "If affects > 5% of requests, escalate to AI/ML team",
},
];
// Automated incident detection
async function checkAgentHealth(
metricsClient: MetricsClient,
): Promise<IncidentAlert[]> {
const alerts: IncidentAlert[] = [];
const now = Date.now();
// Check error budget burn rate
const errorRate1h = await metricsClient.query(
"rate(agent.errors[1h]) / rate(agent.requests[1h])"
);
const monthlyBudget = 0.005; // 99.5% SLO = 0.5% error budget
if (errorRate1h > monthlyBudget * 14.4) {
alerts.push({
type: "error_budget_fast_burn",
severity: "critical",
message: `Error budget burning at ${(errorRate1h / monthlyBudget).toFixed(1)}x monthly rate`,
runbook: "https://wiki/runbooks/agent-error-budget",
timestamp: now,
});
}
// Check cost anomaly
const hourlyCost = await metricsClient.query("sum(agent.cost_usd[1h])");
const avgHourlyCost7d = await metricsClient.query(
"avg_over_time(sum(agent.cost_usd[1h])[7d:])"
);
if (hourlyCost > avgHourlyCost7d * 2) {
alerts.push({
type: "cost_anomaly",
severity: "warning",
message: `Hourly cost $${hourlyCost.toFixed(2)} is ${(hourlyCost / avgHourlyCost7d).toFixed(1)}x the 7-day average`,
runbook: "https://wiki/runbooks/agent-cost",
timestamp: now,
});
}
return alerts;
}
The Post-Incident Review Question
Traditional post-mortems ask "what code failed?" Agent post-incident reviews need additional questions: What context did the agent see? Was it a model failure or a context failure? Would better observability have caught this earlier? Always add the failing case to your eval dataset.
Cost Tracking & Optimization
Know where every token dollar goes
LLM costs are the most unpredictable line item in your infrastructure budget. Unlike traditional compute where costs scale linearly with traffic, agent costs can spike 10-50x based on prompt changes, retry loops, or context window growth — often without anyone noticing until the bill arrives.
The solution is per-request cost attribution: know exactly how much every request costs, attribute it to a user, feature, and model, and set alerts before spending spirals. You need to answer "where is every token dollar going?" at any time.
Typical Agent Cost Breakdown
LLM tokens (input + output) dominate at 60%. Output tokens are particularly expensive — 3-5x the cost of input tokens for most models. Retries add 15% on average; optimizing retry logic has outsized cost impact.
| Model | Input | Output | Best For |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Complex reasoning, coding |
| GPT-4o-mini | $0.15 | $0.60 | Simple tasks, classification |
| Claude Sonnet 4 | $3.00 | $15.00 | Complex analysis, coding |
| Claude Haiku 3.5 | $0.80 | $4.00 | Quick tasks, routing |
| Gemini 2.0 Flash | $0.10 | $0.40 | High-volume, simple |
Output tokens cost 3-5x more than input tokens. A 1,000-token response from GPT-4o costs $0.01 — but that's $10K/month at 1M requests. Pricing as of early 2025; check provider sites for current rates.
Cost Optimization Strategies
These five strategies, applied together, can reduce agent costs by 60-80% without meaningful quality degradation. Start with model routing — it typically has the highest ROI.
Model Routing
Not every request needs GPT-4o. Route simple queries (FAQ, classification, extraction) to cheaper models and reserve expensive models for complex reasoning. Use a lightweight classifier to determine complexity; a routing sketch follows the steps below.
Implementation Steps
1. Build a complexity classifier (can be rule-based or a cheap LLM call)
2. Route simple queries to GPT-4o-mini or Haiku ($0.15-0.80/MTok vs $2.50-3.00/MTok)
3. Route complex queries to GPT-4o or Claude Sonnet
4. Track quality metrics per route to ensure cheap models aren't degrading output
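A minimal routing sketch along the lines of the steps above; the patterns, length cutoff, and model names are illustrative, and a production router would also log its decision for the quality tracking in step 4.
// Route cheap, well-structured queries to a small model and everything else
// to a larger one. A rule-based heuristic keeps the router itself nearly free.
type Route = { model: string; reason: string };

const SIMPLE_PATTERNS = [/^(what|when|where|who) is\b/i, /\bclassify\b/i, /\bextract\b/i];

export function routeByComplexity(query: string): Route {
  const looksSimple =
    query.length < 200 && SIMPLE_PATTERNS.some((p) => p.test(query));

  return looksSimple
    ? { model: "gpt-4o-mini", reason: "short query matched a simple-intent pattern" }
    : { model: "gpt-4o", reason: "defaulted to the larger model for complex reasoning" };
}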
Prompt Caching
If your system prompt and tool definitions are identical across requests, prompt caching lets you pay full price for those tokens once and a heavily discounted rate on subsequent requests. Both Anthropic and OpenAI offer native prompt caching; a caching sketch follows the steps below.
Implementation Steps
1. Structure prompts with a static prefix (system prompt, tools) and a dynamic suffix (user input)
2. Enable prompt caching on your LLM provider (explicit cache_control for Anthropic, automatic for OpenAI)
3. Monitor cache hit rate — should be > 80% for repetitive workloads
4. Keep the static prefix stable — changing it invalidates the cache
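A sketch of the static-prefix structure using Anthropic's explicit cache_control marker. The model name matches the pricing table above; the system prompt constant is a stand-in, and the exact caching API should be verified against your provider's current docs.
import Anthropic from "@anthropic-ai/sdk";

const client = new Anthropic();
const STATIC_SYSTEM_PROMPT = "You are a support agent..."; // assumed identical across requests

async function cachedCall(userInput: string) {
  return client.messages.create({
    model: "claude-sonnet-4-20250514",
    max_tokens: 1024,
    // Static prefix marked cacheable; only the user message changes per request.
    system: [
      { type: "text", text: STATIC_SYSTEM_PROMPT, cache_control: { type: "ephemeral" } },
    ],
    messages: [{ role: "user", content: userInput }],
  });
}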
Context Window Optimization
Every token in the context window costs money. Reduce input tokens by compressing conversation history, limiting RAG results, and trimming tool definitions to only what's needed for each request.
Implementation Steps
1. Implement conversation compaction at 60-80% of context window capacity
2. Limit RAG retrieval to top-3 chunks with a minimum relevance threshold of 0.7
3. Use dynamic tool selection to include only relevant tool definitions
4. Compress tool outputs — return structured summaries instead of raw API responses
Per-Request Budget Caps
Set maximum token and cost budgets per request. If the agent exceeds the budget (e.g., from a retry loop), terminate gracefully instead of continuing to burn tokens.
Implementation Steps
1. Set per-request token limits (e.g., max 50K input tokens, max 4K output tokens)
2. Set per-request cost caps (e.g., $0.50 max per request)
3. Implement circuit breakers: stop retrying after 3 failures
4. Alert when any request exceeds 5x the average cost
Output Token Limits
Output tokens cost 3-5x more than input tokens for most models. Instruct the model to be concise, use max_tokens limits, and avoid unnecessary reasoning in the response.
Implementation Steps
1. Set max_tokens to a reasonable limit for each use case
2. Include 'be concise' instructions in the system prompt
3. For internal agent steps (planning, tool selection), use shorter max_tokens
4. Only allow verbose output for user-facing final responses
Per-Request Budget Enforcement
Wrap your LLM client with a cost tracker that calculates cost on every call and enforces budget limits. This catches runaway requests before they burn through your budget.
// Comprehensive cost tracking and budget management
const MODEL_COSTS_PER_MTOK: Record<string, { input: number; output: number }> = {
"gpt-4o": { input: 2.50, output: 10.00 },
"gpt-4o-mini": { input: 0.15, output: 0.60 },
"claude-sonnet-4-20250514": { input: 3.00, output: 15.00 },
"claude-haiku-3.5": { input: 0.80, output: 4.00 },
};
interface CostTracker {
traceId: string;
userId: string;
feature: string;
totalCostUsd: number;
breakdown: CostEntry[];
}
interface CostEntry {
model: string;
inputTokens: number;
outputTokens: number;
costUsd: number;
step: string;
}
function calculateCost(
model: string,
inputTokens: number,
outputTokens: number,
): number {
const pricing = MODEL_COSTS_PER_MTOK[model];
if (!pricing) throw new Error(`Unknown model: ${model}`);
return (
(inputTokens / 1_000_000) * pricing.input +
(outputTokens / 1_000_000) * pricing.output
);
}
async function runAgentWithBudget(
input: string,
context: RequestContext,
budgetUsd: number = 0.50,
): Promise<AgentResult> {
const tracker: CostTracker = {
traceId: context.traceId,
userId: context.userId,
feature: context.feature,
totalCostUsd: 0,
breakdown: [],
};
const wrappedLlm = {
async generate(params: LlmParams): Promise<LlmResponse> {
const response = await llm.generate(params);
const cost = calculateCost(
params.model,
response.usage.inputTokens,
response.usage.outputTokens,
);
tracker.totalCostUsd += cost;
tracker.breakdown.push({
model: params.model,
inputTokens: response.usage.inputTokens,
outputTokens: response.usage.outputTokens,
costUsd: cost,
step: params.step ?? "unknown",
});
// Budget enforcement
if (tracker.totalCostUsd > budgetUsd) {
log.warn({
event: "budget.exceeded",
traceId: context.traceId,
spent: tracker.totalCostUsd,
budget: budgetUsd,
});
throw new BudgetExceededError(tracker);
}
return response;
},
};
try {
const result = await agent.execute(input, { ...context, llm: wrappedLlm });
// Record cost metrics
await metrics.recordCost(tracker);
return result;
} catch (error) {
await metrics.recordCost(tracker);
throw error;
}
}
The Cost Observability Stack
At minimum, you need three things: (1) Per-request cost calculation using model pricing tables and actual token counts. (2) Cost attribution by user, feature, and model in your metrics system. (3) Budget alerts at the request level ($0.50 max), daily level (per-user caps), and monthly level (feature budgets). Without all three, you are flying blind on the fastest-growing line item in your infrastructure budget.
Interactive Examples
See observability patterns in action with live code
See agent observability in action. Each example shows a bad pattern and its observable fix. Toggle between them to understand the difference.
Capturing agent execution details
// BAD: Unstructured logging that's impossible to query
async function runAgent(input: string) {
console.log("Starting agent...");
console.log("Input: " + input);
const response = await llm.generate({ messages: [{ role: "user", content: input }] });
console.log("Got response: " + response.text);
if (response.toolCalls.length > 0) {
console.log("Tool calls: " + JSON.stringify(response.toolCalls));
for (const call of response.toolCalls) {
const result = await executeTool(call);
console.log("Tool result: " + JSON.stringify(result));
}
}
console.log("Done!");
return response;
}
Why this fails
Unstructured console.log produces flat text that can't be queried, aggregated, or correlated. When something fails in production at 3 AM, you're grepping through megabytes of text with no way to filter by request, trace, or severity.
All Examples Quick Reference
Agent Logging
Capturing agent execution details
Distributed Tracing
Tracing multi-step agent executions
Agent Metrics Collection
Tracking the right KPIs for agent performance
Error Context Capture
Capturing enough context to debug failures
Cost Attribution
Tracking token costs per feature and user
Dashboard Design
Building actionable observability dashboards
Intelligent Alerting
Alerts that wake you up for the right reasons
Anti-Patterns & Failure Modes
Log blindness, metric vanity, and how to avoid them
Knowing what not to do is as important as knowing what to do. These are the most common observability anti-patterns in AI agent systems — each one has been observed in production systems and leads to predictable failure modes.
Log Blindness
Agent runs in production with no structured logging, making it impossible to understand what happened after the fact.
Cause
Using console.log with unstructured strings, or not logging at all. No correlation IDs between related log entries. Logging the wrong things (raw prompts instead of metadata).
Symptom
When a user reports a bad response, the team spends hours trying to reproduce it. Debugging requires reading raw log files with grep. No way to answer 'what did the agent see when it made this decision?'
Fix
Implement structured logging from day one. Every log entry should be JSON with a traceId, event type, timestamp, and relevant metadata. Use a logging library like Pino or Winston with child loggers for context propagation.
Broken Trace Hierarchy
Multi-step agent executions have no parent-child relationships, making it impossible to reconstruct the full execution path.
Cause
Each LLM call and tool execution is logged independently without a shared trace ID. When the agent calls a tool that calls another service, the trace breaks.
Symptom
You can see individual LLM calls but can't connect them into a full agent execution. Debugging a 5-step agent requires manually correlating timestamps across different log streams.
Fix
Use distributed tracing (OpenTelemetry) with span hierarchies. Every agent execution gets a root span, and each LLM call, tool execution, and sub-agent invocation gets a child span with the parent's trace context propagated.
Metric Vanity
Dashboards show impressive numbers (total requests, average latency, success rate) that never reveal actual problems.
Cause
Tracking averages instead of percentiles. Counting 'success' as 'didn't throw an error' rather than 'produced a correct response'. No breakdown by feature, model, or user segment.
Symptom
Dashboard shows 99% success rate and 500ms average latency while users complain about slow, wrong answers. P99 latency is actually 15 seconds. 'Successful' responses include hallucinated answers.
Fix
Replace averages with percentile distributions (p50, p95, p99). Define 'success' as verified-correct, not just non-error. Break down every metric by model, feature, user segment, and error type. If a metric can't trigger an action, remove it from the dashboard.
Cost Blindness
No per-request or per-feature cost tracking. The team discovers cost problems when the monthly invoice arrives.
Cause
Token usage and model pricing aren't tracked at the request level. No cost attribution to features, users, or conversations. No budget alerts or spending limits.
Symptom
Monthly LLM bill doubles with no explanation. One verbose system prompt or retry loop silently consumes thousands of dollars. The team can't answer 'which feature costs the most?' or 'which users are most expensive?'
Fix
Calculate cost per request using model pricing tables. Attribute costs to features, users, and conversation IDs. Set per-request cost alerts (e.g., flag anything over $0.50). Implement daily budget caps per feature and user.
Alert Fatigue
Too many low-quality alerts cause the on-call team to mute notifications, missing critical incidents.
Cause
Alerting on every individual error instead of error rates. No severity tiers. No distinction between transient failures and systematic problems. Static thresholds instead of anomaly detection.
Symptom
200+ alerts per day, all with the same severity. On-call engineer mutes Slack channel after the first shift. A real production incident at 2 AM goes unnoticed for hours because nobody is watching.
Fix
Use SLO-based alerting with error budget burn rates. Tier alerts: info goes to logs, warnings go to Slack, only critical issues page on-call. Every alert must have a runbook. If an alert fires more than 5 times without action, fix it or delete it.
Irreproducible Failures
When an agent produces a bad result, it's impossible to reproduce because the exact inputs, context, and state aren't captured.
Cause
Not storing the full message history, tool results, and retrieved documents that the agent saw at execution time. Relying on 'just re-run it' in a non-deterministic system.
Symptom
User reports a bad answer. Developer tries to reproduce it with the same input but gets a different (correct) answer. The bug can't be investigated because the original context is gone forever.
Fix
Store the full execution context (input, system prompt, messages, tool results, retrieved documents, model parameters) for every request, keyed by trace ID. Build a replay tool that can feed this exact context back to the model for deterministic debugging.
Best Practices Checklist
Production-ready observability guidelines
Production-ready observability guidelines distilled from SRE practices, LLM platform teams, and the observability community. These are the standards every agent system should meet before going to production.
Assign a trace ID to every request from the start
Generate a UUID at the API gateway and propagate it through every LLM call, tool execution, and sub-agent invocation. This is the single most important observability practice.
Use structured JSON logs, never console.log
Every log entry should be a JSON object with traceId, timestamp, event type, and relevant metadata. Use libraries like Pino (Node.js) for zero-overhead structured logging.
Log decisions, not just actions
Don't just log 'called tool X'. Log why: 'selected tool X because query matched pattern Y with confidence 0.87'. Decision context is the most valuable debugging information.
Implement trace sampling in high-volume production
At scale, tracing every request is expensive. Sample 10-20% of normal traffic, but always trace 100% of errors and slow requests (tail-based sampling).
Track percentiles, not averages
P50, P95, and P99 latencies reveal what users actually experience. An average of 500ms hides the fact that 1% of requests take 30 seconds.
Define 'success' as verified-correct, not non-error
An LLM response that doesn't throw an error but hallucinates an answer is not a success. Use eval-based quality checks to track actual correctness.
Break down every metric by model, feature, and user segment
Aggregate metrics hide problems. A 95% success rate overall might mean 99% for simple queries and 70% for complex reasoning tasks.
Build dashboards organized by concern
Group panels into health (latency, errors, throughput), quality (correctness, hallucination rate), and cost (tokens, spend). Every panel should answer a specific operational question.
Use SLO-based alerting with error budget burn rates
Alert based on how fast you're consuming your error budget, not on individual errors. A fast burn (14.4x over a 1-hour window) pages immediately. A slow burn (6x over a 6-hour window) sends a Slack notification.
Every alert must have a runbook
An alert without a runbook is just noise. Document what the alert means, likely causes, and step-by-step investigation and remediation procedures.
Tier alert severity and routing
Info goes to logs, warnings go to Slack, only critical issues page on-call. If everything is critical, nothing is.
Conduct post-incident reviews for AI-specific failures
Traditional post-mortems need adaptation for AI systems. Include: what context did the agent see? Was it a model failure or a context failure? Would better observability have caught this earlier?
Classify every failure: model, context, tool, or orchestration
The fix depends on the failure type. A hallucination (model failure) needs a different fix than missing RAG documents (context failure) or a tool timeout (tool failure). Trace-based debugging makes classification possible.
Store full execution context for replay debugging
Capture the input, system prompt, messages, tool results, and model parameters for every request (or a sample). This enables reproducing non-deterministic failures by feeding the exact same context back to the model.
Build a debugging workflow that starts from the trace
Every investigation should start with the trace ID. Pull up the trace tree, identify the failing span, inspect its inputs and outputs, classify the failure, and then fix. Never start by guessing.
Add failing cases to your eval dataset automatically
When you debug and fix a failure, add the input/expected output pair to your evaluation dataset. This prevents regression and builds a growing safety net of test cases from real production issues.
Track cost per request, per feature, and per user
Calculate cost using model pricing tables at the request level. Attribute to features and users. This is the only way to answer 'where is the money going?'
Set budget alerts and per-request cost caps
Alert when a single request exceeds $0.50 or when hourly spend exceeds 2x the 7-day average. Implement circuit breakers for runaway costs.
Monitor token efficiency over time
Track the ratio of useful output tokens to total tokens consumed. A rising ratio means your context engineering is improving. A falling ratio means you're wasting tokens on noise.
Implement model routing based on complexity
Not every request needs GPT-4o. Route simple queries to cheaper models and save expensive models for complex reasoning. Track cost savings from routing decisions.
The Guiding Principle
At any point in time, you should be able to answer three questions about your agent: Is it working correctly? How much is it costing? Where are the bottlenecks? If you cannot answer all three from your observability data within 60 seconds, your observability stack has gaps.
— Adapted from SRE best practices applied to AI systems
Resources & Further Reading
Tools, docs, repos, and guides
Essential tools, documentation, repositories, and guides for building observable AI agent systems. From open-source platforms to industry best practices.
Langfuse — Open Source LLM Observability
Open-source LLM observability platform with tracing, prompt management, evaluations, and cost tracking. Self-hostable.
LangSmith — LangChain Observability Platform
Production monitoring and debugging platform for LLM applications. Deep integration with LangChain/LangGraph but works with any framework.
Arize Phoenix — ML & LLM Observability
Open-source observability for LLMs with trace visualization, retrieval analysis, and evaluation tools. Built for debugging RAG and agent systems.
Helicone — LLM Gateway & Observability
Proxy-based LLM observability that requires just one line of code. Automatic cost tracking, latency monitoring, and request logging.
Braintrust — AI Product Evaluation
End-to-end platform for evaluating, monitoring, and improving AI products. Combines evals, logging, and prompt playground.
OpenTelemetry for LLM Observability
The vendor-neutral standard for distributed tracing. Use OTEL SDKs to instrument your agents and export to any backend.
OpenLLMetry — OpenTelemetry for LLMs
Open-source auto-instrumentation for LLM frameworks using OpenTelemetry. Automatically traces LangChain, OpenAI, and more.
Honeycomb Guide to Observability
Charity Majors' team explains the difference between monitoring and observability. Essential reading for understanding observability principles.
Observability for LLM-based Agents
LangChain's guide to building observable LLM applications. Covers tracing, evaluation, and production monitoring patterns.
Building Observable AI Systems
Anthropic's engineering blog with insights on building production AI systems including monitoring and evaluation best practices.
Recommended Starting Point
If you are just getting started with agent observability, begin with Langfuse (open-source, generous free tier) for tracing and cost tracking, read the Honeycomb guide to understand observability principles, and add OpenTelemetry instrumentation for vendor-neutral trace collection. You can always switch backends later without re-instrumenting your code.