The Complete Guide — Tool Use & MCP


Master tool design for AI agents and the Model Context Protocol. From function calling fundamentals to production MCP servers — give your agents the tools they need to act in the real world.

1

Tool Use Fundamentals

How LLMs call functions and use results

Tool use (also called function calling) is the mechanism that lets LLMs interact with the outside world. Instead of only generating text, the model can output structured requests to call functions you define — fetching real-time data, executing actions, and accessing systems the model has no direct connection to.

This is what separates a chatbot from an agent. Without tools, a model can only reason over its training data and the text in its context window. With tools, it can check live databases, call APIs, read files, run code, and take actions in the real world.

The Tool Call Lifecycle

Every tool interaction follows a 4-step lifecycle. Understanding this loop is essential — it determines where you add validation, error handling, and security controls.

1

Tool Definition

You define tools with names, descriptions, and JSON Schema parameters. These definitions are injected into the model's context alongside the system prompt.

Each tool definition consumes context tokens. A tool with a complex schema can use 200-500 tokens just for its definition.

2

Model Decides to Call a Tool

Based on the user's message and available tools, the model outputs a structured tool call instead of a text response. It selects the tool name and generates parameters.

The model returns a special "tool_use" content block with the tool name and arguments as JSON. It can also include text reasoning before the tool call.

3

Your Code Executes the Tool

Your application receives the tool call, validates the parameters, executes the underlying function (API call, database query, etc.), and captures the result.

This is where validation, sandboxing, error handling, and logging happen. The model never executes code directly — your application is always in the loop.

4

Result Injected Back into Context

The tool result is sent back to the model as a "tool_result" message. The model then generates its final response using both the original context and the tool output.

Tool results consume context tokens. A large result (raw API response) can waste thousands of tokens. Keep results minimal and structured.
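The four steps above can be sketched in code. This is a minimal, SDK-free sketch: the `registry`, `runToolCalls`, and message shapes are hypothetical stand-ins for whatever your model API actually returns, not a specific vendor's format.

```typescript
// Minimal, SDK-free sketch of the lifecycle. The shapes below are
// hypothetical stand-ins for a real model API's tool-call payloads.

type ToolCall = { id: string; name: string; args: Record<string, unknown> };
type ToolResultMsg = { type: "tool_result"; tool_call_id: string; content: string };

// Step 1: tool definitions live in a registry your application owns.
const registry: Record<string, (args: any) => Promise<string>> = {
  get_weather: async (args) => JSON.stringify({ city: args.city, tempC: 18 }),
};

// Step 2: the model "decides" by emitting a structured call (canned here).
const modelResponse: { toolCalls: ToolCall[] } = {
  toolCalls: [{ id: "call_1", name: "get_weather", args: { city: "London" } }],
};

// Step 3: your code validates and executes; step 4: results go back.
async function runToolCalls(calls: ToolCall[]): Promise<ToolResultMsg[]> {
  const results: ToolResultMsg[] = [];
  for (const call of calls) {
    const handler = registry[call.name];
    const content = handler
      ? await handler(call.args) // execution happens here, never in the model
      : JSON.stringify({ error: `Unknown tool: ${call.name}` });
    results.push({ type: "tool_result", tool_call_id: call.id, content });
  }
  return results; // appended to the conversation as the next message
}

runToolCalls(modelResponse.toolCalls).then((r) => console.log(r));
```

Note that the unknown-tool branch returns a structured error rather than throwing: the model can recover from a bad tool name, but only if the failure comes back as a result it can read.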

Advanced: Parallel Tool Calls

When a task requires multiple independent pieces of information, the model can emit multiple tool calls in a single response. Your application executes them concurrently and returns all results at once.

// Model emits two tool calls in one response:
[
  { tool: "get_weather", args: { city: "New York" } },
  { tool: "get_weather", args: { city: "London" } },
]

// Your app executes both concurrently:
const results = await Promise.all(
  toolCalls.map((call) => executeTool(call))
);

// Both results sent back → model gives unified answer
- Reduces round trips between model and application
- Enables concurrent execution for faster total latency
- Natural fit for questions like "Compare weather in NYC and London"
- Model reasons over all results together for a unified response

Key Principles

The Model Never Executes Code

Tool use is structured output — the model generates JSON describing what to call and with what parameters. Your application handles all execution. This is a fundamental security boundary.

Tool Calls Are Another Turn in the Conversation

A tool call creates a new message pair in the conversation: the assistant's tool_use message and the tool_result response. Both consume context tokens and persist in history.

The Model Sees Tool Definitions as Context

Every tool definition is serialized into the context window alongside the system prompt. 20 tools with complex schemas can consume 5,000-10,000 tokens before any conversation happens.

Tool Results Replace the Model's Knowledge

When a tool returns data, the model treats it as ground truth. This is powerful (real-time data) but dangerous (if the tool returns wrong data, the model will confidently present it).

Key Insight

Tool use is not code execution — it is structured output. The model generates JSON that describes a desired action, and your application decides whether and how to execute it. This distinction is the foundation of safe, reliable agent systems. You are always in the loop.

2

Designing Good Tools

Clear names, schemas, and descriptions

Tool design is API design for AI models. The same principles that make APIs developer-friendly — clear naming, good documentation, typed schemas — make tools model-friendly. But there are unique considerations: models read descriptions literally, can't ask clarifying questions, and will creatively misuse ambiguous interfaces.

Naming Conventions

Use verb_noun format

Good

search_orders, create_ticket, get_user

Bad

handle_data, process, doStuff

The verb tells the model what action to take. The noun tells it what entity is affected. Together they communicate intent unambiguously.

Be specific, not generic

Good

list_open_tickets, get_order_by_id

Bad

get_items, fetch_data, query

Generic names force the model to rely on descriptions alone. Specific names provide instant signal for tool selection.

Avoid overlapping names

Good

search_orders (by query), get_order (by ID)

Bad

find_orders, search_orders, lookup_orders

Overlapping names cause the model to pick randomly between near-identical tools. Each tool name should be clearly distinguishable.

Use consistent conventions

Good

get_ (read), create_ (write), update_ (modify), delete_ (remove)

Bad

Mixing get_, fetch_, retrieve_, find_, lookup_ for reads

Consistent prefixes create a predictable vocabulary. The model learns that get_ always means read, create_ always means write.

Critical: Anatomy of a Good Tool Description

Tool descriptions are the model's only guide for choosing and using your tools. A good description has these parts:

Required

What it does

Retrieves order details including items, status, and shipping info.

Required

When to use it

Use when the user asks about a specific order and has an order ID.

Helpful

When NOT to use it

Do not use for searching orders — use search_orders instead.

Required

What it returns

Returns order status, item list, shipping address, and payment info.

Helpful

Edge cases

Returns an error if the order ID is invalid or the order was deleted.

Parameter Design

Add descriptions to every parameter

// Good: description explains format and purpose
{
  order_id: {
    type: "string",
    description: "Order ID in format 'ORD-1234'"
  }
}

// Bad: no description
{ id: { type: "string" } }

Use enums for constrained values

// Good: model can only pick valid values
{
  status: {
    type: "string",
    enum: ["open", "closed", "pending"],
    description: "Filter by ticket status"
  }
}

// Bad: model has to guess valid values
{ status: { type: "string" } }

Set bounds on numbers and arrays

// Good: clear boundaries
{
  limit: {
    type: "number",
    minimum: 1,
    maximum: 100,
    default: 10,
    description: "Max results to return (1-100)"
  },
  tags: {
    type: "array",
    items: { type: "string" },
    maxItems: 10,
    description: "Filter by tags (max 10)"
  }
}

Mark required vs optional clearly

{
  type: "object",
  properties: {
    query: {
      type: "string",
      description: "Search query (required)"
    },
    page: {
      type: "number",
      description: "Page number (optional, default: 1)"
    }
  },
  required: ["query"]
}
Tool Granularity — Finding the Right Level

Each tool should represent one logical action at the right level of abstraction. Too fine forces unnecessary orchestration. Too coarse creates a confusing mega-tool.

Too Fine
open_db_connection, write_sql_query, execute_query, parse_results, close_connection

Forces the model to orchestrate low-level steps it shouldn't need to know about. 5 tool calls instead of 1.

Just Right
search_orders(query, filters) - handles connection, query, parsing internally

One tool call, one logical action. Internal complexity is hidden behind a clean interface.

Too Coarse
manage_database(action, table, data, query, filters, sort, limit, offset, join, ...)

Swiss-army-knife tool with 15 parameters. Model has to figure out which combination to use. Errors are hard to diagnose.

The Output Rule

Tool outputs go back into the context window. Every unnecessary field, every raw API dump, every untruncated list competes for the model's attention and eats your token budget. Return only what the model needs to formulate its response — typically 10-20% of the raw data. A well-designed tool output is the highest-leverage context optimization you can make.
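A sketch of the output rule in practice, assuming a hypothetical `RawOrder` payload: the tool returns an explicit projection of the fields the model needs, and everything else stays behind the boundary.

```typescript
// Trimming a raw API payload at the tool boundary. Field names here
// are hypothetical; the point is the explicit allowlist of user-facing
// fields — everything else never reaches the model's context.

type RawOrder = {
  id: string;
  status: string;
  items: { sku: string; qty: number }[];
  internal_notes: string;   // must never reach the model
  warehouse_routing: object;
  audit_blob: string;
};

function toToolOutput(order: RawOrder) {
  // Return only what the model needs to formulate its response.
  return {
    id: order.id,
    status: order.status,
    itemCount: order.items.length,
  };
}

const raw: RawOrder = {
  id: "ORD-1234",
  status: "shipped",
  items: [{ sku: "A1", qty: 2 }],
  internal_notes: "do not expose",
  warehouse_routing: {},
  audit_blob: "...",
};

console.log(JSON.stringify(toToolOutput(raw)));
```

Building the projection as a typed function (rather than deleting fields from the raw object) means a new sensitive field added upstream is excluded by default.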

3

Model Context Protocol (MCP)

The standard for agent-tool communication

The Model Context Protocol (MCP) is an open standard that defines how AI applications connect to external tools and data sources. Released by Anthropic in November 2024, MCP replaces the fragmented landscape of custom tool integrations with a single, universal protocol.

Think of it like USB-C for AI: before MCP, every AI app needed custom connectors for every tool. With MCP, you build one server and it works with every MCP-compatible client — Claude Desktop, Cursor, Zed, Windsurf, and any custom application.

MCP is an open protocol that standardizes how applications provide context to LLMs. Think of MCP like a USB-C port for AI applications.

David Soria Parra

Creator of MCP, Anthropic

By giving Claude access to tools, you can extend its capabilities far beyond text generation. Tools let Claude interact with external systems, fetch real-time data, and take actions in the world.

Anthropic

Tool Use Documentation

The quality of your tools defines the ceiling of your agent's capabilities. A model can reason perfectly and still fail if its tools are poorly designed.

Alex Albert

Prompt Engineering Lead, Anthropic

Before MCP: The Problem

N x M Integration Problem

Every AI application needed custom code for every data source. 5 AI apps and 5 tools meant 25 separate integrations to build and maintain.

No Discovery Mechanism

Tools were hardcoded. There was no way for a client to ask a server 'What tools do you have?' — you had to know the API in advance.

Inconsistent Schemas

Every tool server defined parameters differently. No standard for describing tool inputs, outputs, or error formats.

No Capability Negotiation

Clients couldn't check if a server supported the features they needed. Version mismatches caused silent failures.

Architecture

MCP uses a client-server architecture with three layers: hosts that run AI applications, clients that manage protocol connections, and servers that expose capabilities.

MCP Hosts

Applications that want to access tools and data via MCP. Examples: Claude Desktop, IDE extensions, custom AI applications. The host creates and manages MCP client instances.

Examples:

Claude Desktop, Cursor, Zed, Windsurf, custom apps

MCP Clients

Protocol clients that maintain 1:1 connections with MCP servers. Each client connects to one server and handles the JSON-RPC communication protocol. Created by hosts.

Examples:

Built into the host — one client per server connection

MCP Servers

Lightweight programs that expose tools, resources, and prompts through the standard MCP protocol. Each server typically wraps one data source or service.

Examples:

GitHub server, PostgreSQL server, filesystem server, Slack server

The Three MCP Primitives

MCP servers expose three types of capabilities. Each has a different control flow — understanding who initiates each is key.

Core

Tools

Model-initiated

Functions the model can call. Each tool has a name, description, and parameter schema. The server defines them; the model invokes them through the client.

search_issues, run_query, send_message
Core

Resources

Application-initiated

Data sources the client can read, like files or database records. Resources are identified by URIs and can be watched for changes. Better than tools for frequently accessed read-only data.

file:///config.json, db://users/123, git://repo/main
Optional

Prompts

User-initiated

Reusable prompt templates that the server provides. Useful for standardized interactions — the server knows the best way to phrase requests for its domain.

summarize_pr, explain_error, generate_migration
Protocol Lifecycle
1

Initialize

Client sends 'initialize' with its protocol version and capabilities. Server responds with its own version, capabilities, and server info. Both sides learn what the other supports.

2

Capability Discovery

Client queries available tools (tools/list), resources (resources/list), and prompts (prompts/list). Server returns schemas for each. Client now knows everything the server offers.

3

Operation

Client calls tools (tools/call), reads resources (resources/read), or renders prompts (prompts/get). Server executes and returns results. Notifications flow both ways for events like resource changes.

4

Shutdown

Either side can close the connection cleanly. The transport layer (stdio pipe or SSE connection) is closed gracefully.

Transport Layers

stdio (Standard I/O)

Communication over stdin/stdout. The client launches the server as a child process. Simple, secure (no network exposure), and the default for local tools.

Best for: Local development, desktop apps, CLI tools
No network setup · Secure by default · Simple debugging

SSE (Server-Sent Events)

HTTP-based transport using Server-Sent Events for server-to-client messages and HTTP POST for client-to-server messages. Enables remote servers.

Best for: Remote servers, cloud deployments, shared tools
Works over network · Firewall-friendly (HTTP) · Scalable

Key Insight

MCP turns the N x M integration problem into N + M. Instead of every AI app building custom connectors for every tool, tool authors build one MCP server and app developers add one MCP client. A new tool instantly works with every MCP-compatible application.

4

Building MCP Servers

Create tools any MCP client can use

Building an MCP server means creating a program that exposes tools, resources, and prompts through the standard MCP protocol. The official TypeScript SDK handles protocol details — you focus on the tools themselves.

An MCP server is typically a small, focused program: 100-500 lines for a well-scoped server. It wraps a single data source or service (GitHub, a database, Slack) and makes it available to any MCP client.

Building a Server, Step by Step

1. Initialize the Server

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";

const server = new McpServer({
  name: "my-server",
  version: "1.0.0",
});

Create an McpServer instance with a name and version. The name helps clients identify your server; the version enables capability negotiation.

2. Define Tools with Zod Schemas

import { z } from "zod";

server.tool(
  "search_issues",
  "Search GitHub issues by query, label, or status. " +
    "Returns up to 20 matching issues with title, status, and URL.",
  {
    query: z.string().describe("Search query string"),
    label: z.string().optional().describe("Filter by label name"),
    state: z.enum(["open", "closed", "all"]).default("open")
      .describe("Filter by issue state"),
    limit: z.number().int().min(1).max(50).default(20)
      .describe("Max results to return"),
  },
  async ({ query, label, state, limit }) => {
    const issues = await github.searchIssues({ query, label, state, limit });
    return {
      content: [
        {
          type: "text" as const,
          text: JSON.stringify(
            issues.map((issue) => ({
              number: issue.number,
              title: issue.title,
              state: issue.state,
              url: issue.html_url,
              labels: issue.labels.map((l) => l.name),
            })),
            null,
            2,
          ),
        },
      ],
    };
  },
);

server.tool() registers a tool with name, description, Zod schema for parameters, and an async handler. The SDK automatically validates inputs against the schema. Return content as an array of typed content blocks.

3. Define Resources

server.resource(
  "repo-readme",
  "github://repo/README.md",
  { mimeType: "text/markdown" },
  async (uri) => {
    const content = await github.getFileContent("README.md");
    return {
      contents: [
        {
          uri: uri.href,
          mimeType: "text/markdown",
          text: content,
        },
      ],
    };
  },
);

Resources expose data the client can read without the model having to call a tool. URI-based addressing (github://repo/README.md) makes resources discoverable and cacheable.

4. Connect a Transport

import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";

const transport = new StdioServerTransport();
await server.connect(transport);

// For remote servers, use SSE instead:
// import { SSEServerTransport } from
//   "@modelcontextprotocol/sdk/server/sse.js";
// app.get("/sse", (req, res) => {
//   const transport = new SSEServerTransport("/messages", res);
//   server.connect(transport);
// });

StdioServerTransport communicates over stdin/stdout — the client launches your server as a child process. For remote access, use SSEServerTransport over HTTP. The server logic is identical regardless of transport.

Config: Connecting to Claude Desktop

Add your server to Claude Desktop by editing the configuration file. Claude will automatically launch the server and discover its tools.

// claude_desktop_config.json
{
  "mcpServers": {
    "my-server": {
      "command": "node",
      "args": ["path/to/my-server/dist/index.js"],
      "env": { "API_KEY": "your-api-key" }
    }
  }
}

Real-World Server Patterns

Database Query Server

run_query, list_tables, describe_table

Exposes read-only SQL query execution with schema introspection. Uses parameterized queries to prevent injection. Returns column names and rows as structured data.

Key pattern: Read-only connection, query validation, result truncation

Git Repository Server

search_code, get_file, list_branches, get_diff

Provides code search, file reading, branch listing, and diff viewing for a git repository. Resources expose frequently accessed files like README and config.

Key pattern: Resources for static files, tools for dynamic queries

Notification Server

send_email, send_slack, list_channels

Sends messages through email and Slack. Requires explicit confirmation for sends (returns preview first). Rate-limited to prevent spam.

Key pattern: Confirmation step, rate limiting, audit logging
Testing Your MCP Server

MCP Inspector

Official interactive testing tool. Connect your server and manually test each tool, view schemas, and verify error handling. Best for development and debugging.

npx @modelcontextprotocol/inspector your-server.js

Unit Tests with In-Memory Transport

Create an InMemoryTransport pair for automated testing. Call tools programmatically and assert on responses. No external dependencies needed.

vitest run --coverage

Claude Desktop Integration

Add your server to Claude Desktop's config for end-to-end testing. Verify the model can discover, select, and correctly use your tools in real conversations.

Edit ~/Library/Application Support/Claude/claude_desktop_config.json

Key Insight

The best MCP servers are small and focused. A server that wraps GitHub with 5 well-designed tools is better than a “universal” server with 50 tools across 10 services. Clients can compose multiple focused servers. Each server should do one thing well.

5

Tool Selection & Routing

Dynamic tool loading for large registries

Tool selection is how you decide which tools the model sees for each request. With a small tool set (under 10), you can include all tools every time. But as your registry grows past 20 tools, you need dynamic selection — otherwise the model drowns in definitions and picks the wrong tool.

This is one of the highest-leverage optimizations in tool-using agents. Research from UC Berkeley (Gorilla) and LangChain (Bigtool) shows that tool selection accuracy drops dramatically as tool count increases — from 95% with 10 tools to 30% with 50+.

The Tool Scaling Problem

More tools means more context tokens, more confusion, and worse selection accuracy. These numbers are approximate but reflect consistent findings across multiple research papers.

Tool Count   Accuracy   Context Tokens   Verdict
5-10         95%+       ~1,500           Optimal
10-20        85-90%     ~4,000           Acceptable
20-50        50-70%     ~10,000          Degraded
50+          30-40%     15,000+          Broken

Selection Strategies

Recommended: RAG over Tool Descriptions (LangChain Bigtool)

Embed all tool descriptions in a vector database. When a request arrives, embed the user message and find the most similar tool descriptions. Return only the top-k matches.

How It Works

  1. Index each tool's name + description as a document in a vector DB
  2. On each request, embed the user message
  3. Search for top-k most similar tool descriptions (k=5-10)
  4. Include matched tools in the LLM call alongside core tools
  5. Model sees only relevant tools, improving selection accuracy
Use when: Registries with 20+ tools. The standard approach for production agents.
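A toy sketch of the retrieval step. A production system would use a real embedding model and a vector DB; here a token-overlap score stands in for semantic similarity so the flow is runnable end to end. `selectTools` and the scoring function are illustrative, not a library API.

```typescript
// RAG-style tool selection, with token overlap standing in for a real
// embedding similarity search. The registry entries are hypothetical.

type ToolDef = { name: string; description: string };

const tokens = (s: string) =>
  new Set(s.toLowerCase().split(/\W+/).filter(Boolean));

// Crude similarity proxy: shared tokens, normalized by description size.
function score(query: string, tool: ToolDef): number {
  const q = tokens(query);
  const d = tokens(`${tool.name} ${tool.description}`);
  let overlap = 0;
  q.forEach((t) => { if (d.has(t)) overlap++; });
  return overlap / Math.sqrt(d.size || 1);
}

// Return only the top-k matches to include in the LLM call.
function selectTools(query: string, registry: ToolDef[], k = 5): ToolDef[] {
  return [...registry]
    .sort((a, b) => score(query, b) - score(query, a))
    .slice(0, k);
}

const registry: ToolDef[] = [
  { name: "search_orders", description: "Search orders by query and filters" },
  { name: "create_ticket", description: "Create a support ticket for an issue" },
  { name: "get_user", description: "Get a user profile by user ID" },
];

console.log(selectTools("find my recent orders", registry, 2).map((t) => t.name));
```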
Simple: Category-Based Routing (Common Pattern)

Group tools into categories (billing, support, admin). Use a lightweight classifier to route to the right category first, then provide all tools within that category.

How It Works

  1. Define tool categories: billing (5 tools), support (8 tools), admin (4 tools)
  2. Use intent classification to detect the category from the user message
  3. Load all tools in the matched category
  4. Optionally use RAG within a category for further refinement
  5. Fast and predictable — no embedding needed for the routing step
Use when: Well-defined domains with clear boundaries. Good when tools naturally cluster.
Advanced: Two-Stage Selection (Production Pattern)

Combine coarse and fine selection. First, use category routing or keyword matching to narrow to 15-20 candidate tools. Then use RAG or a small model to pick the final 5-8.

How It Works

  1. Stage 1: Fast filter using keywords, categories, or a classifier
  2. Stage 2: Semantic search over the filtered subset for precise matching
  3. Always include 'core' tools (think, respond_to_user) regardless of routing
  4. Cache frequently co-occurring tool sets to skip routing for common queries
  5. Monitor which tools actually get selected to optimize routing rules
Use when: Very large registries (100+ tools) or latency-sensitive applications.
Core vs Dynamic Tools

Separate your tools into two groups: core tools that are always available (reasoning, responding) and dynamic tools that are loaded per-request based on relevance.

Core
think

Always available. Lets the model reason through complex decisions before acting.

Core
respond_to_user

Always available. Sends a final response to the user.

Core
request_clarification

Always available. Asks the user for more information when the query is ambiguous.

Dynamic
search_orders

Dynamically loaded when the user's message relates to orders, purchases, or shipments.

Dynamic
create_ticket

Dynamically loaded when the user's message relates to support, issues, or escalation.
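One way the split might look in code. The regex triggers are a deliberately crude stand-in for intent classification or RAG routing; the point is that core tools are always included and dynamic tools are appended per request.

```typescript
// Composing core and dynamic tools per request. Trigger patterns are a
// hypothetical stand-in for a real router (classifier or RAG).

type Tool = { name: string; group: "core" | "dynamic"; triggers?: RegExp };

const CORE: Tool[] = [
  { name: "think", group: "core" },
  { name: "respond_to_user", group: "core" },
  { name: "request_clarification", group: "core" },
];

const DYNAMIC: Tool[] = [
  { name: "search_orders", group: "dynamic", triggers: /order|purchase|shipment/i },
  { name: "create_ticket", group: "dynamic", triggers: /support|issue|escalat/i },
];

function toolsForRequest(userMessage: string): string[] {
  const matched = DYNAMIC.filter((t) => t.triggers?.test(userMessage));
  // Core tools always ride along, regardless of routing.
  return [...CORE, ...matched].map((t) => t.name);
}

console.log(toolsForRequest("Where is my order?"));
```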

Key Insight

Dynamic tool selection is not premature optimization — it is essential architecture for any agent with more than 15-20 tools. Without it, you are paying the full context cost of every tool definition on every request and accepting a 3x degradation in tool selection accuracy. The investment in a tool index pays back on every single request.

6

Error Handling & Retries

Graceful failure when tools break

Tools call external APIs, query databases, and interact with services that can fail. Error handling determines whether a tool failure crashes the agent, confuses the model, or gets handled gracefully with a recovery path.

The key principle: never return raw errors to the model. Stack traces waste tokens, expose internals, and confuse the model. Instead, return structured error objects that tell the model exactly what went wrong and what to do next.

Error Categories

Validation Errors

Retryable

The model sent invalid parameters — wrong type, missing required field, value out of range. These are the most common tool errors.

ZodError: amount must be a positive number (received -50)
Strategy: Return field-level error details and let the model self-correct. The model is usually excellent at fixing parameter errors when given clear feedback.

Not Found Errors

Retryable

The requested resource does not exist — invalid ID, wrong table name, deleted record.

No order found with ID 'ORD-9999'. Valid format: ORD-XXXX.
Strategy: Return what valid IDs or resources look like. If possible, suggest an alternative (e.g., list similar records).

Permission Errors

Non-retryable

The tool does not have permission to perform the requested action — restricted path, insufficient API scope.

Permission denied: cannot access /etc/passwd. Allowed: /workspace/*
Strategy: Return a clear denial. Do NOT suggest workarounds that bypass security. Mark as non-retryable.

Transient Errors

Retryable

Temporary failures — network timeouts, rate limits, service unavailability. The operation might succeed if retried.

Service temporarily unavailable. Retry in 5 seconds.
Strategy: Implement automatic retry with exponential backoff. If all retries fail, return a structured error with estimated recovery time.
Designing Error Responses for Models

Every tool should return a consistent error structure. These fields give the model everything it needs to handle the failure intelligently.

success
boolean

Lets the model (and your code) quickly determine if the tool call worked.

error.code
string

Machine-readable error category: VALIDATION_ERROR, NOT_FOUND, PERMISSION_DENIED, TIMEOUT.

error.message
string

Human-readable explanation the model can relay to the user or use for self-correction.

error.retryable
boolean

Tells the model whether to retry or give up. Prevents infinite retry loops.

error.suggestion
string?

Specific guidance: 'Available tables: users, orders' or 'Try again in 5 seconds.'
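The field list above could be typed like this. The discriminated union and the `notFound` helper are one possible encoding, not a standard.

```typescript
// A structured error envelope covering the fields described above.
// Codes and helper names are illustrative choices, not a spec.

type ToolError = {
  code: "VALIDATION_ERROR" | "NOT_FOUND" | "PERMISSION_DENIED" | "TIMEOUT";
  message: string;      // human-readable, safe to relay to the model
  retryable: boolean;   // prevents infinite retry loops
  suggestion?: string;  // specific guidance for self-correction
};

// success=true and success=false are mutually exclusive shapes, so the
// model (and your code) can branch on a single field.
type ToolResult<T> =
  | { success: true; data: T }
  | { success: false; error: ToolError };

function notFound(message: string, suggestion?: string): ToolResult<never> {
  return {
    success: false,
    error: { code: "NOT_FOUND", message, retryable: true, suggestion },
  };
}

const result = notFound(
  "No order found with ID 'ORD-9999'.",
  "Valid format: ORD-XXXX. Use search_orders to find an order by customer.",
);
console.log(JSON.stringify(result));
```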

Pattern: Retry with Exponential Backoff
1

Classify the error

Determine if the error is transient (retry) or permanent (fail fast). Rate limits, timeouts, and 503s are transient. Auth failures, 404s, and validation errors are permanent.

2

Apply exponential backoff

Wait 1s, then 2s, then 4s between retries. Add random jitter (0-500ms) to prevent thundering herd. Cap at 3-5 retries maximum.

3

Circuit breaker on repeated failures

If a tool fails 5 times in 60 seconds, trip the circuit breaker. Return instant failure for the next 30 seconds instead of waiting for timeouts.

4

Fallback response

When all retries fail and the circuit is open, return a graceful fallback: cached data, a degraded response, or an explicit 'service unavailable' message.
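Steps 1 and 2 of the pattern can be sketched as a wrapper. `withRetry`, the `isTransient` predicate, and the `baseMs` knob are hypothetical names; the backoff math (doubling delays plus jitter, capped retries) follows the steps above.

```typescript
// Retry wrapper: classify, back off exponentially with jitter, cap retries.
// Names are illustrative; plug in your own transient-error classifier.

const sleep = (ms: number) => new Promise<void>((r) => setTimeout(r, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  maxRetries = 3,
  baseMs = 1000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Permanent errors and exhausted budgets fail fast.
      if (!isTransient(err) || attempt >= maxRetries) throw err;
      // Exponential backoff: 1s, 2s, 4s... plus jitter (half the base).
      const delay = baseMs * 2 ** attempt + Math.random() * (baseMs / 2);
      await sleep(delay);
    }
  }
}

// Demo: a call that fails twice with a transient error, then succeeds.
let calls = 0;
withRetry(
  async () => {
    calls++;
    if (calls < 3) throw new Error("503 Service Unavailable");
    return "ok";
  },
  (err) => err instanceof Error && err.message.includes("503"),
  3,
  1, // tiny base delay so the demo runs quickly
).then((v) => console.log(v, calls));
```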

Circuit Breaker Pattern

A circuit breaker prevents a failing tool from consuming resources on doomed requests. It has three states:

C

Closed

Normal operation. Tool calls execute as usual. Failure count is tracked.

O

Open

Failure threshold exceeded. All calls fail immediately with a cached error. No execution attempts.

H

Half-Open

After cooldown, one test call is allowed. If it succeeds, circuit closes. If it fails, circuit reopens.
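A minimal implementation of the three states, with thresholds matching the text (5 failures in a 60-second window trips it; 30-second cooldown). The injectable clock exists only to make the sketch testable; class and method names are ours.

```typescript
// Circuit breaker with closed / open / half-open states as described above.

type State = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: State = "closed";
  private failures: number[] = []; // failure timestamps (ms)
  private openedAt = 0;

  constructor(
    private threshold = 5,        // failures that trip the breaker
    private windowMs = 60_000,    // failure-counting window
    private cooldownMs = 30_000,  // how long to fail fast
    private now: () => number = Date.now, // injectable clock for tests
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === "open") {
      if (this.now() - this.openedAt < this.cooldownMs) {
        throw new Error("Circuit open — failing fast"); // no execution attempt
      }
      this.state = "half-open"; // cooldown elapsed: allow one test call
    }
    try {
      const result = await fn();
      this.state = "closed"; // success closes the circuit
      this.failures = [];
      return result;
    } catch (err) {
      const t = this.now();
      this.failures = this.failures.filter((f) => t - f < this.windowMs);
      this.failures.push(t);
      if (this.state === "half-open" || this.failures.length >= this.threshold) {
        this.state = "open"; // trip (or re-trip) the breaker
        this.openedAt = t;
      }
      throw err;
    }
  }
}
```

In practice the breaker wraps each external-service tool: `breaker.call(() => fetchOrders(params))`, one breaker instance per downstream dependency.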

Key Insight

The model cannot debug your infrastructure. When a tool fails, the model needs actionable information, not stack traces. Every error response should answer three questions: What happened? Can I retry? What should I do instead? If your errors answer these three questions, the model can recover from most failures without human intervention.

7

Security & Sandboxing

Safe tool execution in production

Tools give the model real-world agency — the ability to read files, query databases, send messages, and execute commands. This power requires a security model that assumes the model's outputs are untrusted user input.

The model can be manipulated via prompt injection, can hallucinate incorrect parameters, or can misunderstand ambiguous requests. Every tool interaction must pass through validation, sandboxing, and permission checks before execution.

Threat Model

critical

Prompt Injection via Tool Inputs

An attacker crafts user input that causes the model to generate malicious tool parameters — SQL injection, command injection, or path traversal.

Example

User: 'Search for '); DROP TABLE users; --'

Mitigation

Validate all tool inputs with Zod schemas. Use parameterized queries for SQL. Never pass model-generated strings directly to shell commands or eval().

critical

Data Exfiltration via Tool Results

A compromised or poorly designed tool returns sensitive internal data (API keys, session tokens, credentials) that the model then includes in its response to the user.

Example

Tool returns raw API response containing internal_api_key field

Mitigation

Filter tool outputs at the boundary. Define explicit response types that include only user-facing fields. Strip sensitive data before returning to the model.

high

Privilege Escalation

A tool runs with the application's full permissions. The model (or a prompt injection) exploits this to access resources the user should not have access to.

Example

File tool with no path restrictions reads /etc/passwd

Mitigation

Apply least privilege: read-only DB connections for read tools, scoped API keys, sandboxed file access. Create separate service accounts per tool category.

high

Denial of Service via Tool Abuse

The model enters an infinite retry loop, calls a tool thousands of times, or triggers expensive operations that consume resources or rack up API costs.

Example

Model retries a failing API call 500 times in a loop

Mitigation

Set per-tool rate limits, maximum retries (3-5), execution timeouts (5-30s), and per-session cost budgets. Use circuit breakers for external services.
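One of these mitigations as a sketch: a per-tool sliding-window rate limit. The limits and class name are illustrative; a real deployment would pair this with timeouts and per-session cost budgets.

```typescript
// Per-tool sliding-window rate limiter. Limits are illustrative; the
// injectable clock exists only to make the sketch testable.

class ToolRateLimiter {
  private calls = new Map<string, number[]>(); // tool → call timestamps (ms)

  constructor(
    private maxCalls = 20,
    private windowMs = 60_000,
    private now: () => number = Date.now,
  ) {}

  allow(toolName: string): boolean {
    const t = this.now();
    // Drop timestamps that have aged out of the window.
    const recent = (this.calls.get(toolName) ?? []).filter(
      (ts) => t - ts < this.windowMs,
    );
    if (recent.length >= this.maxCalls) return false; // over budget: deny
    recent.push(t);
    this.calls.set(toolName, recent);
    return true;
  }
}

const limiter = new ToolRateLimiter(20, 60_000);
console.log(limiter.allow("search_orders")); // first call within budget
```

The check runs before execution, so a denied call can come back to the model as a structured, non-retryable error rather than an expensive timeout.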

medium

Tool Confusion Attack

An attacker tricks the model into calling the wrong tool by carefully crafting input that matches one tool's description more closely than the intended tool.

Example

Input crafted to trigger delete_account instead of get_account_info

Mitigation

Require confirmation for destructive operations. Use distinct, non-overlapping tool descriptions. Implement undo/soft-delete instead of hard deletes.

Defense in Depth: Validation Layers

Each layer catches a different class of attack. All four should be applied to tools that interact with system resources.

1

Schema Validation (Zod)

Validate types, constraints, enums, and patterns on every parameter before execution. Catch malformed inputs at the gate.

const params = SearchSchema.parse(rawParams);
// Throws ZodError with field-level details
2

Input Sanitization

Escape or reject inputs that could be interpreted as commands. Prevent SQL injection, shell injection, and path traversal.

// Parameterized query — NOT string interpolation
const result = await db.query(
  "SELECT * FROM orders WHERE id = $1",
  [params.orderId]
);
3

Path Normalization

Resolve and validate file paths to prevent directory traversal. Reject paths outside the allowed directory.

const resolved = path.resolve(WORKSPACE, filePath);
if (!resolved.startsWith(WORKSPACE)) {
  throw new Error("Path traversal detected");
}
4

Command Allowlisting

Maintain an explicit list of allowed commands. Reject anything not on the list. Never use a blocklist — attackers will find what you missed.

const ALLOWED = new Set(["ls", "cat", "grep"]);
if (!ALLOWED.has(command)) {
  return { error: "Command not allowed" };
}

Permission Model

Classify every tool into a permission tier based on its potential impact. Each tier has different execution policies.

R

Read-Only

Safe to auto-execute. Cannot modify state. Uses read-only database connections and API keys with read scopes.

search_knowledge_base, get_user_profile, list_orders
W

Write with Confirmation

Creates or modifies data. Returns a preview and requires explicit confirmation before executing. Idempotent where possible.

create_ticket, update_status, add_comment
D

Destructive (Human Required)

Irreversible actions. Always requires human approval in the loop. Cannot be auto-confirmed. Logged with full audit trail.

delete_accountprocess_refundsend_mass_email
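The tier policy above can be sketched as a small gate that runs before any tool executes. This is a minimal illustration, not a fixed API: the tool names and tier assignments are examples, and the gate fails closed on anything unregistered.

```typescript
// Permission-gate sketch: map each tool to a tier, decide the execution
// policy before running it. Tool names and tiers are illustrative.
type Tier = "read" | "write" | "destructive";

const TOOL_TIERS: Record<string, Tier> = {
  search_knowledge_base: "read",
  create_ticket: "write",
  delete_account: "destructive",
};

type Decision =
  | { action: "execute" }                // read-only: run immediately
  | { action: "confirm" }                // write: preview + confirmation
  | { action: "require_human" }          // destructive: human approval
  | { action: "deny"; reason: string };  // unknown tool: fail closed

function gate(toolName: string): Decision {
  const tier = TOOL_TIERS[toolName];
  if (!tier) return { action: "deny", reason: `Unknown tool: ${toolName}` };
  if (tier === "read") return { action: "execute" };
  if (tier === "write") return { action: "confirm" };
  return { action: "require_human" };
}
```

Failing closed on unknown names also catches hallucinated tool calls before they reach any executor.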
Audit Logging (Required)

Every tool call should be logged in an append-only audit trail. This is essential for debugging, security investigations, cost tracking, and compliance.

timestamp: When the tool was called (ISO 8601)
tool_name: Which tool was invoked
parameters: Input parameters (sanitized — no secrets)
result_status: success | error | timeout | denied
duration_ms: How long the execution took
caller_id: Which user/session triggered the call
model_id: Which model made the tool call
cost_tokens: Tokens consumed for this tool interaction
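In TypeScript, the entry shape and a secret-scrubbing append might look like the following sketch. The redaction key list and the in-memory array (standing in for an append-only store) are assumptions for illustration.

```typescript
// Audit-entry shape mirroring the fields above, plus a redaction pass
// so secrets never reach the log. The sink here is an in-memory array
// standing in for an append-only store.
interface AuditEntry {
  timestamp: string; // ISO 8601
  tool_name: string;
  parameters: Record<string, unknown>; // sanitized — no secrets
  result_status: "success" | "error" | "timeout" | "denied";
  duration_ms: number;
  caller_id: string;
  model_id: string;
  cost_tokens: number;
}

const SECRET_KEYS = new Set(["api_key", "password", "token"]);

function redact(params: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(params).map(([k, v]) =>
      SECRET_KEYS.has(k) ? [k, "[REDACTED]"] : [k, v]
    )
  );
}

const auditLog: AuditEntry[] = [];

function logToolCall(entry: AuditEntry): void {
  auditLog.push({ ...entry, parameters: redact(entry.parameters) });
}
```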

Key Insight

Treat every tool call as if it came from an untrusted external user — because in a real sense, it did. The model's outputs are influenced by user input, training data, and potentially adversarial prompts. Validate, scope, sandbox, and audit every tool interaction. The security boundary is not between the user and the model — it is between the model and your tools.

8

Production Tool Architecture

Versioning, monitoring, and scaling tools

Production tool architecture goes beyond writing individual tools. It encompasses versioning (so tools can evolve without breaking clients), monitoring (so you know when tools degrade), scaling (so tools handle production load), and governance (so your tool registry stays manageable).

Versioning Strategy

Schema Versioning

When you change a tool's parameters (add required fields, remove fields, change types), you must version the schema. Clients that depend on the old schema will break otherwise.

Approach: Semantic versioning: major for breaking changes, minor for additions, patch for fixes. Include version in tool name or metadata.
Example: search_orders_v2 (adds date_range param) coexists with search_orders_v1 during migration

Deprecation Workflow

Before removing a tool, mark it as deprecated. Emit warnings in responses. Set a sunset date. Remove only after all clients have migrated.

Approach: Phase 1: Mark deprecated (still works, logs warnings). Phase 2: Return deprecation notice in output. Phase 3: Remove after sunset date.
Example: Tool returns: { data: {...}, _deprecated: 'Use search_orders_v2 instead. Sunset: 2025-03-01' }

Backward Compatibility

New optional parameters should have defaults. New response fields should be additive (never remove existing fields in a minor version).

Approach: Adding optional params is a minor change. Making params required or removing fields is a major change. Use z.default() liberally.
Example: Adding optional 'date_range' param with default('all') is backward-compatible
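Phase 2 of the deprecation workflow can be implemented as a thin wrapper around the legacy tool's output. The sunset date and replacement name below are illustrative.

```typescript
// Deprecation wrapper sketch: pass the old tool's data through and
// attach the notice shown above. SUNSET is an illustrative date.
const SUNSET = "2025-03-01";

function withDeprecation<T>(
  data: T,
  replacement: string
): { data: T; _deprecated: string } {
  return { data, _deprecated: `Use ${replacement} instead. Sunset: ${SUNSET}` };
}
```

Because the notice flows back into the model's context, the model itself can start steering calls toward the replacement tool during migration.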
Monitoring & Observability

Track these metrics per tool. Each category has different targets and alert thresholds.

Latency (alert on: p95 > 3s for 5 minutes)

p50 latency: < 500ms (median tool execution time)
p95 latency: < 2s (95th percentile — catches slow outliers)
p99 latency: < 5s (99th percentile — worst-case scenarios)

Error Rate (alert on: error rate > 5% for 2 minutes)

Error rate: < 1% (percentage of tool calls that fail)
Validation error rate: < 5% (model sending invalid params)
Timeout rate: < 0.5% (calls exceeding the timeout threshold)

Usage (alert on: cost/hour > 2x baseline)

Calls/minute: varies (tool invocation volume for capacity planning)
Tokens/call: < 500 (context tokens consumed per tool interaction)
Cost/call: < $0.01 (total cost including API calls and compute)
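The percentile targets above reduce to a simple computation over a window of recorded durations. This nearest-rank sketch (sample values are made up) is enough to evaluate the p95 alert rule; real deployments usually use a metrics backend instead.

```typescript
// Nearest-rank percentile over recorded tool durations, used to check
// the "p95 > 3s" alert rule. Sample durations are illustrative.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, rank)];
}

const durationsMs = [120, 180, 210, 250, 300, 340, 420, 510, 900, 3200];
const p95 = percentile(durationsMs, 95);
const shouldAlert = p95 > 3000; // alert rule: p95 > 3s
```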

Deployment Patterns

Sidecar Pattern

MCP server runs alongside the main application as a separate process. Communicates via stdio or local network. Simple to deploy and manage.

Best for: Local development, desktop applications, single-tenant deployments
+Simple setup
+No network overhead
+Process isolation
-Not horizontally scalable
-Tied to host lifecycle
Service Pattern

MCP server deployed as an independent service with its own lifecycle, scaling, and monitoring. Accessed via SSE transport over HTTP.

Best for: Multi-tenant platforms, cloud deployments, shared tool infrastructure
+Independent scaling
+Shared across clients
+Standard deployment
-Network latency
-Auth complexity
-More infrastructure
Gateway Pattern

A central MCP gateway aggregates multiple MCP servers behind a single endpoint. Handles routing, auth, rate limiting, and tool selection centrally.

Best for: Enterprise deployments with many tool servers, multi-team environments
+Centralized governance
+Unified auth
+Tool selection at gateway
-Single point of failure
-Added latency
-Complex to operate
Scaling Considerations

Stateless Tool Servers

Design tool servers to be stateless where possible. Store session state externally (Redis, database). This enables horizontal scaling — add more instances behind a load balancer.

Connection Pooling

Tools that connect to databases or APIs should use connection pools. A tool server handling 100 concurrent requests should not open 100 database connections.

Caching Tool Results

Cache frequently requested, slowly changing data. A tool that fetches company policies does not need to hit the database on every call. Use TTL-based caching with cache invalidation.
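A TTL cache for a tool like that can be a few lines. This sketch takes the clock as a parameter for testability; a production version would add size bounds and explicit invalidation.

```typescript
// TTL cache sketch for slow-changing tool results. Expired entries are
// evicted lazily on read; the clock is injectable for testing.
interface CacheEntry<T> {
  value: T;
  expiresAt: number;
}

class TtlCache<T> {
  private store = new Map<string, CacheEntry<T>>();
  constructor(private ttlMs: number) {}

  get(key: string, now: number = Date.now()): T | undefined {
    const entry = this.store.get(key);
    if (!entry || entry.expiresAt <= now) {
      this.store.delete(key); // evict stale entry
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T, now: number = Date.now()): void {
    this.store.set(key, { value, expiresAt: now + this.ttlMs });
  }
}
```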

Rate Limiting by Tool

Different tools have different cost profiles. A search tool is cheap; a payment processing tool is expensive and rate-sensitive. Apply per-tool rate limits that match the underlying service constraints.

Graceful Degradation

When a tool server is overloaded, shed load gracefully. Return cached results, queue requests, or return a 'service busy' error with retry guidance — never drop requests silently.

Governance: Tool Registry Design

A tool registry is the source of truth for all available tools. Each entry should include metadata for discovery, governance, and operations.

name: Unique tool identifier (e.g. search_orders)
version: Semantic version of the tool schema (e.g. 2.1.0)
owner: Team or individual responsible (e.g. commerce-team)
status: Lifecycle state (active | deprecated | sunset)
description: Model-facing description for tool selection (e.g. "Search orders by email, date, or status...")
schema: Zod schema for input validation (e.g. z.object({ query: z.string(), ... }))
metrics: Real-time usage and performance data (e.g. { calls_24h: 15420, p95_ms: 340, error_rate: 0.2% })
dependencies: External services this tool requires (e.g. ['postgres', 'stripe-api'])
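As a type, a registry entry might look like the sketch below. The Zod schema field is omitted to keep it dependency-free, and the example values mirror the table above.

```typescript
// Registry-entry sketch; field names mirror the table above. The Zod
// schema field is omitted here to stay dependency-free.
interface ToolRegistryEntry {
  name: string;
  version: string; // semver of the tool schema
  owner: string;
  status: "active" | "deprecated" | "sunset";
  description: string; // model-facing, used for tool selection
  dependencies: string[];
  metrics?: { calls_24h: number; p95_ms: number; error_rate: number };
}

const searchOrders: ToolRegistryEntry = {
  name: "search_orders",
  version: "2.1.0",
  owner: "commerce-team",
  status: "active",
  description: "Search orders by email, date, or status.",
  dependencies: ["postgres", "stripe-api"],
  metrics: { calls_24h: 15420, p95_ms: 340, error_rate: 0.002 },
};
```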

Key Insight

Tools are infrastructure, not just code. They need the same operational discipline as any production service: versioning, monitoring, alerting, capacity planning, and governance. A tool that works in demos but has no monitoring, no versioning, and no ownership is technical debt waiting to become an incident.

9

Interactive Examples

See tool use patterns in action with live code

Each example shows a bad pattern alongside its production-ready fix; compare the two to understand the difference.

Tool Design: Tool Naming & Descriptions

Clear names and descriptions are critical for tool selection

Vague names, no descriptions
// BAD: Ambiguous tool names with no descriptions
const tools = [
  {
    name: "do_stuff",
    description: "Does stuff with the database",
    parameters: {
      type: "object",
      properties: {
        data: { type: "string" },
      },
    },
  },
  {
    name: "handle_request",
    description: "Handles a request",
    parameters: {
      type: "object",
      properties: {
        input: { type: "string" },
      },
    },
  },
];

// Model has no idea which tool to use or what "data" means
const response = await llm.generate({
  tools,
  messages: [{ role: "user", content: "Look up order #1234" }],
});

Why this fails

Vague names like 'do_stuff' and generic descriptions give the model no signal for when to use which tool. Parameter names like 'data' and 'input' don't communicate expected formats or constraints.
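For contrast, a production-ready version of the same registry might look like this. The tool names, parameter names, and enum values are illustrative, but the pattern — verb_noun names, model-facing descriptions, constrained typed parameters — is the fix.

```typescript
// GOOD: specific verb_noun names, descriptions that say when to use
// each tool, and typed, constrained parameters.
const tools = [
  {
    name: "get_order",
    description:
      "Fetch a single order by its numeric ID. Use when the user references a specific order number.",
    parameters: {
      type: "object",
      properties: {
        order_id: {
          type: "integer",
          description: "Numeric order ID, e.g. 1234",
        },
      },
      required: ["order_id"],
    },
  },
  {
    name: "search_orders",
    description:
      "Search orders by customer email, status, or date range. Use when no order ID is available.",
    parameters: {
      type: "object",
      properties: {
        email: { type: "string", description: "Customer email address" },
        status: {
          type: "string",
          enum: ["pending", "shipped", "delivered"],
        },
      },
    },
  },
];
```

Given "Look up order #1234", the model now has an unambiguous match: get_order with order_id 1234.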

All Examples Quick Reference

Tool Design

Tool Naming & Descriptions

Clear names and descriptions are critical for tool selection

Token Efficiency

Tool Output Design

What your tools return goes directly into the context window

Reliability

Schema Validation with Zod

Validate tool inputs and outputs at runtime

MCP

MCP Server Implementation

Building an MCP server with the TypeScript SDK

Reliability

Tool Error Handling

Design errors that help the model recover

Tool Selection

Dynamic Tool Selection

Load only the tools relevant to the current task

Security

Tool Execution Sandboxing

Isolate tool execution to prevent damage

10

Anti-Patterns & Failure Modes

Tool soup, leaky tools, and how to avoid them

Knowing what not to do is as important as knowing what to do. These are the most common failure modes in tool-using AI systems, drawn from production experience and research on function calling accuracy.

Critical: Tool Sprawl

Registering dozens of tools with overlapping functionality, forcing the model to choose between near-identical options.

Cause

Adding tools incrementally without auditing for overlap. Three tools that all 'search records' — searchDB, querySQL, findRecords — each with slightly different schemas.

Symptom

Model picks the wrong tool 40%+ of the time. Wastes tokens deliberating between similar options. Tool call chains become unpredictable. Research shows accuracy drops from 95% to 30% going from 10 to 50 tools.

Fix

Audit your tool registry for overlaps. Merge similar tools into one with optional parameters. Keep under 15 tools per task context. Use dynamic tool selection (RAG over tool descriptions) for large registries.

Critical: Schema Ambiguity

Tool parameters with vague names, missing descriptions, or no type constraints, leaving the model to guess the correct format.

Cause

Lazy parameter naming (data, input, params) with no descriptions. Missing required/optional distinctions. No examples of valid values.

Symptom

Model sends wrong parameter types (string instead of number), invents parameter names, or formats values incorrectly. Tool calls fail silently or produce unexpected results.

Fix

Use Zod or JSON Schema for every parameter. Add descriptions with examples. Mark required fields. Add patterns/enums for constrained values. If a human can't figure out the parameter format from the schema alone, the model can't either.

High: Result Bloat

Tools return entire API responses — metadata, pagination, internal IDs, audit logs — consuming thousands of context tokens with irrelevant data.

Cause

Passing through raw API responses without filtering. Returning full database rows instead of relevant fields. No truncation on list results.

Symptom

Context window fills up fast. Model gets confused by irrelevant fields. Important information is buried in noise. Token costs increase dramatically with no quality improvement.

Fix

Filter tool outputs to only fields the model needs. Truncate lists (return top 5, not all 500). Flatten nested structures. Remove internal IDs, timestamps, and metadata the model won't use. Aim for 90%+ reduction in raw response size.
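The fix is a projection layer between the raw response and the model. The raw shape and the kept fields below are illustrative; the point is that only model-relevant fields survive, and lists are truncated.

```typescript
// Projection sketch: strip internal fields from a raw API response
// before it enters the context window. Field names are illustrative.
interface RawOrder {
  id: string;
  status: string;
  total: number;
  internal_ref: string; // internal — model never needs this
  audit_log: unknown[]; // internal — model never needs this
  created_at: string;
  customer: { email: string; crm_id: string };
}

function toModelView(order: RawOrder) {
  // Keep only what the model acts on; flatten nesting.
  return {
    id: order.id,
    status: order.status,
    total: order.total,
    email: order.customer.email,
  };
}

function topN<T>(items: T[], n = 5): T[] {
  return items.slice(0, n); // return top 5, not all 500
}
```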

High: Silent Failures

Tools fail without returning actionable error information, leaving the model unable to recover or explain what went wrong.

Cause

Catching exceptions and returning null, empty strings, or generic 'error occurred' messages. Swallowing errors to avoid 'messy' responses.

Symptom

Model hallucinates results when it receives null. Makes up data to fill gaps. Retries the same failing call indefinitely. User gets confident-sounding wrong answers because the model doesn't know the tool failed.

Fix

Return structured error objects with: error code, human-readable message, whether the error is recoverable, and a specific suggestion for what to try next. Never return null — always return an explicit error or success.
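A result type makes the "never return null" rule enforceable at compile time. This sketch uses a discriminated union; the error fields match the fix described above, and the example error is illustrative.

```typescript
// Structured result sketch: every tool returns explicit success or an
// error the model can act on — never null. Field names follow the fix above.
type ToolResult<T> =
  | { ok: true; data: T }
  | {
      ok: false;
      error: {
        code: string;
        message: string;
        recoverable: boolean;
        suggestion: string; // what the model should try next
      };
    };

function tableNotFound(table: string, available: string[]): ToolResult<never> {
  return {
    ok: false,
    error: {
      code: "TABLE_NOT_FOUND",
      message: `Table '${table}' does not exist`,
      recoverable: true,
      suggestion: `Try one of: ${available.join(", ")}`,
    },
  };
}
```

The suggestion field is what turns a dead end into a recoverable step: the model reads the available tables and retries with a valid one.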

Critical: Permission Creep

Tools operate with the application's full permissions — database admin, file system access, network access — instead of scoped, minimal privileges.

Cause

Using the same database connection, API keys, and service accounts for tool execution as for the main application. No sandboxing layer between the model and system resources.

Symptom

A single prompt injection or model hallucination can delete data, read sensitive files, or exfiltrate information. Security audit reveals tools have access to resources they never need.

Fix

Apply the principle of least privilege: create read-only database views for read tools, separate API keys with minimal scopes, sandbox file system access to a working directory, set resource limits (timeout, memory), and audit tool permissions regularly.

High: Hallucinated Tool Calls

The model invents tool names, parameters, or call patterns that don't exist in the provided tool definitions.

Cause

Tool names that are too similar to common patterns the model has seen in training data. Insufficient grounding in the tool schema. No validation layer between the model's output and tool execution.

Symptom

Model calls tools that don't exist (e.g., 'send_message' when only 'send_email' is defined). Invents parameters not in the schema. Attempts multi-step tool patterns it saw in training but aren't supported.

Fix

Validate every tool call against the registered schema before execution. Return clear 'tool not found' errors with the list of available tools. Use unique, specific tool names that are unlikely to be confused with training data patterns.

11

Best Practices Checklist

Production-ready guidelines for tool use and MCP

Production-ready guidelines for building tools and MCP servers, distilled from Anthropic documentation, the MCP specification, and real-world agent engineering experience.

Tool Design Principles

One tool, one action

Each tool should do exactly one thing. A 'manage_user' tool that creates, updates, and deletes is harder for the model to use correctly than separate create_user, update_user, and delete_user tools.

Use verb_noun naming

Name tools as verb_noun: search_orders, create_ticket, get_user. This instantly communicates the action and target. Avoid generic names like 'handle' or 'process'.

Write descriptions for the model, not humans

Tool descriptions are part of the model's context. Include when to use the tool, what it returns, and edge cases. 'Searches orders by email or date range. Use when user wants to find orders but doesn't have an order ID.'

Validate with Zod at the boundary

Parse and validate every tool input with Zod before execution. This catches type errors, missing fields, and out-of-range values before they cause downstream failures.
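To show what that boundary check does, here is a dependency-free sketch of the same logic — in production this is exactly what a Zod call like SearchSchema.parse(rawParams) gives you, with better error messages. The parameter names and limits are illustrative.

```typescript
// Dependency-free sketch of boundary validation. In production, use
// Zod: SearchSchema.parse(rawParams) does this with field-level errors.
interface SearchParams {
  query: string;
  limit: number;
}

function parseSearchParams(raw: unknown): SearchParams {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("params must be an object");
  }
  const { query, limit = 10 } = raw as Record<string, unknown>;
  if (typeof query !== "string" || query.length === 0) {
    throw new Error("query: non-empty string required");
  }
  if (
    typeof limit !== "number" ||
    !Number.isInteger(limit) ||
    limit < 1 ||
    limit > 50
  ) {
    throw new Error("limit: integer between 1 and 50 required");
  }
  return { query, limit };
}
```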

MCP Best Practices

Keep servers focused and composable

Build small, single-purpose MCP servers (one for GitHub, one for Jira, one for databases) rather than monolithic servers. Clients can compose multiple servers as needed.

Version your server protocol

Include version in your server's capabilities. When you add or change tools, increment the version so clients can detect incompatibilities.

Use resources for read-heavy data

MCP resources (file://, db://) are better than tools for data the client reads frequently. Resources can be cached and subscribed to, while tool calls always execute.

Test with the MCP Inspector

Use the official MCP Inspector tool to test your server independently before connecting to a client. It validates protocol compliance, schema correctness, and error handling.

Error Handling & Reliability

Never return null — always return structured results

Return explicit success or error objects from every tool. A null return causes the model to hallucinate results. An error object tells the model exactly what went wrong and how to recover.

Include recovery suggestions in errors

When a tool fails, return what the model should try next: 'Table not found. Available tables: users, orders, products.' This turns errors into learning opportunities.

Implement idempotent retries

Design write operations to be idempotent so the model can safely retry on failure. Use unique request IDs to prevent duplicate actions.

Set timeouts on every tool call

A tool that hangs blocks the entire agent loop. Set aggressive timeouts (5-30 seconds) and return a timeout error the model can handle, rather than waiting indefinitely.
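A minimal timeout wrapper races the tool against a deadline and returns a structured timeout error instead of hanging the loop. The result shape is illustrative; production code should also clear the timer and cancel the underlying work where possible.

```typescript
// Timeout wrapper sketch: race the tool against a deadline; on timeout,
// return a structured error the model can handle instead of hanging.
type TimedResult<T> = { ok: true; data: T } | { ok: false; error: string };

function withTimeout<T>(
  run: () => Promise<T>,
  ms: number
): Promise<TimedResult<T>> {
  const deadline = new Promise<TimedResult<T>>((resolve) =>
    // NOTE: production code should clearTimeout once the attempt settles
    setTimeout(
      () =>
        resolve({
          ok: false,
          error: `Timed out after ${ms}ms. Retry or narrow the request.`,
        }),
      ms
    )
  );
  const attempt = run().then((data): TimedResult<T> => ({ ok: true, data }));
  return Promise.race([attempt, deadline]);
}
```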

Security & Production

Apply least privilege to every tool

Tools should have the minimum permissions needed. Read-only tools get read-only database access. File tools are restricted to a working directory. API keys have minimal scopes.

Validate and sanitize all tool inputs

Never pass model-generated strings directly to system commands, SQL queries, or file paths. Validate with schemas, use parameterized queries, and sanitize file paths.

Log every tool call for audit

Record the tool name, input parameters, output, duration, and caller for every tool execution. This is essential for debugging, cost tracking, and security auditing.

Require human approval for destructive actions

Tools that delete data, send messages, make payments, or modify infrastructure should require explicit human confirmation before execution. Never auto-approve irreversible actions.

Tool Selection & Routing

Keep active tool sets under 15 per request

Research shows tool selection accuracy drops significantly beyond 20 tools. Use dynamic selection via RAG to keep the active set small while supporting large registries.

Index tool descriptions for semantic search

Embed tool names and descriptions in a vector database. When a request arrives, find the top-k most relevant tools. This scales to hundreds of tools while keeping per-request context small.
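The ranking step reduces to cosine similarity over the stored vectors. In this sketch the 3-dimensional embeddings are toy stand-ins (real systems would embed descriptions with an embedding model and store them in a vector database), but the top-k selection logic is the same.

```typescript
// Top-k tool selection sketch: rank indexed tools by cosine similarity
// to the query embedding. The 3-d vectors are toy stand-ins for real
// embedding-model output.
interface IndexedTool {
  name: string;
  embedding: number[];
}

function cosine(a: number[], b: number[]): number {
  const dot = a.reduce((s, x, i) => s + x * b[i], 0);
  const norm = (v: number[]) => Math.sqrt(v.reduce((s, x) => s + x * x, 0));
  return dot / (norm(a) * norm(b));
}

function selectTools(
  queryEmbedding: number[],
  index: IndexedTool[],
  k: number
): string[] {
  return [...index]
    .sort(
      (x, y) =>
        cosine(queryEmbedding, y.embedding) -
        cosine(queryEmbedding, x.embedding)
    )
    .slice(0, k)
    .map((t) => t.name);
}
```

Only the selected k tool definitions are injected into the request, keeping per-request context small while the full registry stays large.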

Use tool categories for coarse routing

Group tools by domain (billing, search, admin). Route to a category first based on intent classification, then do fine-grained tool selection within that category.

Build evals for tool selection accuracy

Create test suites that verify the model picks the correct tool for known queries. Track selection accuracy as a metric. Regression test when adding new tools to catch description conflicts.

The Guiding Principle

Every tool is a contract between the model and the real world. The model promises to send valid parameters; the tool promises to return useful, safe results. When either side breaks the contract, the agent fails. Design tools that make the contract easy to uphold: clear schemas, structured errors, minimal outputs, and defense in depth.

— Anthropic, Tool Use Documentation

12

Resources & Further Reading

Docs, specs, repos, and guides

Essential documentation, specifications, repositories, and guides for mastering tool use and the Model Context Protocol.

Guide: Anthropic

Tool Use with Claude — Anthropic Documentation

Official documentation on Claude's tool use capabilities including function calling, parallel tool calls, and streaming tool results.

Guide: MCP

Model Context Protocol — Official Documentation

The official MCP specification, quickstart guides, and architecture overview. The definitive reference for building MCP servers and clients.

Repo: GitHub

Model Context Protocol TypeScript SDK

Official TypeScript SDK for building MCP servers and clients. Includes server framework, transport implementations, and examples.

Blog: Anthropic

Introducing the Model Context Protocol

Anthropic's announcement of MCP: why it was created, the problem it solves, and the vision for a universal tool standard.

Repo: GitHub

MCP Servers — Community Directory

Official and community-built MCP servers for GitHub, Slack, PostgreSQL, filesystem, and dozens more integrations.

Paper: arXiv / UC Berkeley

Gorilla: Large Language Model Connected with Massive APIs

Research on training LLMs to accurately use tools from large API registries. Key findings on tool selection accuracy degradation with scale.

Paper: arXiv / Meta

Toolformer: Language Models Can Teach Themselves to Use Tools

Foundational paper on how language models learn to use tools. Demonstrates self-supervised approaches to tool use learning.

Blog: Anthropic

Building Effective Agents — Anthropic

Comprehensive guide to building production agents, including tool design principles, orchestration patterns, and common failure modes.

Guide: OpenAI

Function Calling — OpenAI Documentation

OpenAI's approach to function calling, useful for understanding cross-platform tool design patterns and the JSON Schema specification.

Repo: GitHub

MCP Inspector — Testing Tool

Official tool for testing MCP servers interactively. Validates protocol compliance, tests tool schemas, and debugs server behavior.

The MCP Ecosystem is Growing Fast

MCP was released in November 2024 and has seen rapid adoption. As of early 2025, Claude Desktop, Cursor, Zed, Windsurf, and Cline all support MCP natively. The community has built 100+ MCP servers for services like GitHub, Slack, PostgreSQL, Notion, Linear, and many more. Check the MCP Servers Repository above for the latest community contributions.