Tool Use & MCP Academy
Master tool design for AI agents and the Model Context Protocol. From function calling fundamentals to production MCP servers — give your agents the tools they need to act in the real world.
Tool Use Fundamentals
How LLMs call functions and use results
Tool use (also called function calling) is the mechanism that lets LLMs interact with the outside world. Instead of only generating text, the model can output structured requests to call functions you define — fetching real-time data, executing actions, and accessing systems the model has no direct connection to.
This is what separates a chatbot from an agent. Without tools, a model can only reason over its training data and the text in its context window. With tools, it can check live databases, call APIs, read files, run code, and take actions in the real world.
Every tool interaction follows a 4-step lifecycle. Understanding this loop is essential — it determines where you add validation, error handling, and security controls.
Tool Definition
You define tools with names, descriptions, and JSON Schema parameters. These definitions are injected into the model's context alongside the system prompt.
Each tool definition consumes context tokens. A tool with a complex schema can use 200-500 tokens just for its definition.
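To make the shape concrete, here is a sketch of a tool definition for a hypothetical get_weather tool, in the JSON Schema style most function-calling APIs accept (the exact field names, such as input_schema, vary by provider):

```typescript
// Hypothetical tool definition for a weather lookup. The schema is
// serialized into the model's context, so every field costs tokens.
const getWeatherTool = {
  name: "get_weather",
  description:
    "Gets current weather for a city. Use when the user asks about " +
    "present conditions; do not use for forecasts. Returns temperature " +
    "and a short condition string.",
  input_schema: {
    type: "object",
    properties: {
      city: { type: "string", description: "City name, e.g. 'Berlin'" },
      units: {
        type: "string",
        enum: ["celsius", "fahrenheit"],
        description: "Temperature units (default: celsius)",
      },
    },
    required: ["city"],
  },
};

// Rough rule of thumb: about 4 characters per token when serialized.
const approxTokens = Math.ceil(JSON.stringify(getWeatherTool).length / 4);
```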
Model Decides to Call a Tool
Based on the user's message and available tools, the model outputs a structured tool call instead of a text response. It selects the tool name and generates parameters.
The model returns a special "tool_use" content block with the tool name and arguments as JSON. It can also include text reasoning before the tool call.
Your Code Executes the Tool
Your application receives the tool call, validates the parameters, executes the underlying function (API call, database query, etc.), and captures the result.
This is where validation, sandboxing, error handling, and logging happen. The model never executes code directly — your application is always in the loop.
Result Injected Back into Context
The tool result is sent back to the model as a "tool_result" message. The model then generates its final response using both the original context and the tool output.
Tool results consume context tokens. A large result (raw API response) can waste thousands of tokens. Keep results minimal and structured.
When a task requires multiple independent pieces of information, the model can emit multiple tool calls in a single response. Your application executes them concurrently and returns all results at once.
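The parallel-execution step can be sketched as follows. The tool names, handler bodies, and the tool_use_id field linking each result to its call are illustrative; the key point is that independent calls run through Promise.all rather than one at a time:

```typescript
// Sketch: the model emits several tool_use blocks in one response;
// we execute them concurrently and return all results in one turn.
type ToolCall = { id: string; name: string; input: Record<string, unknown> };

// Hypothetical handlers standing in for real API calls.
const handlers: Record<string, (input: Record<string, unknown>) => Promise<string>> = {
  get_weather: async (i) => `Sunny in ${i.city}`,
  get_time: async (i) => `12:00 in ${i.city}`,
};

async function executeToolCalls(calls: ToolCall[]) {
  // Independent calls run concurrently, not sequentially.
  return Promise.all(
    calls.map(async (call) => ({
      type: "tool_result",
      tool_use_id: call.id, // ties each result back to its originating call
      content: await handlers[call.name](call.input),
    })),
  );
}
```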
Key Principles
The Model Never Executes Code
Tool use is structured output — the model generates JSON describing what to call and with what parameters. Your application handles all execution. This is a fundamental security boundary.
Tool Calls Are Another Turn in the Conversation
A tool call creates a new message pair in the conversation: the assistant's tool_use message and the tool_result response. Both consume context tokens and persist in history.
The Model Sees Tool Definitions as Context
Every tool definition is serialized into the context window alongside the system prompt. 20 tools with complex schemas can consume 5,000-10,000 tokens before any conversation happens.
Tool Results Replace the Model's Knowledge
When a tool returns data, the model treats it as ground truth. This is powerful (real-time data) but dangerous (if the tool returns wrong data, the model will confidently present it).
Key Insight
Tool use is not code execution — it is structured output. The model generates JSON that describes a desired action, and your application decides whether and how to execute it. This distinction is the foundation of safe, reliable agent systems. You are always in the loop.
Designing Good Tools
Clear names, schemas, and descriptions
Tool design is API design for AI models. The same principles that make APIs developer-friendly — clear naming, good documentation, typed schemas — make tools model-friendly. But there are unique considerations: models read descriptions literally, can't ask clarifying questions, and will creatively misuse ambiguous interfaces.
Use verb_noun format
Good: search_orders, create_ticket, get_user
Bad: handle_data, process, doStuff
The verb tells the model what action to take. The noun tells it what entity is affected. Together they communicate intent unambiguously.
Be specific, not generic
Good: list_open_tickets, get_order_by_id
Bad: get_items, fetch_data, query
Generic names force the model to rely on descriptions alone. Specific names provide instant signal for tool selection.
Avoid overlapping names
Good: search_orders (by query), get_order (by ID)
Bad: find_orders, search_orders, lookup_orders
Overlapping names cause the model to pick randomly between near-identical tools. Each tool name should be clearly distinguishable.
Use consistent conventions
Good: get_ (read), create_ (write), update_ (modify), delete_ (remove)
Bad: mixing get_, fetch_, retrieve_, find_, lookup_ for reads
Consistent prefixes create a predictable vocabulary. The model learns that get_ always means read, create_ always means write.
Tool descriptions are the model's only guide for choosing and using your tools. A good description has these parts:
What it does
“Retrieves order details including items, status, and shipping info.”
When to use it
“Use when the user asks about a specific order and has an order ID.”
When NOT to use it
“Do not use for searching orders — use search_orders instead.”
What it returns
“Returns order status, item list, shipping address, and payment info.”
Edge cases
“Returns an error if the order ID is invalid or the order was deleted.”
Parameter Design
Add descriptions to every parameter
Use enums for constrained values
Set bounds on numbers and arrays
Mark required vs optional clearly
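The four parameter rules above can be applied in one schema. This is a hypothetical search_orders schema in plain JSON Schema (a Zod definition would express the same constraints):

```typescript
// Every parameter has a description; status uses an enum; limit has
// numeric bounds; the required list makes optionality explicit.
const searchOrdersSchema = {
  type: "object",
  properties: {
    query: {
      type: "string",
      description: "Free-text search over order contents, e.g. 'red shoes'",
    },
    status: {
      type: "string",
      enum: ["open", "shipped", "delivered", "cancelled"],
      description: "Filter results by order status",
    },
    limit: {
      type: "integer",
      minimum: 1,
      maximum: 50,
      description: "Maximum number of results to return (default 10)",
    },
  },
  required: ["query"], // status and limit are optional
};
```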
Each tool should represent one logical action at the right level of abstraction. Too fine forces unnecessary orchestration. Too coarse creates a confusing mega-tool.
Too fine: open_db_connection, write_sql_query, execute_query, parse_results, close_connection
Forces the model to orchestrate low-level steps it shouldn't need to know about. 5 tool calls instead of 1.
Right level: search_orders(query, filters) - handles connection, query, and parsing internally
One tool call, one logical action. Internal complexity is hidden behind a clean interface.
Too coarse: manage_database(action, table, data, query, filters, sort, limit, offset, join, ...)
Swiss-army-knife tool with 15 parameters. The model has to figure out which combination to use. Errors are hard to diagnose.
The Output Rule
Tool outputs go back into the context window. Every unnecessary field, every raw API dump, every untruncated list competes for the model's attention and eats your token budget. Return only what the model needs to formulate its response — typically 10-20% of the raw data. A well-designed tool output is the highest-leverage context optimization you can make.
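A minimal sketch of the output rule, assuming a hypothetical raw order record: keep only the user-facing fields, truncate the list, and flag the truncation so the model knows data was omitted:

```typescript
// Hypothetical raw shape from an internal API.
type RawOrder = {
  id: string; status: string; total: number;
  internal_ref: string; audit_log: string[]; warehouse_meta: object;
};

// Shrink the raw response to what the model needs to answer.
function toToolOutput(orders: RawOrder[], max = 5) {
  return {
    count: orders.length,
    orders: orders.slice(0, max).map((o) => ({
      id: o.id, status: o.status, total: o.total, // user-facing fields only
    })),
    truncated: orders.length > max, // tell the model the list was cut
  };
}
```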
Model Context Protocol (MCP)
The standard for agent-tool communication
The Model Context Protocol (MCP) is an open standard that defines how AI applications connect to external tools and data sources. Released by Anthropic in November 2024, MCP replaces the fragmented landscape of custom tool integrations with a single, universal protocol.
Think of it like USB-C for AI: before MCP, every AI app needed custom connectors for every tool. With MCP, you build one server and it works with every MCP-compatible client — Claude Desktop, Cursor, Zed, Windsurf, and any custom application.
“MCP is an open protocol that standardizes how applications provide context to LLMs. Think of MCP like a USB-C port for AI applications.”
David Soria Parra
Creator of MCP, Anthropic
“By giving Claude access to tools, you can extend its capabilities far beyond text generation. Tools let Claude interact with external systems, fetch real-time data, and take actions in the world.”
Anthropic
Tool Use Documentation
“The quality of your tools defines the ceiling of your agent's capabilities. A model can reason perfectly and still fail if its tools are poorly designed.”
Alex Albert
Prompt Engineering Lead, Anthropic
N x M Integration Problem
Every AI application needed custom code for every data source. 5 AI apps and 5 tools meant 25 separate integrations to build and maintain.
No Discovery Mechanism
Tools were hardcoded. There was no way for a client to ask a server 'What tools do you have?' — you had to know the API in advance.
Inconsistent Schemas
Every tool server defined parameters differently. No standard for describing tool inputs, outputs, or error formats.
No Capability Negotiation
Clients couldn't check if a server supported the features they needed. Version mismatches caused silent failures.
Architecture
MCP uses a client-server architecture with three layers: hosts that run AI applications, clients that manage protocol connections, and servers that expose capabilities.
MCP Hosts
Applications that want to access tools and data via MCP. Examples: Claude Desktop, IDE extensions, custom AI applications. The host creates and manages MCP client instances.
Claude Desktop, Cursor, Zed, Windsurf, custom apps
MCP Clients
Protocol clients that maintain 1:1 connections with MCP servers. Each client connects to one server and handles the JSON-RPC communication protocol. Created by hosts.
Built into the host — one client per server connection
MCP Servers
Lightweight programs that expose tools, resources, and prompts through the standard MCP protocol. Each server typically wraps one data source or service.
GitHub server, PostgreSQL server, filesystem server, Slack server
MCP servers expose three types of capabilities. Each has a different control flow — understanding who initiates each is key.
Tools
Model-initiated. Functions the model can call. Each tool has a name, description, and parameter schema. The server defines them; the model invokes them through the client.
Examples: search_issues, run_query, send_message
Resources
Application-initiated. Data sources the client can read, like files or database records. Resources are identified by URIs and can be watched for changes. Better than tools for frequently accessed read-only data.
Examples: file:///config.json, db://users/123, git://repo/main
Prompts
User-initiated. Reusable prompt templates that the server provides. Useful for standardized interactions — the server knows the best way to phrase requests for its domain.
Examples: summarize_pr, explain_error, generate_migration
Initialize
Client sends 'initialize' with its protocol version and capabilities. Server responds with its own version, capabilities, and server info. Both sides learn what the other supports.
Capability Discovery
Client queries available tools (tools/list), resources (resources/list), and prompts (prompts/list). Server returns schemas for each. Client now knows everything the server offers.
Operation
Client calls tools (tools/call), reads resources (resources/read), or renders prompts (prompts/get). Server executes and returns results. Notifications flow both ways for events like resource changes.
Shutdown
Either side can close the connection cleanly. The transport layer (stdio pipe or SSE connection) is closed gracefully.
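The lifecycle above runs over JSON-RPC 2.0. The sketch below shows the rough shape of the first client messages; the field values are illustrative, and the exact protocol version string and capability structure should be taken from the MCP specification:

```typescript
// Step 1: handshake — versions and capabilities are exchanged.
const initializeRequest = {
  jsonrpc: "2.0",
  id: 1,
  method: "initialize",
  params: {
    protocolVersion: "2024-11-05", // illustrative version string
    capabilities: {},
    clientInfo: { name: "example-client", version: "1.0.0" },
  },
};

// Step 2: discovery — ask the server what it offers.
const toolsListRequest = { jsonrpc: "2.0", id: 2, method: "tools/list" };

// Step 3: operation — invoke a discovered tool by name.
const toolCallRequest = {
  jsonrpc: "2.0",
  id: 3,
  method: "tools/call",
  params: { name: "search_issues", arguments: { query: "crash on startup" } },
};
```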
Transport Layers
stdio (Standard I/O)
Communication over stdin/stdout. The client launches the server as a child process. Simple, secure (no network exposure), and the default for local tools.
SSE (Server-Sent Events)
HTTP-based transport using Server-Sent Events for server-to-client messages and HTTP POST for client-to-server messages. Enables remote servers.
Key Insight
MCP turns the N x M integration problem into N + M. Instead of every AI app building custom connectors for every tool, tool authors build one MCP server and app developers add one MCP client. A new tool instantly works with every MCP-compatible application.
Building MCP Servers
Create tools any MCP client can use
Building an MCP server means creating a program that exposes tools, resources, and prompts through the standard MCP protocol. The official TypeScript SDK handles protocol details — you focus on the tools themselves.
An MCP server is typically a small, focused program: 100-500 lines for a well-scoped server. It wraps a single data source or service (GitHub, a database, Slack) and makes it available to any MCP client.
Building a Server, Step by Step
Create an McpServer instance with a name and version. The name helps clients identify your server; the version enables capability negotiation.
server.tool() registers a tool with name, description, Zod schema for parameters, and an async handler. The SDK automatically validates inputs against the schema. Return content as an array of typed content blocks.
Resources expose data the client can read without the model having to call a tool. URI-based addressing (github://repo/README.md) makes resources discoverable and cacheable.
StdioServerTransport communicates over stdin/stdout — the client launches your server as a child process. For remote access, use SSEServerTransport over HTTP. The server logic is identical regardless of transport.
Add your server to Claude Desktop by editing the configuration file. Claude will automatically launch the server and discover its tools.
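The configuration step above looks roughly like this: a minimal claude_desktop_config.json entry, assuming a Node-based server at a hypothetical path (the "my-server" key is an arbitrary label you choose):

```json
{
  "mcpServers": {
    "my-server": {
      "command": "node",
      "args": ["/absolute/path/to/server.js"]
    }
  }
}
```

Claude Desktop launches the command as a child process and talks to it over stdio.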
Real-World Server Patterns
Database Query Server
Exposes read-only SQL query execution with schema introspection. Uses parameterized queries to prevent injection. Returns column names and rows as structured data.
Git Repository Server
Provides code search, file reading, branch listing, and diff viewing for a git repository. Resources expose frequently accessed files like README and config.
Notification Server
Sends messages through email and Slack. Requires explicit confirmation for sends (returns preview first). Rate-limited to prevent spam.
MCP Inspector
Official interactive testing tool. Connect your server and manually test each tool, view schemas, and verify error handling. Best for development and debugging.
npx @modelcontextprotocol/inspector your-server.js
Unit Tests with In-Memory Transport
Create an InMemoryTransport pair for automated testing. Call tools programmatically and assert on responses. No external dependencies needed.
vitest run --coverage
Claude Desktop Integration
Add your server to Claude Desktop's config for end-to-end testing. Verify the model can discover, select, and correctly use your tools in real conversations.
Edit ~/Library/Application Support/Claude/claude_desktop_config.json
Key Insight
The best MCP servers are small and focused. A server that wraps GitHub with 5 well-designed tools is better than a “universal” server with 50 tools across 10 services. Clients can compose multiple focused servers. Each server should do one thing well.
Tool Selection & Routing
Dynamic tool loading for large registries
Tool selection is how you decide which tools the model sees for each request. With a small tool set (under 10), you can include all tools every time. But as your registry grows past 20 tools, you need dynamic selection — otherwise the model drowns in definitions and picks the wrong tool.
This is one of the highest-leverage optimizations in tool-using agents. Research from UC Berkeley (Gorilla) and LangChain (Bigtool) shows that tool selection accuracy drops dramatically as tool count increases — from 95% with 10 tools to 30% with 50+.
More tools means more context tokens, more confusion, and worse selection accuracy. These numbers are approximate but reflect consistent findings across multiple research papers.
Selection Strategies
Embed all tool descriptions in a vector database. When a request arrives, embed the user message and find the most similar tool descriptions. Return only the top-k matches.
How It Works
1. Index each tool's name + description as a document in a vector DB
2. On each request, embed the user message
3. Search for top-k most similar tool descriptions (k=5-10)
4. Include matched tools in the LLM call alongside core tools
5. Model sees only relevant tools, improving selection accuracy
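The RAG-based selection loop can be sketched without any infrastructure. In this toy version a word-overlap score stands in for real embedding similarity; a production system would embed descriptions with an embedding model and query a vector DB:

```typescript
type Tool = { name: string; description: string };

// Toy relevance score: count description words that appear in the
// message. A real system would use cosine similarity of embeddings.
function score(message: string, tool: Tool): number {
  const words = new Set(message.toLowerCase().split(/\W+/));
  const docWords = (tool.name + " " + tool.description).toLowerCase().split(/\W+/);
  return docWords.filter((w) => words.has(w)).length;
}

// Return only the k most relevant tools for this request.
function selectTopK(message: string, registry: Tool[], k = 5): Tool[] {
  return [...registry]
    .sort((a, b) => score(message, b) - score(message, a))
    .slice(0, k);
}
```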
Group tools into categories (billing, support, admin). Use a lightweight classifier to route to the right category first, then provide all tools within that category.
How It Works
1. Define tool categories: billing (5 tools), support (8 tools), admin (4 tools)
2. Use intent classification to detect the category from the user message
3. Load all tools in the matched category
4. Optionally use RAG within a category for further refinement
5. Fast and predictable — no embedding needed for the routing step
Combine coarse and fine selection. First, use category routing or keyword matching to narrow to 15-20 candidate tools. Then use RAG or a small model to pick the final 5-8.
How It Works
1. Stage 1: Fast filter using keywords, categories, or a classifier
2. Stage 2: Semantic search over the filtered subset for precise matching
3. Always include 'core' tools (think, respond_to_user) regardless of routing
4. Cache frequently co-occurring tool sets to skip routing for common queries
5. Monitor which tools actually get selected to optimize routing rules
Separate your tools into two groups: core tools that are always available (reasoning, responding) and dynamic tools that are loaded per-request based on relevance.
think: Always available. Lets the model reason through complex decisions before acting.
respond_to_user: Always available. Sends a final response to the user.
request_clarification: Always available. Asks the user for more information when the query is ambiguous.
search_orders: Dynamically loaded when the user's message relates to orders, purchases, or shipments.
create_ticket: Dynamically loaded when the user's message relates to support, issues, or escalation.
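Assembling the per-request tool set then becomes a merge of the always-on core and whatever the router matched. The keyword rules below are illustrative stand-ins for a real classifier or embedding search:

```typescript
// Core tools are included on every request.
const coreTools = ["think", "respond_to_user", "request_clarification"];

// Hypothetical routing rules: tool name -> trigger keywords.
const dynamicRoutes: Record<string, string[]> = {
  search_orders: ["order", "purchase", "shipment"],
  create_ticket: ["support", "issue", "escalate"],
};

function toolsForRequest(message: string): string[] {
  const text = message.toLowerCase();
  const dynamic = Object.entries(dynamicRoutes)
    .filter(([, keywords]) => keywords.some((k) => text.includes(k)))
    .map(([name]) => name);
  return [...coreTools, ...dynamic]; // core always first, dynamic appended
}
```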
Key Insight
Dynamic tool selection is not premature optimization — it is essential architecture for any agent with more than 15-20 tools. Without it, you are paying the full context cost of every tool definition on every request and accepting a 3x degradation in tool selection accuracy. The investment in a tool index pays back on every single request.
Error Handling & Retries
Graceful failure when tools break
Tools call external APIs, query databases, and interact with services that can fail. Error handling determines whether a tool failure crashes the agent, confuses the model, or gets handled gracefully with a recovery path.
The key principle: never return raw errors to the model. Stack traces waste tokens, expose internals, and confuse the model. Instead, return structured error objects that tell the model exactly what went wrong and what to do next.
Error Categories
Validation Errors
Retryable. The model sent invalid parameters — wrong type, missing required field, value out of range. These are the most common tool errors.
Not Found Errors
Retryable. The requested resource does not exist — invalid ID, wrong table name, deleted record.
Permission Errors
Non-retryable. The tool does not have permission to perform the requested action — restricted path, insufficient API scope.
Transient Errors
Retryable. Temporary failures — network timeouts, rate limits, service unavailability. The operation might succeed if retried.
Every tool should return a consistent error structure. These fields give the model everything it needs to handle the failure intelligently.
success: Lets the model (and your code) quickly determine if the tool call worked.
error.code: Machine-readable error category: VALIDATION_ERROR, NOT_FOUND, PERMISSION_DENIED, TIMEOUT.
error.message: Human-readable explanation the model can relay to the user or use for self-correction.
error.retryable: Tells the model whether to retry or give up. Prevents infinite retry loops.
error.suggestion: Specific guidance: 'Available tables: users, orders' or 'Try again in 5 seconds.'
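Put together, a structured error with these fields might look like the following sketch (the notFound helper and its table example are hypothetical):

```typescript
type ToolError = {
  success: false;
  error: {
    code: "VALIDATION_ERROR" | "NOT_FOUND" | "PERMISSION_DENIED" | "TIMEOUT";
    message: string;
    retryable: boolean;
    suggestion?: string;
  };
};

// Hypothetical helper for a database query tool.
function notFound(table: string, available: string[]): ToolError {
  return {
    success: false,
    error: {
      code: "NOT_FOUND",
      message: `Table '${table}' does not exist.`,
      retryable: true, // the model can retry with a corrected table name
      suggestion: `Available tables: ${available.join(", ")}`,
    },
  };
}
```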
Classify the error
Determine if the error is transient (retry) or permanent (fail fast). Rate limits, timeouts, and 503s are transient. Auth failures, 404s, and validation errors are permanent.
Apply exponential backoff
Wait 1s, then 2s, then 4s between retries. Add random jitter (0-500ms) to prevent thundering herd. Cap at 3-5 retries maximum.
Circuit breaker on repeated failures
If a tool fails 5 times in 60 seconds, trip the circuit breaker. Return instant failure for the next 30 seconds instead of waiting for timeouts.
Fallback response
When all retries fail and the circuit is open, return a graceful fallback: cached data, a degraded response, or an explicit 'service unavailable' message.
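Steps 1 and 2 of the retry strategy can be sketched as a small wrapper. The classification predicate and the parameter defaults are illustrative; production values would be a 1000ms base, 0-500ms jitter, and 3-5 retries as described above:

```typescript
const sleep = (ms: number) => new Promise((r) => setTimeout(r, ms));

// Retry a tool call with exponential backoff and jitter.
async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (e: unknown) => boolean, // step 1: classify the error
  maxRetries = 3,
  baseMs = 1000,
  jitterMs = 500,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (e) {
      // Permanent errors and exhausted budgets fail fast.
      if (!isRetryable(e) || attempt >= maxRetries) throw e;
      // Step 2: 1x, 2x, 4x... base delay plus random jitter.
      await sleep(baseMs * 2 ** attempt + Math.random() * jitterMs);
    }
  }
}
```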
Circuit Breaker Pattern
A circuit breaker prevents a failing tool from consuming resources on doomed requests. It has three states:
Closed
Normal operation. Tool calls execute as usual. Failure count is tracked.
Open
Failure threshold exceeded. All calls fail immediately with a cached error. No execution attempts.
Half-Open
After cooldown, one test call is allowed. If it succeeds, circuit closes. If it fails, circuit reopens.
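The three states above can be captured in a few lines. This sketch tracks state implicitly through the failure count and open timestamp; thresholds are illustrative:

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  canExecute(now = Date.now()): boolean {
    if (this.failures < this.threshold) return true; // closed: execute normally
    // Open: fail fast until the cooldown elapses, then allow one
    // half-open probe call.
    return now - this.openedAt >= this.cooldownMs;
  }

  recordSuccess() {
    this.failures = 0; // probe succeeded: close the circuit
  }

  recordFailure(now = Date.now()) {
    this.failures++;
    if (this.failures === this.threshold) this.openedAt = now; // trip open
  }
}
```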
Key Insight
The model cannot debug your infrastructure. When a tool fails, the model needs actionable information, not stack traces. Every error response should answer three questions: What happened? Can I retry? What should I do instead? If your errors answer these three questions, the model can recover from most failures without human intervention.
Security & Sandboxing
Safe tool execution in production
Tools give the model real-world agency — the ability to read files, query databases, send messages, and execute commands. This power requires a security model that assumes the model's outputs are untrusted user input.
The model can be manipulated via prompt injection, can hallucinate incorrect parameters, or can misunderstand ambiguous requests. Every tool interaction must pass through validation, sandboxing, and permission checks before execution.
Threat Model
Prompt Injection via Tool Inputs
An attacker crafts user input that causes the model to generate malicious tool parameters — SQL injection, command injection, or path traversal.
Example
User: 'Search for '); DROP TABLE users; --'
Mitigation
Validate all tool inputs with Zod schemas. Use parameterized queries for SQL. Never pass model-generated strings directly to shell commands or eval().
Data Exfiltration via Tool Results
A compromised or poorly designed tool returns sensitive internal data (API keys, session tokens, credentials) that the model then includes in its response to the user.
Example
Tool returns raw API response containing internal_api_key field
Mitigation
Filter tool outputs at the boundary. Define explicit response types that include only user-facing fields. Strip sensitive data before returning to the model.
Privilege Escalation
A tool runs with the application's full permissions. The model (or a prompt injection) exploits this to access resources the user should not have access to.
Example
File tool with no path restrictions reads /etc/passwd
Mitigation
Apply least privilege: read-only DB connections for read tools, scoped API keys, sandboxed file access. Create separate service accounts per tool category.
Denial of Service via Tool Abuse
The model enters an infinite retry loop, calls a tool thousands of times, or triggers expensive operations that consume resources or rack up API costs.
Example
Model retries a failing API call 500 times in a loop
Mitigation
Set per-tool rate limits, maximum retries (3-5), execution timeouts (5-30s), and per-session cost budgets. Use circuit breakers for external services.
Tool Confusion Attack
An attacker tricks the model into calling the wrong tool by carefully crafting input that matches one tool's description more closely than the intended tool.
Example
Input crafted to trigger delete_account instead of get_account_info
Mitigation
Require confirmation for destructive operations. Use distinct, non-overlapping tool descriptions. Implement undo/soft-delete instead of hard deletes.
Each layer catches a different class of attack. All four should be applied to tools that interact with system resources.
Schema Validation (Zod)
Validate types, constraints, enums, and patterns on every parameter before execution. Catch malformed inputs at the gate.
Input Sanitization
Escape or reject inputs that could be interpreted as commands. Prevent SQL injection, shell injection, and path traversal.
Path Normalization
Resolve and validate file paths to prevent directory traversal. Reject paths outside the allowed directory.
Command Allowlisting
Maintain an explicit list of allowed commands. Reject anything not on the list. Never use a blocklist — attackers will find what you missed.
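The path normalization and allowlisting layers can be sketched concretely. This assumes a POSIX filesystem and a hypothetical allowed root; the command list is illustrative:

```typescript
import path from "node:path";

// Resolve the requested path against the allowed root and reject
// anything that escapes it.
function resolveSandboxed(root: string, requested: string): string {
  const resolved = path.resolve(root, requested);
  // The path.sep-suffixed prefix check blocks both ../ traversal and
  // sibling-directory tricks like /data-secrets matching /data.
  if (resolved !== root && !resolved.startsWith(root + path.sep)) {
    throw new Error(`PERMISSION_DENIED: path escapes ${root}`);
  }
  return resolved;
}

// Explicit allowlist: anything not listed is rejected. Never use a
// blocklist for this.
const ALLOWED_COMMANDS = new Set(["ls", "cat", "grep"]);
function checkCommand(cmd: string) {
  if (!ALLOWED_COMMANDS.has(cmd)) {
    throw new Error(`PERMISSION_DENIED: command '${cmd}' not allowlisted`);
  }
}
```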
Permission Model
Classify every tool into a permission tier based on its potential impact. Each tier has different execution policies.
Read-Only
Safe to auto-execute. Cannot modify state. Uses read-only database connections and API keys with read scopes.
Write with Confirmation
Creates or modifies data. Returns a preview and requires explicit confirmation before executing. Idempotent where possible.
Destructive (Human Required)
Irreversible actions. Always requires human approval in the loop. Cannot be auto-confirmed. Logged with full audit trail.
Every tool call should be logged in an append-only audit trail. This is essential for debugging, security investigations, cost tracking, and compliance.
timestamp: When the tool was called (ISO 8601)
tool_name: Which tool was invoked
parameters: Input parameters (sanitized — no secrets)
result_status: success | error | timeout | denied
duration_ms: How long the execution took
caller_id: Which user/session triggered the call
model_id: Which model made the tool call
cost_tokens: Tokens consumed for this tool interaction
Key Insight
Treat every tool call as if it came from an untrusted external user — because in a real sense, it did. The model's outputs are influenced by user input, training data, and potentially adversarial prompts. Validate, scope, sandbox, and audit every tool interaction. The security boundary is not between the user and the model — it is between the model and your tools.
Production Tool Architecture
Versioning, monitoring, and scaling tools
Production tool architecture goes beyond writing individual tools. It encompasses versioning (so tools can evolve without breaking clients), monitoring (so you know when tools degrade), scaling (so tools handle production load), and governance (so your tool registry stays manageable).
Versioning Strategy
Schema Versioning
When you change a tool's parameters (add required fields, remove fields, change types), you must version the schema. Clients that depend on the old schema will break otherwise.
Deprecation Workflow
Before removing a tool, mark it as deprecated. Emit warnings in responses. Set a sunset date. Remove only after all clients have migrated.
Backward Compatibility
New optional parameters should have defaults. New response fields should be additive (never remove existing fields in a minor version).
Track these metrics per tool. Each category has different targets and alert thresholds.
Latency (alert: p95 > 3s for 5 minutes)
Median tool execution time
95th percentile — catches slow outliers
99th percentile — worst case scenarios
Reliability (alert: error rate > 5% for 2 minutes)
Percentage of tool calls that fail
Model sending invalid params
Calls exceeding timeout threshold
Usage & Cost (alert: cost/hour > 2x baseline)
Tool invocation volume for capacity planning
Context tokens consumed per tool interaction
Total cost including API calls and compute
Deployment Patterns
MCP server runs alongside the main application as a separate process. Communicates via stdio or local network. Simple to deploy and manage.
MCP server deployed as an independent service with its own lifecycle, scaling, and monitoring. Accessed via SSE transport over HTTP.
A central MCP gateway aggregates multiple MCP servers behind a single endpoint. Handles routing, auth, rate limiting, and tool selection centrally.
Stateless Tool Servers
Design tool servers to be stateless where possible. Store session state externally (Redis, database). This enables horizontal scaling — add more instances behind a load balancer.
Connection Pooling
Tools that connect to databases or APIs should use connection pools. A tool server handling 100 concurrent requests should not open 100 database connections.
Caching Tool Results
Cache frequently requested, slowly changing data. A tool that fetches company policies does not need to hit the database on every call. Use TTL-based caching with cache invalidation.
Rate Limiting by Tool
Different tools have different cost profiles. A search tool is cheap; a payment processing tool is expensive and rate-sensitive. Apply per-tool rate limits that match the underlying service constraints.
Graceful Degradation
When a tool server is overloaded, shed load gracefully. Return cached results, queue requests, or return a 'service busy' error with retry guidance — never drop requests silently.
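The caching pattern above can be sketched as a small TTL cache with explicit invalidation; the TTL value and key names are illustrative:

```typescript
// TTL-based cache for slowly changing tool results (e.g. company
// policies). Entries expire after ttlMs; writes can invalidate early.
class TtlCache<V> {
  private store = new Map<string, { value: V; expires: number }>();

  constructor(private ttlMs: number) {}

  get(key: string, now = Date.now()): V | undefined {
    const hit = this.store.get(key);
    if (!hit || now > hit.expires) return undefined; // miss or stale
    return hit.value;
  }

  set(key: string, value: V, now = Date.now()) {
    this.store.set(key, { value, expires: now + this.ttlMs });
  }

  invalidate(key: string) {
    this.store.delete(key); // call this when the underlying data changes
  }
}
```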
A tool registry is the source of truth for all available tools. Each entry should include metadata for discovery, governance, and operations.
name: Unique tool identifier (e.g. search_orders)
version: Semantic version of the tool schema (e.g. 2.1.0)
owner: Team or individual responsible (e.g. commerce-team)
status: Lifecycle state (active | deprecated | sunset)
description: Model-facing description for tool selection (e.g. "Search orders by email, date, or status...")
schema: Zod schema for input validation (e.g. z.object({ query: z.string(), ... }))
metrics: Real-time usage and performance data (e.g. { calls_24h: 15420, p95_ms: 340, error_rate: 0.2% })
dependencies: External services this tool requires (e.g. ['postgres', 'stripe-api'])
Key Insight
Tools are infrastructure, not just code. They need the same operational discipline as any production service: versioning, monitoring, alerting, capacity planning, and governance. A tool that works in demos but has no monitoring, no versioning, and no ownership is technical debt waiting to become an incident.
Interactive Examples
See tool use patterns in action with live code
See tool use patterns in action. Each example shows a bad pattern and its production-ready fix. Toggle between them to understand the difference.
Clear names and descriptions are critical for tool selection
// BAD: Ambiguous tool names with no descriptions
const tools = [
  {
    name: "do_stuff",
    description: "Does stuff with the database",
    parameters: {
      type: "object",
      properties: {
        data: { type: "string" },
      },
    },
  },
  {
    name: "handle_request",
    description: "Handles a request",
    parameters: {
      type: "object",
      properties: {
        input: { type: "string" },
      },
    },
  },
];

// Model has no idea which tool to use or what "data" means
const response = await llm.generate({
  tools,
  messages: [{ role: "user", content: "Look up order #1234" }],
});

Why this fails
Vague names like 'do_stuff' and generic descriptions give the model no signal for when to use which tool. Parameter names like 'data' and 'input' don't communicate expected formats or constraints.
All Examples Quick Reference
Tool Naming & Descriptions
Clear names and descriptions are critical for tool selection
Tool Output Design
What your tools return goes directly into the context window
Schema Validation with Zod
Validate tool inputs and outputs at runtime
MCP Server Implementation
Building an MCP server with the TypeScript SDK
Tool Error Handling
Design errors that help the model recover
Dynamic Tool Selection
Load only the tools relevant to the current task
Tool Execution Sandboxing
Isolate tool execution to prevent damage
Anti-Patterns & Failure Modes
Tool soup, leaky tools, and how to avoid them
Knowing what not to do is as important as knowing what to do. These are the most common failure modes in tool-using AI systems, drawn from production experience and research on function calling accuracy.
Registering dozens of tools with overlapping functionality, forcing the model to choose between near-identical options.
Cause
Adding tools incrementally without auditing for overlap. Three tools that all 'search records' — searchDB, querySQL, findRecords — each with slightly different schemas.
Symptom
Model picks the wrong tool 40%+ of the time. Wastes tokens deliberating between similar options. Tool call chains become unpredictable. Research shows accuracy drops from 95% to 30% going from 10 to 50 tools.
Fix
Audit your tool registry for overlaps. Merge similar tools into one with optional parameters. Keep under 15 tools per task context. Use dynamic tool selection (RAG over tool descriptions) for large registries.
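As a sketch of the merge step, three hypothetical overlapping search tools (searchDB, querySQL, findRecords) can collapse into one tool whose optional parameters cover all three; the field names are illustrative:

```typescript
// Sketch: one merged tool replaces three near-identical search tools.
// Optional parameters cover the union of the old tools' capabilities.
const searchRecords = {
  name: "search_records",
  description:
    "Searches customer records. Filter by any combination of name, " +
    "status, or creation date. Returns up to `limit` matches.",
  parameters: {
    type: "object",
    properties: {
      name: { type: "string", description: "Full or partial customer name." },
      status: { type: "string", enum: ["active", "inactive", "pending"] },
      created_after: { type: "string", description: "ISO date lower bound." },
      limit: { type: "integer", description: "Max results, default 10." },
    },
  },
};
```

One tool with four optional filters gives the model a single, unambiguous choice instead of three competing schemas.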
Tool parameters with vague names, missing descriptions, or no type constraints, leaving the model to guess the correct format.
Cause
Lazy parameter naming (data, input, params) with no descriptions. Missing required/optional distinctions. No examples of valid values.
Symptom
Model sends wrong parameter types (string instead of number), invents parameter names, or formats values incorrectly. Tool calls fail silently or produce unexpected results.
Fix
Use Zod or JSON Schema for every parameter. Add descriptions with examples. Mark required fields. Add patterns/enums for constrained values. If a human can't figure out the parameter format from the schema alone, the model can't either.
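A minimal, dependency-free sketch of boundary validation for a hypothetical get_user tool (a real system would use Zod or a full JSON Schema validator, as described above):

```typescript
// Minimal runtime check before execution. Production code would use
// Zod or a JSON Schema validator; this only illustrates rejecting
// malformed input with an actionable message.
type Check = { ok: true } | { ok: false; error: string };

function validateGetUser(args: Record<string, unknown>): Check {
  if (typeof args.user_id !== "number" || !Number.isInteger(args.user_id)) {
    return { ok: false, error: "user_id must be an integer, e.g. 42" };
  }
  if (args.fields !== undefined && !Array.isArray(args.fields)) {
    return { ok: false, error: "fields must be an array of field names" };
  }
  return { ok: true };
}
```

The error strings go back to the model, so they should state the expected format, not just that validation failed.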
Tools return entire API responses — metadata, pagination, internal IDs, audit logs — consuming thousands of context tokens with irrelevant data.
Cause
Passing through raw API responses without filtering. Returning full database rows instead of relevant fields. No truncation on list results.
Symptom
Context window fills up fast. Model gets confused by irrelevant fields. Important information is buried in noise. Token costs increase dramatically with no quality improvement.
Fix
Filter tool outputs to only fields the model needs. Truncate lists (return top 5, not all 500). Flatten nested structures. Remove internal IDs, timestamps, and metadata the model won't use. Aim for 90%+ reduction in raw response size.
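A sketch of that filtering for a hypothetical orders API: keep three fields per order, truncate to the top 5, and flag that truncation happened:

```typescript
// Sketch: trim a raw API response (field names hypothetical) to what
// the model needs. Extra fields and metadata never reach the context.
type RawOrder = { id: number; status: string; total: number; [k: string]: unknown };

function summarizeOrders(raw: { data: RawOrder[]; meta?: unknown }) {
  return {
    count: raw.data.length,
    orders: raw.data.slice(0, 5).map((o) => ({
      id: o.id,
      status: o.status,
      total: o.total,
    })),
    truncated: raw.data.length > 5,
  };
}
```

Returning `count` and `truncated` lets the model know results were cut without paying for the full list.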
Tools fail without returning actionable error information, leaving the model unable to recover or explain what went wrong.
Cause
Catching exceptions and returning null, empty strings, or generic 'error occurred' messages. Swallowing errors to avoid 'messy' responses.
Symptom
Model hallucinates results when it receives null. Makes up data to fill gaps. Retries the same failing call indefinitely. User gets confident-sounding wrong answers because the model doesn't know the tool failed.
Fix
Return structured error objects with: error code, human-readable message, whether the error is recoverable, and a specific suggestion for what to try next. Never return null — always return an explicit error or success.
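One way to sketch such a result type, with illustrative field names and a stubbed table lookup:

```typescript
// Sketch: every tool call returns an explicit success or error object,
// never null. The error carries a code, a message, recoverability, and
// a concrete suggestion the model can act on.
type ToolResult<T> =
  | { ok: true; data: T }
  | {
      ok: false;
      error: { code: string; message: string; recoverable: boolean; suggestion: string };
    };

function lookupTable(table: string): ToolResult<string[]> {
  const tables: Record<string, string[]> = {
    users: ["id", "email"],
    orders: ["id", "total"],
  };
  if (!(table in tables)) {
    return {
      ok: false,
      error: {
        code: "TABLE_NOT_FOUND",
        message: `Table '${table}' does not exist.`,
        recoverable: true,
        suggestion: `Available tables: ${Object.keys(tables).join(", ")}.`,
      },
    };
  }
  return { ok: true, data: tables[table] };
}
```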
Tools operate with the application's full permissions — database admin, file system access, network access — instead of scoped, minimal privileges.
Cause
Using the same database connection, API keys, and service accounts for tool execution as for the main application. No sandboxing layer between the model and system resources.
Symptom
A single prompt injection or model hallucination can delete data, read sensitive files, or exfiltrate information. Security audit reveals tools have access to resources they never need.
Fix
Apply the principle of least privilege: create read-only database views for read tools, separate API keys with minimal scopes, sandbox file system access to a working directory, set resource limits (timeout, memory), and audit tool permissions regularly.
The model invents tool names, parameters, or call patterns that don't exist in the provided tool definitions.
Cause
Tool names that are too similar to common patterns the model has seen in training data. Insufficient grounding in the tool schema. No validation layer between the model's output and tool execution.
Symptom
Model calls tools that don't exist (e.g., 'send_message' when only 'send_email' is defined). Invents parameters not in the schema. Attempts multi-step tool patterns it saw in training but that aren't supported.
Fix
Validate every tool call against the registered schema before execution. Return clear 'tool not found' errors with the list of available tools. Use unique, specific tool names that are unlikely to be confused with training data patterns.
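A sketch of that validation layer, assuming a small hypothetical registry:

```typescript
// Sketch: reject hallucinated tool names before execution and tell
// the model which tools actually exist.
const registry = new Set(["send_email", "search_orders"]);

function checkToolCall(name: string): { ok: boolean; error?: string } {
  if (!registry.has(name)) {
    return {
      ok: false,
      error: `Tool '${name}' not found. Available tools: ${[...registry].join(", ")}.`,
    };
  }
  return { ok: true };
}
```

Listing the available tools in the error gives the model what it needs to self-correct on the next turn.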
Best Practices Checklist
Production-ready guidelines for tool use and MCP
Production-ready guidelines for building tools and MCP servers, distilled from Anthropic documentation, the MCP specification, and real-world agent engineering experience.
One tool, one action
Each tool should do exactly one thing. A 'manage_user' tool that creates, updates, and deletes is harder for the model to use correctly than separate create_user, update_user, and delete_user tools.
Use verb_noun naming
Name tools as verb_noun: search_orders, create_ticket, get_user. This instantly communicates the action and target. Avoid generic names like 'handle' or 'process'.
Write descriptions for the model, not humans
Tool descriptions are part of the model's context. Include when to use the tool, what it returns, and edge cases. 'Searches orders by email or date range. Use when user wants to find orders but doesn't have an order ID.'
Validate with Zod at the boundary
Parse and validate every tool input with Zod before execution. This catches type errors, missing fields, and out-of-range values before they cause downstream failures.
Keep servers focused and composable
Build small, single-purpose MCP servers (one for GitHub, one for Jira, one for databases) rather than monolithic servers. Clients can compose multiple servers as needed.
Version your server protocol
Include version in your server's capabilities. When you add or change tools, increment the version so clients can detect incompatibilities.
Use resources for read-heavy data
MCP resources (file://, db://) are better than tools for data the client reads frequently. Resources can be cached and subscribed to, while tool calls always execute.
Test with the MCP Inspector
Use the official MCP Inspector tool to test your server independently before connecting to a client. It validates protocol compliance, schema correctness, and error handling.
Never return null — always return structured results
Return explicit success or error objects from every tool. A null return causes the model to hallucinate results. An error object tells the model exactly what went wrong and how to recover.
Include recovery suggestions in errors
When a tool fails, return what the model should try next: 'Table not found. Available tables: users, orders, products.' This turns errors into learning opportunities.
Implement idempotent retries
Design write operations to be idempotent so the model can safely retry on failure. Use unique request IDs to prevent duplicate actions.
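A sketch of request-ID deduplication around a stubbed payment call; the receipt format is invented for illustration:

```typescript
// Sketch: a retried call with the same request ID returns the original
// result instead of executing the write a second time.
const processed = new Map<string, string>();
let charges = 0;

function chargeCard(requestId: string, amountCents: number): string {
  const prior = processed.get(requestId);
  if (prior !== undefined) return prior; // retry: replay the first result
  charges += 1;
  const receipt = `receipt-${charges}-${amountCents}`;
  processed.set(requestId, receipt);
  return receipt;
}
```

In production the dedup store would be persistent (a database table keyed by request ID), not an in-memory map.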
Set timeouts on every tool call
A tool that hangs blocks the entire agent loop. Set aggressive timeouts (5-30 seconds) and return a timeout error the model can handle, rather than waiting indefinitely.
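A sketch of a timeout wrapper using Promise.race; the result shape and timeout message are illustrative:

```typescript
// Sketch: race the tool call against a timer. A hung tool yields a
// structured timeout error instead of blocking the agent loop.
type Timed<T> = { ok: true; data: T } | { ok: false; error: string };

async function withTimeout<T>(work: Promise<T>, ms: number): Promise<Timed<T>> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<Timed<T>>((resolve) => {
    timer = setTimeout(
      () => resolve({ ok: false, error: `Tool timed out after ${ms}ms. Try a narrower query.` }),
      ms,
    );
  });
  const winner = await Promise.race([work.then((data): Timed<T> => ({ ok: true, data })), timeout]);
  if (timer !== undefined) clearTimeout(timer);
  return winner;
}
```

The timeout error goes back to the model like any other structured error, so it can retry with a smaller request.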
Apply least privilege to every tool
Tools should have the minimum permissions needed. Read-only tools get read-only database access. File tools are restricted to a working directory. API keys have minimal scopes.
Validate and sanitize all tool inputs
Never pass model-generated strings directly to system commands, SQL queries, or file paths. Validate with schemas, use parameterized queries, and sanitize file paths.
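For file paths specifically, a sketch that confines model-supplied paths to a working directory (POSIX paths assumed; directory names are hypothetical):

```typescript
import * as path from "node:path";

// Sketch: resolve the model-supplied path against a root directory and
// reject any result that escapes it before doing any I/O.
function resolveSafe(root: string, userPath: string): string | null {
  const rootResolved = path.resolve(root);
  const resolved = path.resolve(root, userPath);
  if (resolved !== rootResolved && !resolved.startsWith(rootResolved + path.sep)) {
    return null; // path traversal attempt, e.g. "../../etc/passwd"
  }
  return resolved;
}
```

The same pattern applies to SQL (parameterized queries only) and shell commands (argument arrays, never string interpolation).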
Log every tool call for audit
Record the tool name, input parameters, output, duration, and caller for every tool execution. This is essential for debugging, cost tracking, and security auditing.
Require human approval for destructive actions
Tools that delete data, send messages, make payments, or modify infrastructure should require explicit human confirmation before execution. Never auto-approve irreversible actions.
Keep active tool sets under 15 per request
Research shows tool selection accuracy drops significantly beyond 20 tools. Use dynamic selection via RAG to keep the active set small while supporting large registries.
Index tool descriptions for semantic search
Embed tool names and descriptions in a vector database. When a request arrives, find the top-k most relevant tools. This scales to hundreds of tools while keeping per-request context small.
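A sketch of top-k tool retrieval; simple word overlap stands in here for real embedding similarity, and the tool registry is hypothetical:

```typescript
// Sketch: rank registered tools against a request and keep the top k.
// A real system would embed descriptions and query a vector index;
// word overlap is a stand-in for cosine similarity.
const toolDocs = [
  { name: "search_orders", doc: "search orders by email or date range" },
  { name: "create_ticket", doc: "create a support ticket for a customer issue" },
  { name: "get_invoice", doc: "fetch an invoice pdf by invoice id" },
];

function topTools(query: string, k: number): string[] {
  const words = new Set(query.toLowerCase().split(/\W+/));
  return toolDocs
    .map((t) => ({
      name: t.name,
      score: t.doc.split(/\W+/).filter((w) => words.has(w)).length,
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((t) => t.name);
}
```

Only the top-k tool definitions are then injected into the request context, keeping the active set small.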
Use tool categories for coarse routing
Group tools by domain (billing, search, admin). Route to a category first based on intent classification, then do fine-grained tool selection within that category.
Build evals for tool selection accuracy
Create test suites that verify the model picks the correct tool for known queries. Track selection accuracy as a metric. Regression test when adding new tools to catch description conflicts.
The Guiding Principle
Every tool is a contract between the model and the real world. The model promises to send valid parameters; the tool promises to return useful, safe results. When either side breaks the contract, the agent fails. Design tools that make the contract easy to uphold: clear schemas, structured errors, minimal outputs, and defense in depth.
— Anthropic, Tool Use Documentation
Resources & Further Reading
Docs, specs, repos, and guides
Essential documentation, specifications, repositories, and guides for mastering tool use and the Model Context Protocol.
Tool Use with Claude — Anthropic Documentation
Official documentation on Claude's tool use capabilities including function calling, parallel tool calls, and streaming tool results.
Model Context Protocol — Official Documentation
The official MCP specification, quickstart guides, and architecture overview. The definitive reference for building MCP servers and clients.
Model Context Protocol TypeScript SDK
Official TypeScript SDK for building MCP servers and clients. Includes server framework, transport implementations, and examples.
Introducing the Model Context Protocol
Anthropic's announcement of MCP: why it was created, the problem it solves, and the vision for a universal tool standard.
MCP Servers — Community Directory
Official and community-built MCP servers for GitHub, Slack, PostgreSQL, filesystem, and dozens more integrations.
Gorilla: Large Language Model Connected with Massive APIs
Research on training LLMs to accurately use tools from large API registries. Key findings on tool selection accuracy degradation with scale.
Toolformer: Language Models Can Teach Themselves to Use Tools
Foundational paper on how language models learn to use tools. Demonstrates self-supervised approaches to tool use learning.
Building Effective Agents — Anthropic
Comprehensive guide to building production agents, including tool design principles, orchestration patterns, and common failure modes.
Function Calling — OpenAI Documentation
OpenAI's approach to function calling, useful for understanding cross-platform tool design patterns and the JSON Schema specification.
MCP Inspector — Testing Tool
Official tool for testing MCP servers interactively. Validates protocol compliance, tests tool schemas, and debugs server behavior.
The MCP Ecosystem is Growing Fast
MCP was released in November 2024 and has seen rapid adoption. As of early 2025, Claude Desktop, Cursor, Zed, Windsurf, and Cline all support MCP natively. The community has built 100+ MCP servers for services like GitHub, Slack, PostgreSQL, Notion, Linear, and many more. Check the MCP Servers Repository above for the latest community contributions.