Giving an agent the wrong tools for a task causes more failures than giving it the wrong instructions. A well-prompted agent with poor tools will fail reliably. A loosely-prompted agent with the right tools will muddle through.
Here is how to decide which tool type fits which situation, and how to design clean boundaries between them.
—
The three tool categories and their tradeoffs
Web search is the right tool when the information is external, current, and not something you can predict in advance. Its weakness is reliability: results vary, pages go down, and the agent has to parse unstructured content. Use it for research tasks, news, and anything where the answer might have changed recently.
Local files are the right tool when the data is known, structured, and under your control. Reading and writing files is fast, deterministic, and cheap. The weakness is staleness: if the file is out of date, the agent will not know. Use files for configuration, memory, templates, cached data, and anything the agent itself has written in a previous step.
External APIs are the right tool when you need to take an action or retrieve data from a specific system: send a message, create a calendar event, query a database, call a service. APIs are the most powerful category and the most dangerous. A failed or misdirected API call can have real-world consequences that are hard to undo. Design the boundaries here with the most care.
—
How agents decide which tool to call
Agents pick tools based on descriptions. The model reads the tool name and description, matches it against the task, and calls what seems most relevant.
This means two things:
- Tool descriptions are instructions. Vague descriptions produce unreliable tool selection. A tool called search with description “searches for information” will get called for everything. A tool called web_search with description “searches the live web for current information not available locally” will get called only when the task genuinely needs external current data.
- Too many tools degrade performance. When an agent has 20+ tools, selection quality drops noticeably. The model spends more of its reasoning budget on tool selection and less on the actual task. A practical ceiling for most agents is 8-12 tools. If you have more, group related tools under a dispatcher or prune the ones that are rarely used.
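One way to picture the dispatcher idea is a single tool that routes to several related operations, so the agent sees one entry in its tool list instead of many. The sketch below is only an illustration: the calendar tool, its two actions, and the UNKNOWN_ACTION and BAD_ARGUMENTS codes are hypothetical names, not part of any particular framework.

```python
from typing import Callable


# Hypothetical calendar operations; a real version would call a calendar API.
def create_event(title: str) -> dict:
    return {"ok": True, "action": "create_event", "title": title}


def cancel_event(title: str) -> dict:
    return {"ok": True, "action": "cancel_event", "title": title}


# The agent sees one tool ("calendar"); the action name selects the operation.
CALENDAR_ACTIONS: dict[str, Callable[..., dict]] = {
    "create_event": create_event,
    "cancel_event": cancel_event,
}


def calendar(action: str, **kwargs) -> dict:
    """Single dispatcher tool exposed to the agent in place of many small ones."""
    handler = CALENDAR_ACTIONS.get(action)
    if handler is None:
        return {
            "ok": False,
            "error": "UNKNOWN_ACTION",
            "message": f"calendar has no action named {action!r}",
            "suggestion": "Use one of: " + ", ".join(CALENDAR_ACTIONS),
        }
    try:
        return handler(**kwargs)
    except TypeError as exc:
        # Raised when the arguments do not match the handler's signature.
        return {
            "ok": False,
            "error": "BAD_ARGUMENTS",
            "message": str(exc),
            "suggestion": f"Check the arguments expected by {action}",
        }
```

The tradeoff is that the dispatcher's description now has to explain every action it accepts, so this helps most when the grouped operations are closely related.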
—
Tool granularity: broad vs specific
The question of whether to give an agent one broad tool or several specific ones comes up constantly. Both approaches have real tradeoffs.
One broad tool (e.g., a single file_manager that can read, write, list, and delete) is simpler to maintain and gives the agent flexibility. The downside: it is harder to restrict, harder to log at a useful level of detail, and harder to debug when something goes wrong.
Many specific tools (e.g., read_file, write_file, list_directory, delete_file as separate tools) give you fine-grained control. You can expose read_file to an untrusted agent without exposing delete_file. You can log writes separately from reads. The downside: the tool list grows, descriptions multiply, and selection gets harder.
A useful middle ground: group by risk level, not by function. Read operations in one tool, write operations in another, destructive operations in a third. This gives you permission control without an explosion of tool count.
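As a rough sketch of grouping by risk level, each tool can carry a risk tier and each agent can be given a maximum tier it is allowed to use. The tool names (file_read, file_write, file_destroy) and the Risk enum are made up for this example.

```python
from enum import IntEnum


class Risk(IntEnum):
    READ = 1         # cannot change anything
    WRITE = 2        # creates or modifies data
    DESTRUCTIVE = 3  # deletes or overwrites irreversibly


# Hypothetical tool names: one tool per risk tier rather than one per function.
TOOL_RISK: dict[str, Risk] = {
    "file_read": Risk.READ,           # read a file, list a directory
    "file_write": Risk.WRITE,         # create or append to a file
    "file_destroy": Risk.DESTRUCTIVE, # delete or truncate a file
}


def tools_for(max_risk: Risk) -> list[str]:
    """Return the tool names an agent may see, given its highest allowed tier."""
    return [name for name, risk in TOOL_RISK.items() if risk <= max_risk]
```

An untrusted agent would be built with tools_for(Risk.READ), which yields only file_read, while a trusted maintenance agent could receive all three.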
—
Error handling at the tool layer
How a tool fails matters as much as what it does when it succeeds.
Agents can only recover from errors they can understand. If your tool throws an unhandled exception and the agent receives a stack trace, it will either retry blindly or give up. If your tool returns a structured error with a clear message, the agent has something to reason about.
The pattern that works:
{
  "ok": false,
  "error": "FILE_NOT_FOUND",
  "message": "No file at path /data/config.json",
  "suggestion": "Check the path or create the file first"
}
Besides the ok flag, three fields matter: a machine-readable error code, a human-readable message, and optionally a suggestion for what to do next. The suggestion field is underrated. It gives the agent a recovery path without requiring it to reason from scratch.
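A minimal sketch of a tool that produces this shape might look like the following. The function name read_json_file and every error code other than FILE_NOT_FOUND are assumptions made for illustration. The point is that exceptions are caught inside the tool and turned into data the agent can read.

```python
import json
from pathlib import Path


def _error(code: str, message: str, suggestion: str) -> dict:
    """Build the structured error shape: ok flag, code, message, suggestion."""
    return {"ok": False, "error": code, "message": message, "suggestion": suggestion}


def read_json_file(path: str) -> dict:
    """Hypothetical tool: read and parse a JSON file, never raising to the agent."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except FileNotFoundError:
        return _error("FILE_NOT_FOUND", f"No file at path {path}",
                      "Check the path or create the file first")
    except PermissionError:
        return _error("PERMISSION_DENIED", f"Cannot read {path}",
                      "Ask for access or choose a different file")
    except OSError as exc:
        # Any other filesystem problem, for example the path is a directory.
        return _error("READ_FAILED", f"Could not read {path}: {exc}",
                      "Verify that the path points to a regular file")
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return _error("INVALID_JSON", f"{path} is not valid JSON: {exc.msg}",
                      "Fix the file contents or rewrite the file")
    return {"ok": True, "data": data}
```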
Additionally:
- Always return something. Never let a tool return undefined or null on failure. The agent will misinterpret silence.
- Distinguish retryable from non-retryable errors. A rate limit error is retryable. A permission denied error is not. If the agent cannot tell the difference, it will retry both or retry neither.
- Cap side effects on failure. If a tool fails partway through a multi-step operation, return what succeeded and what did not. Partial information is better than none.
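The last two points can be combined in one return value. The sketch below assumes a hypothetical multi-recipient send operation, a ToolError exception carrying an error code, and an invented set of retryable codes. It adds a retryable flag and reports which steps completed before the failure.

```python
from typing import Callable

# Assumed set of codes that are safe to retry; adjust for the real system.
RETRYABLE_CODES = {"RATE_LIMITED", "TIMEOUT"}


class ToolError(Exception):
    """Hypothetical exception raised by the underlying send function."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(message)
        self.code = code
        self.message = message


def send_all(recipients: list[str], send: Callable[[str], None]) -> dict:
    """Send to each recipient in order; stop at the first failure and report progress."""
    completed: list[str] = []
    for index, recipient in enumerate(recipients):
        try:
            send(recipient)
        except ToolError as exc:
            return {
                "ok": False,
                "error": exc.code,
                "message": exc.message,
                "retryable": exc.code in RETRYABLE_CODES,
                "completed": completed,                      # already sent, do not resend
                "failed": recipient,                         # the step that raised
                "not_attempted": recipients[index + 1:],     # never reached
            }
        completed.append(recipient)
    return {"ok": True, "completed": completed}
```

With this shape the agent can retry only the failed and not-attempted recipients when retryable is true, instead of repeating the whole operation.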
—
A design checklist before adding a new tool
Before adding any tool to an agent:
- Can the agent do this task with an existing tool? If yes, skip the new one.
- Is the description specific enough that the agent will only call it when appropriate?
- What is the worst-case outcome if this tool is called with bad arguments? Is that acceptable?
- What does the tool return on failure, and can the agent recover from it?
- Does adding this tool push the total over 12? If so, what gets removed or grouped?
The last question is the one most people skip. Every tool you add has a cost in selection overhead and prompt complexity. The best tool sets are small, specific, and boring.
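Two of the checklist items, the tool count and the specificity of descriptions, can be checked mechanically. The sketch below is one possible lint, not a standard: the 12-tool ceiling comes from the checklist, while the minimum description length of 8 words is an arbitrary threshold chosen for illustration, and a longer description is not necessarily a better one.

```python
MAX_TOOLS = 12             # ceiling taken from the checklist above
MIN_DESCRIPTION_WORDS = 8  # arbitrary assumption; tune to taste


def review_tool_set(tools: dict[str, str]) -> list[str]:
    """Return warnings for a mapping of tool name to description."""
    warnings: list[str] = []
    if len(tools) > MAX_TOOLS:
        warnings.append(
            f"{len(tools)} tools exceeds the ceiling of {MAX_TOOLS}: group or prune"
        )
    for name, description in tools.items():
        if len(description.split()) < MIN_DESCRIPTION_WORDS:
            warnings.append(f"{name}: description may be too vague ({description!r})")
    return warnings
```

The remaining questions, such as the worst-case outcome of a bad call, still need a human answer.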
What tool design patterns have you found most reliable in practice? Curious whether others have hit the too-many-tools problem and how you solved it.