free_tool
Are your agent's tools production-ready?
An agent is only as reliable as the tools you hand it. Two tools that read the same, a parameter with no description, a delete tool the model can be talked into calling: those don't show up in a demo. Paste your definitions and get a graded report on the gaps, with the specific fix for each.
Accepts Anthropic, OpenAI, Gemini, or MCP tools/list JSON. Runs entirely in your browser. Nothing is uploaded or stored.
Tool-set readiness
61/100
1 to fix · 6 warnings · 4 passed
5 tools · 6 params · ~213 tokens/turn · Anthropic
Grade D, score 61 out of 100, 1 to fix, 6 warnings, 4 passed, across 5 tools.
By tool · worst first
High-impact tools state a guardrail
high · safety · delete_user, run_sql
Destructive tool(s) with no guardrail in the contract: delete_user, run_sql. A tool that deletes, pays, deploys, or runs code is the target a prompt-injection payload steers toward. If neither the description nor the schema mentions confirmation, a dry-run, or a scoped blast radius, one bad tool call does real damage.
Build the brake into the contract: add a require_confirmation or dry_run parameter, say in the description that the action is irreversible and needs explicit user approval, and keep destructive tools out of any autonomous loop.
Tools are distinguishable from one another
high · selection · search_docs ↔ find_documents
These pairs look interchangeable by name or description: search_docs ↔ find_documents. When two tools read the same, the model picks between them at random, which surfaces as flaky, hard-to-reproduce behavior.
Merge the duplicates into one tool, or sharpen each description to name the case it owns and say which to prefer when both seem to fit.
Descriptions say when to reach for the tool
medium · selection · search_docs, find_documents, delete_user, run_sql +1 more
Most descriptions state what the tool is but not when to use it (search_docs, find_documents, delete_user, run_sql +1 more). A model selects far more reliably when each description includes an explicit trigger ("Use this when the user asks to ...").
Add a trigger clause to each description: "Use this when the user wants to <task>." Lead with the trigger if the tool is easy to confuse with another.
Parameters carry descriptions
medium · schema · search_docs, find_documents, delete_user, run_sql +1 more
6 of 6 parameters (including nested fields) have no description, in search_docs, find_documents, delete_user, run_sql +1 more. An undescribed parameter is one the model fills by guessing the format, which is where malformed tool calls come from.
Give every parameter a description with its format and an example, e.g. "ISO-8601 date, like 2026-01-31". The model only knows what you tell it.
Schemas declare which parameters are required
medium · schema · search_docs, find_documents, run_sql, set_status
No "required" array on: search_docs, find_documents, run_sql, set_status. With nothing marked required, the model treats every argument as optional and routinely omits the ones the tool actually needs, so the call fails at the boundary.
List the mandatory parameters: "required": ["id", "amount"]. On OpenAI strict mode every property must be required, so make truly optional ones nullable instead.
No unconstrained pass-through tools
medium · safety · run_sql
Pass-through tool(s) that hand the model a single free-text field of code, SQL, or shell: run_sql. A run_sql(query) or exec(command) tool has the widest possible blast radius and is the easiest to weaponize through injection, since the model writes the payload itself.
Replace the free-text pass-through with narrow, intent-specific tools (get_orders_by_status instead of run_sql), or constrain it to a vetted allow-list of operations.
Closed-set parameters are constrained
low · schema · set_status.status
Free-text where an enum belongs (set_status.status). An unconstrained field invites a value the backend then rejects, turning a tool call into a retry loop.
Add "enum": ["a","b"] to fields with a fixed set of values, and set additionalProperties: false so the model can't invent extra arguments.
Tool names are valid and unique
format · All tools
Every tool has a unique name in the accepted character set, so the model can address each one unambiguously.
Every tool has a real description
selection · All tools
Every tool carries a substantive description, which is what the model reads to decide whether a tool fits the request.
Tool count fits the model's selection budget
cost · All tools
5 tools, which is within the range a model can choose from reliably.
Tool block stays within a sane token budget
cost · All tools
Roughly 213 tokens (estimated) for the whole tool block. That's a reasonable standing cost to re-send each turn.
Clean schemas are the floor, not the ceiling. Which tools the model can reach in an autonomous loop, how you fence a destructive call behind confirmation, and the evals that catch a wrong tool call before a user does: that's where agents actually break in production. That's the kind of review I do.
Get your agent production-ready: book a call
Heuristic static analysis of the tool definitions only. It reads the names, descriptions and schemas you paste; it can't see your tool handlers or what the tool actually does. A clean grade means the obvious, commonly-missed gaps are covered, not that the tool set is correct for your task. Runs entirely in your browser and uploads nothing.
why_it_matters
The model picks tools from the descriptions you wrote
A tool-calling agent never sees your code. It chooses what to call from the names, descriptions and schemas in the tool block, and nothing else. When two tools read the same, it picks one at random. When a parameter has no description, it guesses the format. When a destructive tool carries no guardrail, it's the first thing a prompt-injection payload reaches for. None of that surfaces on the happy path.
This auditor grades the whole set the way the model reads it: are the tools distinguishable, are the parameters described and constrained, do the high-impact tools state a brake, and is the block small enough to re-send on every turn without taxing the context window. So the gaps that turn into flaky, hard-to-reproduce agent behavior get caught before they ship.
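As a rough sketch of what a definition that speaks to those checks might look like, here is a delete_user tool in the Anthropic shape (name, description, input_schema). The parameter names (user_id, dry_run), the description wording, and the small missing_descriptions helper are illustrative assumptions, not the auditor's own rules or output.

```python
# Illustrative only: the parameter names and description text are assumptions.
delete_user_tool = {
    "name": "delete_user",
    "description": (
        # Trigger clause first, then the guardrail stated in the contract.
        "Use this when the user explicitly asks to permanently delete an account. "
        "This action is irreversible and needs explicit user approval. "
        "Call with dry_run=true first to preview what would be removed."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "user_id": {
                "type": "string",
                "description": "Account identifier, like 'usr_12345'.",
            },
            "dry_run": {
                "type": "boolean",
                "description": "If true, report what would be deleted without deleting.",
            },
        },
        # Mandatory arguments are declared, and extra arguments are rejected.
        "required": ["user_id", "dry_run"],
        "additionalProperties": False,
    },
}


def missing_descriptions(tool: dict) -> list[str]:
    """Return the names of top-level parameters that have no description."""
    props = tool["input_schema"].get("properties", {})
    return [name for name, spec in props.items() if not spec.get("description")]


print(missing_descriptions(delete_user_tool))  # prints []
```

The helper only looks at top-level properties; a fuller check would also walk nested fields.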
New to the mechanism? Start with what tool calling is, or read the 6 ways an agent's tools fail, each with a before/after fix.
faq
Questions & answers
- What does the MCP & Agent Tool Auditor check?
- It grades a whole set of tool definitions across four areas: selection (every tool has a real description, the descriptions say when to use the tool, and no two tools read as interchangeable), schema (parameters carry descriptions, required parameters are declared, and closed-set fields are constrained), safety (destructive or high-impact tools state a guardrail, and there are no unconstrained code/SQL/shell pass-through tools), and cost (the tool count fits the model's selection budget and the serialized tool block stays within a sane per-turn token tax). Each finding names the offending tools and gives a concrete fix.
- Which tool formats does it accept?
- Anthropic tools (name, description, input_schema), OpenAI functions (type function with a function wrapper and parameters), Google Gemini functionDeclarations, and an MCP tools/list result (name, description, inputSchema). It also reads the common envelopes: a bare array of tools, a request body with a tools array, or a single tool object. It auto-detects which shape you pasted and reports it.
- Why does it flag two of my tools as interchangeable?
- Because their names normalize to the same thing (get_user and fetch_user, say) or their descriptions share most of their content words. The model chooses a tool from its name and description alone, so when two read the same it picks between them at random. That shows up as flaky, hard-to-reproduce behavior. The fix is to merge the duplicates or sharpen each description to name the case it owns.
- How does the destructive-tool guardrail check work?
- It flags a tool whose name or description involves a high-impact action (delete, drop, exec, run SQL or shell, deploy, transfer, pay, send email) when neither the description nor the schema mentions a brake: a confirmation step, a dry-run, an irreversibility warning, or a scoped/reversible action. A destructive tool with no guardrail is the target a prompt-injection payload steers toward, so the auditor treats that as a hard fail. Adding a confirmation or dry-run signal clears it.
- Is this a guarantee my tools are correct?
- No. It is a heuristic static analysis of the definitions only. It reads the names, descriptions and schemas you paste; it cannot see your tool handlers, what each tool actually does, or your runtime guardrails. A clean grade means the obvious, commonly-missed gaps are covered, not that the tool set is correct for your task. Treat it as a fast review checklist, then test against your real cases.
- Are my tool definitions uploaded or stored anywhere?
- No. The whole audit runs in your browser with no network call, so the definitions you paste, including internal tool names and descriptions, never leave the page and are never logged. It is safe to audit a production tool set here.
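The interchangeability check described in the answers above (names that normalize to the same thing, or descriptions that share most of their content words) can be approximated in a few lines. This is a sketch under stated assumptions: the synonym table, the stop-word list, the Jaccard measure, and the 0.6 threshold are all hypothetical choices, not the auditor's actual implementation.

```python
import re

# Hypothetical synonym table: maps name tokens to a canonical form.
SYNONYMS = {"fetch": "get", "retrieve": "get", "find": "search", "docs": "documents"}

# Hypothetical stop-word list: words that carry no content.
STOP_WORDS = {"a", "an", "the", "of", "for", "to", "in", "and", "or", "by", "from", "with"}


def normalize_name(name: str) -> str:
    """Lowercase, split on separators, and map each token to its canonical form."""
    tokens = re.split(r"[_\-\s]+", name.lower())
    return "_".join(SYNONYMS.get(t, t) for t in tokens if t)


def content_words(text: str) -> set[str]:
    """Return the set of lowercase words in text, minus stop words."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in STOP_WORDS}


def look_interchangeable(a: dict, b: dict, threshold: float = 0.6) -> bool:
    """True if the names normalize identically or the descriptions mostly overlap."""
    if normalize_name(a["name"]) == normalize_name(b["name"]):
        return True
    words_a = content_words(a.get("description", ""))
    words_b = content_words(b.get("description", ""))
    if not words_a or not words_b:
        return False
    # Jaccard similarity: shared content words over all content words.
    overlap = len(words_a & words_b) / len(words_a | words_b)
    return overlap >= threshold


get_user = {"name": "get_user", "description": "Look up a user by id."}
fetch_user = {"name": "fetch_user", "description": "Load one account record."}
set_status = {"name": "set_status", "description": "Set the status of an order."}

print(look_interchangeable(get_user, fetch_user))  # prints True
print(look_interchangeable(get_user, set_status))  # prints False
```

A pair flagged this way is a prompt to merge the tools or sharpen each description, not proof that they do the same thing.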
Want the whole agent looked at, not just the tool schemas?
The schemas are the floor. I'll review which tools the loop can reach autonomously, how you fence destructive calls behind confirmation, idempotency on the side effects, and the evals that catch a wrong tool call before users do. Book a call, or leave your email.
Prefer proof first? See how this plays out in real case studies →