Destructive tools with no guardrail

A tool that deletes, pays, deploys, or runs code, with nothing in its contract about confirmation or scope, is the exact target a prompt-injection payload steers toward. One bad or injected call does irreversible damage.

see_it · fix_it

The failure, then the fix

Each verdict below is the actual output of the MCP & Agent Tool Auditor run on the snippet, not a description of one.

before
[
  {
    "name": "delete_account",
    "description": "Delete a user account and all of its data by id. Use this when an admin requests account removal.",
    "input_schema": {
      "type": "object",
      "properties": { "account_id": { "type": "string", "description": "The account UUID" } },
      "required": ["account_id"]
    }
  }
]

Fails · auditor verdict
Destructive tool(s) with no guardrail in the contract: delete_account. A tool that deletes, pays, deploys, or runs code is the target a prompt-injection payload steers toward. If neither the description nor the schema mentions confirmation, a dry-run, or a scoped blast radius, one bad tool call does real damage.

after
[
  {
    "name": "delete_account",
    "description": "Delete a user account and all of its data by id. This action is irreversible and requires explicit user confirmation before it runs; never call it inside an autonomous loop. Use this only when an admin has confirmed account removal.",
    "input_schema": {
      "type": "object",
      "properties": {
        "account_id": { "type": "string", "description": "The account UUID" },
        "confirm": { "type": "boolean", "description": "Must be true, set only after the user explicitly confirmed the deletion" }
      },
      "required": ["account_id", "confirm"]
    }
  }
]

Passes · auditor verdict
High-impact tools mention a brake (confirmation, dry-run, or a scoped/reversible action), so a single stray call can't quietly cause irreversible damage.

fix · Build the brake into the contract: add a require_confirmation or dry_run parameter, say in the description that the action is irreversible and needs explicit user approval, and keep destructive tools out of any autonomous loop.

why_it_matters

Models don't reliably separate instructions from data, so any text an agent reads (a web page, an email, a tool result) can try to hijack it. The damage that hijack can do is bounded by what the tools allow. A delete_account or transfer_funds tool that an agent can call autonomously, with no confirmation step and no scoping in its contract, is the highest-value thing an injected instruction can reach. This is the blast-radius problem, and it's a design choice in the tool definition, not just the prompt.

Build the brake into the contract. Add a require_confirmation or dry_run parameter, state in the description that the action is irreversible and needs explicit user approval, scope it as tightly as possible, and keep destructive tools out of any fully autonomous loop. The auditor fails a tool whose name or description involves a high-impact action when neither the description nor the schema mentions any guardrail.
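
The rule described above can be approximated in a few lines. This is a minimal sketch, not the auditor's actual implementation: the keyword lists `HIGH_IMPACT` and `GUARDRAIL` and the function name `missing_guardrail` are assumptions made for illustration, and a real check would likely use a broader vocabulary.

```python
import json
import re

# Hypothetical keyword lists; the auditor's real vocabulary is not given in this article.
HIGH_IMPACT = re.compile(r"\b(delete|remove|drop|pay|transfer|refund|deploy|execute|run)\b", re.I)
GUARDRAIL = re.compile(r"confirm|dry[ -]?run|irreversible|approv|scope", re.I)


def missing_guardrail(tools: list[dict]) -> list[str]:
    """Return names of tools that look high-impact but mention no brake."""
    flagged = []
    for tool in tools:
        name = tool.get("name", "")
        description = tool.get("description", "")
        schema_text = json.dumps(tool.get("input_schema", {}))
        # Underscores become spaces so "delete_account" exposes the word "delete".
        surface = f"{name} {description}".replace("_", " ")
        contract = f"{description} {schema_text}".replace("_", " ")
        if HIGH_IMPACT.search(surface) and not GUARDRAIL.search(contract):
            flagged.append(name)
    return flagged


before = [{
    "name": "delete_account",
    "description": "Delete a user account and all of its data by id. "
                   "Use this when an admin requests account removal.",
    "input_schema": {
        "type": "object",
        "properties": {"account_id": {"type": "string", "description": "The account UUID"}},
        "required": ["account_id"],
    },
}]
print(missing_guardrail(before))  # the "before" snippet is flagged
```

A keyword match like this only shows that the contract mentions a brake. It says nothing about whether the server enforces one, which is why the check belongs alongside enforcement in code rather than in place of it.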

delete_account() · injection's favorite target · confirmation + scope in the contract

faq

Questions & answers

How do you protect an AI agent from prompt injection on destructive tools?
You cannot fully prevent injection, so you cap what a successful one can do. Keep destructive tools out of autonomous loops, require explicit confirmation for irreversible actions, scope each tool as tightly as possible, validate arguments in code rather than trusting the model, and add a dry_run path. The guardrail belongs in the tool's contract, not only in the prompt.
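
As a rough illustration of validating arguments in code and offering a dry_run path, the sketch below shows one way a server-side handler might enforce the contract. Everything here is hypothetical: the in-memory `store` dict, the `GuardrailError` class, and the handler signature are assumptions for illustration, not part of any MCP SDK. A `confirm` flag supplied by the model is only as trustworthy as the model, so in practice the host application would ideally collect the approval from the user itself.

```python
import uuid


class GuardrailError(Exception):
    """Raised when a destructive call arrives without its required brake."""


def delete_account(store: dict, account_id: str,
                   confirm: bool = False, dry_run: bool = False) -> dict:
    # Treat the model's arguments as untrusted input and check them in code.
    try:
        uuid.UUID(account_id)
    except (ValueError, AttributeError, TypeError):
        raise GuardrailError("account_id must be a UUID") from None
    if account_id not in store:
        raise GuardrailError("unknown account")

    # Dry run: report what would happen without touching any data.
    if dry_run:
        return {"deleted": False, "would_delete": account_id}

    # The irreversible path requires confirm to be exactly True.
    if confirm is not True:
        raise GuardrailError("confirm must be true for an irreversible delete")

    del store[account_id]
    return {"deleted": True, "account_id": account_id}
```

Calling the handler with `dry_run=True` first lets the agent, or a human reviewer, see the target of the deletion before anything is removed.
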
Should a delete or payment tool be available to an autonomous agent?
Only behind a brake. A tool that deletes data, moves money, or runs code is the first thing an injected instruction targets, so it needs a confirmation step, tight scope, and ideally a human in the loop for the irreversible case. If it must be autonomous, make it idempotent and reversible and log every call.
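
One way to read "idempotent and reversible and log every call" is a soft delete with an audit trail. The sketch below assumes a hypothetical in-memory `AccountStore`; the class, its method names, and the `tool_audit` logger name are illustrative only, and a real system would persist both the state and the log.

```python
import logging
from datetime import datetime, timezone

logger = logging.getLogger("tool_audit")  # hypothetical logger name


class AccountStore:
    """In-memory stand-in for a real datastore, for illustration only."""

    def __init__(self, account_ids):
        self.accounts = {a: {"deleted_at": None} for a in account_ids}
        self.audit = []

    def _record(self, action: str, account_id: str, changed: bool) -> None:
        # Every call is logged, including repeats that change nothing.
        entry = {
            "action": action,
            "account_id": account_id,
            "changed": changed,
            "at": datetime.now(timezone.utc).isoformat(),
        }
        self.audit.append(entry)
        logger.info("tool_call %s", entry)

    def soft_delete(self, account_id: str) -> bool:
        # Idempotent: a repeated call leaves the state exactly as it was.
        account = self.accounts[account_id]  # KeyError for unknown ids
        changed = account["deleted_at"] is None
        if changed:
            account["deleted_at"] = datetime.now(timezone.utc)
        self._record("soft_delete", account_id, changed)
        return changed

    def restore(self, account_id: str) -> bool:
        # Reversible: the delete only set a marker, so it can be cleared.
        account = self.accounts[account_id]
        changed = account["deleted_at"] is not None
        if changed:
            account["deleted_at"] = None
        self._record("restore", account_id, changed)
        return changed
```

With this shape, a duplicated or injected `soft_delete` call costs a log entry rather than data, and an operator can undo it with `restore`.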

Spotting one failure is easy. Hardening the whole agent is the work.

I review which tools the loop can reach autonomously, how you fence destructive calls behind confirmation, idempotency on the side effects, and the evals that catch a wrong tool call before users do. Book a call, or leave your email.

