Most teams budget for LLM features like they budget for an API call: price per request, multiply by traffic, done. Then they ship an agent — something that calls tools, reads results, and decides what to do next — and the bill arrives three to ten times higher than the spreadsheet said.
The gap is almost always the same thing. It isn't a pricing surprise. It's the shape of the workload.
A single agent request is not a single model call
When an agent uses tools, one user request fans out into a loop:
- The model reads the prompt and decides to call a tool.
- Your code runs the tool and appends the result to the conversation.
- The model reads the whole conversation again — original prompt, its own previous output, and the tool result — and decides what to do next.
- Repeat until it answers.
Three tool-calls isn't one model call. It's four — and crucially, each call re-sends everything that came before it. LLMs are stateless; the only way the model "remembers" step two at step four is that you pay to send step two's tokens again.
The math
Let b be your base prompt size and t the number of tool-calls. Each turn re-sends the prompt plus everything generated so far. Total input tokens for one request land around:
input ≈ b·(t+1) + (output + tool_result)·(t·(t+1)/2)
That second term is the tax. The t·(t+1)/2 is the sum 1 + 2 + … + t — it grows with the square of the tool-calls, not linearly. Double the tool-calls and you roughly quadruple the context you pay for.
A concrete example: a 2,000-token prompt, 800-token answers, 6 tool-calls on a premium model isn't "6× a simple call." It's tens of thousands of input tokens per request, and at frontier-model input prices it can cross six figures a month at modest traffic.
You can plug your own numbers into the AI Agent Cost Calculator — change the tool-call slider and watch the monthly figure move non-linearly. That curve is the whole point.
Why it's easy to miss
- Demos hide it. A demo runs one happy-path request with one tool-call. The quadratic term is invisible until tool-calls climb in production.
- Output tokens look scary, input tokens are the bill. Teams optimize response length. But on agents, re-sent input context usually dominates — often 3–5× the output cost.
- RAG stacks on top. Retrieval injects documents into the base prompt
b, so every one of those re-sends gets heavier too.
Cutting the tax
You rarely need a cheaper model. You need a smaller, smarter loop:
- Prompt caching. Providers will cache a stable prefix (system prompt, tools, retrieved context) so re-sends bill at a fraction of the input price. This is the single biggest lever for agents and it targets exactly the term that's exploding.
- Fewer, fatter tools. Five narrow tools that each need a round-trip cost more turns than one tool that returns what the model actually needs. Every turn you remove is removed from the squared term.
- Trim what you re-send. Summarize or drop stale tool results instead of carrying the full transcript to the final turn. Cap the loop.
- Route by difficulty. Use a small model for the routing/tool-selection turns and the expensive model only for the final synthesis.
- Batch and pre-compute. Embeddings and any deterministic steps don't belong in the hot agent loop.
In practice, caching plus a tighter loop routinely takes a five-figure monthly estimate down by more than half — without changing what the product does.
The takeaway
Agent cost isn't "tokens × requests." It's a loop whose context grows with the square of its tool-calls. Model that curve before you ship, design the loop to keep t small, and cache the prefix that gets re-sent. Get those right and the scary number on the calculator becomes a line item you can defend.
Want a real workload pressure-tested? That's exactly what a production-readiness audit digs into.