
Why agent costs explode: the quadratic context tax

3 min read · AI · LLM · cost · agents

Most teams budget for LLM features like they budget for an API call: price per request, multiply by traffic, done. Then they ship an agent — something that calls tools, reads results, and decides what to do next — and the bill arrives three to ten times higher than the spreadsheet said.

The gap is almost always the same thing. It isn't a pricing surprise. It's the shape of the workload.

A single agent request is not a single model call

When an agent uses tools, one user request fans out into a loop:

  1. The model reads the prompt and decides to call a tool.
  2. Your code runs the tool and appends the result to the conversation.
  3. The model reads the whole conversation again — original prompt, its own previous output, and the tool result — and decides what to do next.
  4. Repeat until it answers.

Three tool-calls isn't one model call. It's four: one for each tool decision plus one for the final answer. Crucially, each call re-sends everything that came before it. LLM APIs are stateless between calls; the only way the model "remembers" step two at step four is that you pay to send step two's tokens again.
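The loop above can be sketched as a token count per model call. This is a minimal illustration rather than a real agent: `per_call_input_tokens` and its parameters are hypothetical names, and it assumes every turn appends a fixed number of output and tool-result tokens.

```python
def per_call_input_tokens(base_prompt, output_per_turn, tool_result_per_turn, tool_calls):
    """Return the input tokens sent on each model call of one agent request.

    Assumes (for illustration only) that every tool-call turn appends a fixed
    number of model-output and tool-result tokens to the conversation.
    """
    sizes = []
    context = base_prompt  # tokens the model must read on the next call
    for _ in range(tool_calls + 1):  # t tool-calls means t + 1 model calls
        sizes.append(context)
        # The model's output and the tool result are appended and re-sent next time.
        context += output_per_turn + tool_result_per_turn
    return sizes


sizes = per_call_input_tokens(base_prompt=2000, output_per_turn=800,
                              tool_result_per_turn=0, tool_calls=3)
print(sizes)       # [2000, 2800, 3600, 4400]
print(sum(sizes))  # 12800
```

Each entry is larger than the one before it, because every call carries the full history of the calls that preceded it.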

The math

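Let b be your base prompt size and t the number of tool-calls, with output and tool_result the tokens the model generates and a tool returns on a typical turn. Each call re-sends the prompt plus everything generated so far, and t tool-calls means t+1 model calls. Total input tokens for one request land around: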

input ≈ b·(t+1)  +  (output + tool_result)·(t·(t+1)/2)

That second term is the tax. The t·(t+1)/2 is the sum 1 + 2 + … + t, so it grows with the square of the tool-calls, not linearly. Double the tool-calls and you roughly quadruple the accumulated context you pay for.
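As a quick sanity check, the closed form can be compared against a plain sum over the t+1 model calls. The function names here are hypothetical, and the inputs (a 2,000-token prompt, 800 tokens of output per turn, tool results ignored) are illustrative only.

```python
def total_input_tokens(b, t, output, tool_result):
    """Closed form: b*(t+1) + (output + tool_result) * t*(t+1)/2."""
    return b * (t + 1) + (output + tool_result) * t * (t + 1) // 2


def total_input_tokens_by_summing(b, t, output, tool_result):
    """Same total, computed by adding up each of the t + 1 model calls."""
    return sum(b + k * (output + tool_result) for k in range(t + 1))


for t in (3, 6, 12):
    closed = total_input_tokens(2000, t, 800, 0)
    assert closed == total_input_tokens_by_summing(2000, t, 800, 0)
    print(t, closed)
# 3 12800
# 6 30800
# 12 88400
```

Going from 6 to 12 tool-calls, the base-prompt term less than doubles (14,000 to 26,000) while the accumulated term goes from 16,800 to 62,400, about 3.7×.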

A concrete example: a 2,000-token prompt, 800-token answers, and 6 tool-calls on a premium model isn't "6× a simple call." Even ignoring tool-result size, the formula gives 2,000·7 + 800·21 = 30,800 input tokens per request, roughly 15× the base prompt. At frontier-model input prices that can cross six figures a month at modest traffic.
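To turn the per-request token count into a monthly figure, multiply by traffic and price. The sketch below uses placeholder values for both: the price per million input tokens and the request volume are assumptions chosen for illustration, not quotes from any provider or real workload, and `monthly_input_cost_usd` is a hypothetical helper.

```python
def monthly_input_cost_usd(tokens_per_request, requests_per_month, usd_per_million_tokens):
    """Monthly input spend for a fixed per-request input token count."""
    return tokens_per_request * requests_per_month * usd_per_million_tokens / 1_000_000


# The example above, ignoring tool-result size: 2,000*7 + 800*21 input tokens.
tokens_per_request = 2000 * 7 + 800 * 21

# Placeholder assumptions, not real prices or traffic figures.
ASSUMED_USD_PER_MILLION_INPUT_TOKENS = 10.0
ASSUMED_REQUESTS_PER_MONTH = 500_000

cost = monthly_input_cost_usd(tokens_per_request,
                              ASSUMED_REQUESTS_PER_MONTH,
                              ASSUMED_USD_PER_MILLION_INPUT_TOKENS)
print(tokens_per_request)  # 30800
print(cost)                # 154000.0
```

Swap in your own price and volume; the point is that the per-request token count, not the per-token price, is what moved.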

You can plug your own numbers into the AI Agent Cost Calculator — change the tool-call slider and watch the monthly figure move non-linearly. That curve is the whole point.

Why it's easy to miss

  • Demos hide it. A demo runs one happy-path request with one tool-call. The quadratic term is invisible until tool-calls climb in production.
  • Output tokens look scary, but input tokens are the bill. Teams optimize response length. On agents, though, re-sent input context usually dominates, often at 3–5× the output cost.
  • RAG stacks on top. Retrieval injects documents into the base prompt b, so every one of those re-sends gets heavier too.

Cutting the tax

You rarely need a cheaper model. You need a smaller, smarter loop:

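  • Prompt caching. Many providers can cache a stable prefix (system prompt, tools, retrieved context) so re-sends bill at a fraction of the input price. Because every call's input begins with the previous call's input, a provider that caches the growing prefix can also discount earlier turns, which is exactly the term that's exploding. This is the single biggest lever for agents.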
  • Fewer, fatter tools. Five narrow tools that each need a round-trip cost more turns than one tool that returns what the model actually needs. Every turn you remove is removed from the squared term.
  • Trim what you re-send. Summarize or drop stale tool results instead of carrying the full transcript to the final turn. Cap the loop.
  • Route by difficulty. Use a small model for the routing/tool-selection turns and the expensive model only for the final synthesis.
  • Batch and pre-compute. Embeddings and any deterministic steps don't belong in the hot agent loop.
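The effect of caching on the formula can be modeled roughly as follows. This is a sketch under stated assumptions, not any provider's billing logic: `billed_input_tokens` is a hypothetical helper, `cached_rate` is an assumed discount multiplier for cache hits, and cache-write surcharges and cache expiry are ignored.

```python
def billed_input_tokens(b, t, per_turn, cached_rate=1.0, cache_transcript=False):
    """Input tokens for one request, weighted by the rate they bill at.

    per_turn is the output + tool_result tokens appended each turn.
    cached_rate is a hypothetical multiplier for cache hits (1.0 = no caching).
    """
    total = 0.0
    for k in range(t + 1):  # model call k re-sends b plus k earlier turns
        if k == 0:
            total += b  # first call: nothing is cached yet
            continue
        if cache_transcript:
            # Everything sent on the previous call is a cache hit; only the
            # newest turn bills at the full rate.
            total += cached_rate * (b + (k - 1) * per_turn) + per_turn
        else:
            # Only the stable base prompt is a cache hit.
            total += cached_rate * b + k * per_turn
    return total


# Assumed inputs: 2,000-token prompt, 6 tool-calls, 800 tokens added per turn,
# cache hits billed at 10% of the normal input rate.
print(round(billed_input_tokens(2000, 6, 800)))                       # 30800
print(round(billed_input_tokens(2000, 6, 800, cached_rate=0.1)))      # 20000
print(round(billed_input_tokens(2000, 6, 800, cached_rate=0.1,
                                cache_transcript=True)))              # 9200
```

Under these assumptions, caching only the base prompt trims the linear term, while caching the growing transcript is what cuts into the quadratic one. Dropping to 3 tool-calls with the same settings lowers the total further, which is the "fewer, fatter tools" lever.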

In practice, caching plus a tighter loop routinely takes a five-figure monthly estimate down by more than half — without changing what the product does.

The takeaway

Agent cost isn't "tokens × requests." It's a loop whose context grows with the square of its tool-calls. Model that curve before you ship, design the loop to keep t small, and cache the prefix that gets re-sent. Get those right and the scary number on the calculator becomes a line item you can defend.

Want a real workload pressure-tested? That's exactly what a production-readiness audit digs into.
