Cutting LLM spend: caching, batching & context reuse

When a team tells me their LLM bill is too high, the fix is almost never "switch to a cheaper model." It's that the same tokens are being paid for over and over. Here are the five levers I reach for first — in order of impact — none of which change what the product does.

1. Prompt caching (the big one)

Providers will cache a stable prefix — system prompt, tool definitions, retrieved documents — and bill re-sends of it at a steep discount. On an agent that loops, the re-sent context is most of the bill, so caching hits exactly the term that's exploding (see the agentic context tax).

The trick is prefix stability: put everything constant at the front of the prompt and everything variable at the back, so the cache key holds across turns. Teams that reorder their prompt for cacheability routinely see the input portion of their bill drop by half or more.

2. Batching

For anything not in a live request path — nightly enrichment, evals, bulk classification, embeddings — use the provider's batch endpoint. You trade latency (results come back within a window) for a meaningful discount. Most teams have more batchable work than they think; it's just been quietly running through the real-time API.

3. Context reuse and trimming

Agents accumulate transcript. By the final turn you're often re-sending tool results the model no longer needs. Two fixes:

Summarize stale turns instead of carrying them verbatim.
Cap the loop — a hard limit on tool-calls both bounds cost and catches runaway agents.

Every turn and every token you stop re-sending comes off the quadratic part of the bill, so trimming pays back more than it looks.

4. Model routing

Not every step needs your best model. Route the cheap, structural turns — intent detection, tool selection, validation — to a small fast model, and spend the frontier model only on the final synthesis. On a multi-step agent this is often a bigger win than any single price difference between providers (how I pick tiers).

5. Kill redundant and deterministic calls

The cheapest token is the one you never send:

Cache deterministic results (embeddings, classifications of identical inputs) in your own store.
Pre-compute anything that doesn't need to be in the hot path.
Deduplicate — agents often re-ask the model something already in context.

Putting it together

These compound. Caching the prefix, routing the loop turns to a small model, and trimming context together routinely take a five-figure monthly estimate down by more than half — with no change the user can perceive. Model the before/after with the cost calculator: set your real workload, then imagine the loop one tool-call shorter and the prefix cached, and watch the curve drop.

The goal isn't a cheaper model. It's not paying for the same tokens twice.

Want this run against your real traffic and turned into a prioritized list? That's a production-readiness audit.