Claude vs GPT-4o vs Gemini: a real cost breakdown

"Which model is cheapest?" is the wrong question. The honest answer is it depends on the workload — and the gaps between models are big enough that picking by sticker price alone routinely costs teams 3–5× more than they need to spend.

Here's how I actually reason about it.

The list prices aren't the comparison

Per-million-token prices span two orders of magnitude — from frontier models at premium rates down to small models that cost cents. But three things make the headline number misleading:

Input vs output split. Most providers charge roughly 4–5× as much per output token as per input token. A workload that's heavy on reading (RAG, long context, agent loops) and light on writing has a totally different effective price than one that generates long documents.
Tokens per task, not price per token. A model that needs two attempts or a longer chain-of-thought to get the answer can be "cheaper per token" and still cost more per finished task.
Caching support. If a model's provider caches a stable prefix cheaply, the re-sent context in an agent loop bills at a fraction of the rate — which can flip the ranking entirely. (More on that in Cutting LLM spend.)
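
To make those three effects concrete, here's a minimal sketch of a dollars-per-finished-task calculation. Everything in it is an assumption for illustration: the Pricing class, the cost_per_task function, and above all the prices, which are made-up placeholders and not any provider's real list prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """USD per million tokens. Placeholder numbers, not real list prices."""
    input_per_m: float
    output_per_m: float
    cached_input_per_m: float  # rate for input tokens read from a cached prefix


def cost_per_task(p: Pricing, input_tokens: int, output_tokens: int,
                  cached_fraction: float = 0.0, attempts: float = 1.0) -> float:
    """Dollars per finished task.

    cached_fraction: share of input tokens billed at the cached rate.
    attempts: average number of calls needed to get one acceptable answer.
    """
    cached = input_tokens * cached_fraction
    fresh = input_tokens - cached
    per_call = (fresh * p.input_per_m
                + cached * p.cached_input_per_m
                + output_tokens * p.output_per_m) / 1_000_000
    return per_call * attempts


# Hypothetical "cheap" model: low sticker price, no cache discount, needs 2 tries.
cheap = Pricing(input_per_m=1.00, output_per_m=4.00, cached_input_per_m=1.00)
# Hypothetical "pricey" model: 3x the sticker price, but cached input at 10%.
pricey = Pricing(input_per_m=3.00, output_per_m=15.00, cached_input_per_m=0.30)

# A read-heavy call: 20,000 input tokens, 500 output tokens.
cheap_cost = cost_per_task(cheap, 20_000, 500, cached_fraction=0.0, attempts=2.0)
pricey_cost = cost_per_task(pricey, 20_000, 500, cached_fraction=0.9, attempts=1.0)

print(f"cheap model:  ${cheap_cost:.4f} per task")   # $0.0440
print(f"pricey model: ${pricey_cost:.4f} per task")  # $0.0189
```

With these invented numbers, the model with the higher sticker price ends up at less than half the cost per finished task, purely from caching and needing one attempt instead of two. Your own ratio will depend entirely on the inputs you plug in.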

A workload-first way to choose

Instead of one ranking, I sort models into three tiers and match them to the job:

Small / fast (e.g. GPT-4o mini, Gemini Flash, Claude Haiku). Cents per million tokens. Perfect for classification, routing, extraction, tool-selection turns, and other high-volume calls. On a routing-heavy agent, putting these on the loop turns and reserving a big model for the final answer is the single biggest cost win.

Mid (e.g. GPT-4o, Gemini Pro, Claude Sonnet). The workhorse tier. Strong reasoning at a price you can run at scale. Most production features should start here and only move up if quality demands it.

Frontier (e.g. Claude Opus). Premium pricing, top-end reasoning. Worth it for genuinely hard synthesis, long-horizon agents, and tasks where a wrong answer is expensive — but ruinous if you route trivial turns through it. This is where agent bills explode (see the agentic context tax).
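
A rough sketch of why tier placement matters on an agent loop. The blended per-million-token prices, the turn counts, and the run_cost helper are all hypothetical, chosen only to show the shape of the arithmetic.

```python
# Hypothetical blended prices in USD per million tokens.
# Placeholders for illustration, not real list prices.
PRICE_PER_M = {"small": 0.50, "mid": 5.00, "frontier": 25.00}


def run_cost(loop_turns: int, tokens_per_loop_turn: int, final_tokens: int,
             loop_tier: str, final_tier: str) -> float:
    """Dollars for one agent run: N loop turns plus one final-answer call."""
    loop = loop_turns * tokens_per_loop_turn * PRICE_PER_M[loop_tier]
    final = final_tokens * PRICE_PER_M[final_tier]
    return (loop + final) / 1_000_000


# Assumed shape: 8 routing/tool-selection turns of 6,000 tokens each,
# then one 10,000-token final synthesis call.
all_frontier = run_cost(8, 6_000, 10_000, loop_tier="frontier", final_tier="frontier")
tiered = run_cost(8, 6_000, 10_000, loop_tier="small", final_tier="frontier")

print(f"frontier everywhere:      ${all_frontier:.3f}")  # $1.450
print(f"small loop, frontier end: ${tiered:.3f}")        # $0.274
```

In this made-up run the final answer still goes through the frontier model, yet the bill drops by roughly 5× because the loop turns, which dominate the token count, moved to the small tier.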

What actually moves your bill

In rough order of impact for a typical agentic SaaS feature:

Which tier runs the loop turns. Demote routing/tool-selection to a small model.
Whether the prefix is cached. Caching attacks the re-sent context directly.
Output length discipline. Cap and structure outputs; output tokens are the expensive ones.
Context hygiene. Don't carry the whole transcript to every turn.
The base model choice. Real, but usually smaller than the four levers above.
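
The context hygiene lever is easy to underestimate, so here's a small sketch of how re-sent context adds up across a loop. The function name, the token counts, and the simple "keep the last N turns" trimming rule are my assumptions for illustration; real agents trim or summarize in many different ways.

```python
from typing import Optional


def billed_input_tokens(turns: int, system_tokens: int,
                        tokens_added_per_turn: int,
                        window: Optional[int] = None) -> int:
    """Total input tokens sent across an agent loop.

    Every turn re-sends the system prompt plus the carried history.
    window: if set, carry only the most recent `window` turns of history;
            if None, carry the whole transcript.
    """
    total = 0
    for turn in range(turns):
        history_turns = turn if window is None else min(turn, window)
        total += system_tokens + history_turns * tokens_added_per_turn
    return total


# Assumed shape: 10 turns, 2,000-token system prompt, 1,500 tokens added per turn.
full = billed_input_tokens(10, 2_000, 1_500)
trimmed = billed_input_tokens(10, 2_000, 1_500, window=2)

print(full)     # 87500
print(trimmed)  # 45500
```

Carrying the whole transcript makes input tokens grow roughly with the square of the turn count, while a fixed window keeps growth linear. In this toy case trimming nearly halves the input bill before caching or model choice even enter the picture.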

Run your own numbers

Rather than trust a generic benchmark, put your real shape — prompt size, output length, tool-calls, volume — into the AI Agent Cost Calculator and switch the model dropdown. You'll often find a mid-tier model with a tighter loop beats a frontier model on both cost and latency, with no quality loss the user can feel.

That's the comparison that matters: not price per token, but dollars per finished task at the quality bar your product needs.

Not sure where your workload lands? A readiness audit is exactly this analysis on your real traffic.