What does a rag chatbot cost to run?
A retrieval-augmented chatbot answering questions over a knowledge base. High volume, retrieval on every request, no agentic tool calls, so cost scales cleanly with traffic and retrieved context. On GPT-4o mini this works out to about $133 a month; here is the figure across every model and what drives it.
assumptions
A planning estimate for this shape of workload. Tune any of it in the calculator.
- 5,000 questions a day
- ~800 prompt tokens plus ~3,000 retrieved tokens per question
- ~500 output tokens per answer
- No tool calls (single retrieval-then-answer turn)
monthly_cost · GPT-4o mini
$133/ month
- Input tokens3.8k/req · agentic context
- $85.50
- Output tokens500/req
- $45.00
- Embeddings (RAG)query embedding per request
- $2.40
3.8k input · 500 output · 1 LLM turn / request
cost_by_model
A rag chatbot across every model
| model | cost / month |
|---|---|
| Gemini 1.5 FlashGoogle (Vertex) | $67.65cheapest |
| GPT-4o miniOpenAI · shown above | $133 |
| Claude HaikuAnthropic | $758 |
| Gemini 1.5 ProGoogle (Vertex) | $1,090 |
| GPT-4oOpenAI | $2,177 |
| Claude SonnetAnthropic | $2,837 |
| Claude OpusAnthropic | $14,177 |
cheapest · public list prices as of 2026-06 · planning estimate, not a quote
what_drives_it
Where the money goes
The retrieved context injected into every request is the biggest line; trimming chunks or raising relevance cuts the bill directly.
The cheapest option here, Gemini 1.5 Flash, comes to about $67.65 a month against $133 on GPT-4o mini. Whether the cheaper model fits is a question for your evaluation set, not the price sheet. The bigger lever is usually the workload itself: caching re-sent context, trimming what each turn carries, and capping the tool loop move the bill more than swapping models does.
faq
Questions & answers
- How much does a rag chatbot cost per month?
- On GPT-4o mini, about $133 a month at 5,000 requests a day with the assumptions below. The cheapest model compared here, Gemini 1.5 Flash, runs about $67.65 for the same workload. Your real figure moves with volume and tokens, so tune it in the calculator.
- What makes a rag chatbot expensive?
- The retrieved context injected into every request is the biggest line; trimming chunks or raising relevance cuts the bill directly.
- Which model is cheapest for a rag chatbot?
- Gemini 1.5 Flash, at about $67.65 a month for this workload. Cheaper is not automatically better: a model that needs retries or longer prompts can cost more in practice, so test the candidates on your own evaluation set before committing.
A cost estimate is a start. Making an agent cheap in production is the work.
Prompt caching, context trimming, and the right model per step usually cut an agent's bill by more than half. Book a call, or leave your email and I'll reach out.
Prefer proof first? See how this plays out in real case studies →