What happens when you exceed the context window?

Either the request is rejected for being too long, or the system silently drops the oldest tokens to make room, so the model loses the earliest part of the conversation or documents. Both are failure modes you design around with summarisation, truncation, or retrieval.

Is a bigger context window always better?

No. A larger window lets you include more, but every token you include is billed on each call and can spread the model's attention thinner, sometimes lowering answer quality. Send the smallest context that answers the question well rather than filling the window because it is there.

Context Window

The context window is the maximum number of tokens a model can consider at once, covering the prompt, any retrieved or conversation history, and the response. Exceed it and the oldest content is dropped or the call fails.

also: context window · context length · token limit

the window holds it allprompt + history + retrieval + outputa budget to spend, not fill

Everything the model sees for a request has to fit in one window: system prompt, conversation so far, retrieved documents, tools and their outputs, and the answer it is about to generate. Windows range from a few thousand tokens to hundreds of thousands or more, but a big window is a budget, not a target. Filling it costs money on every call and can dilute the model's attention, so more context is not automatically better.

Long conversations and agents are where the window bites. Each turn re-sends the history, so a chat or agent loop creeps toward the limit until you summarise, truncate, or retrieve only what is relevant. Hitting the ceiling either drops the earliest messages (the model loses the start) or errors outright, so managing the window is part of designing anything that runs for more than one turn.

free_toolAI Prompt Token & Cost InspectorCount a prompt's tokens with a real tokenizer and price it across Claude, GPT-4o, and Gemini.

related_terms

faq

Questions & answers

What happens when you exceed the context window?: Either the request is rejected for being too long, or the system silently drops the oldest tokens to make room, so the model loses the earliest part of the conversation or documents. Both are failure modes you design around with summarisation, truncation, or retrieval.
Is a bigger context window always better?: No. A larger window lets you include more, but every token you include is billed on each call and can spread the model's attention thinner, sometimes lowering answer quality. Send the smallest context that answers the question well rather than filling the window because it is there.

Want this applied to your stack, not just defined?

The free tools run the numbers; an audit tells you where the real cost and risk are. Book a call, or leave your email and I'll reach out.

Book a call

Prefer proof first? See how this plays out in real case studies →