Context window
The maximum number of tokens an LLM can consider per forward pass. 2026 frontier: 1M+ for some models (Claude 4.7, Gemini 2.5 Pro). Bigger window ≠ better answer — recall degrades inside long contexts.
The context window is the combined input and output budget for a single LLM call, counted in tokens (one token is roughly 0.75 English words). When a spec says "200k context", that figure covers the system prompt, conversation history, any retrieved chunks, and the room for the model's response.
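The budget arithmetic can be sketched in a few lines. This is a rough illustration only: the function names and the example numbers are hypothetical, and the 0.75 words-per-token figure is the rule of thumb quoted above, not a real tokenizer.

```python
import math

# Rule of thumb for English from the text above; a real tokenizer will differ.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int) -> int:
    """Estimate a token count from an English word count (approximation)."""
    return math.ceil(word_count / WORDS_PER_TOKEN)


def remaining_output_budget(
    context_window: int,
    system_tokens: int,
    history_tokens: int,
    retrieved_tokens: int,
) -> int:
    """Tokens left for the model's response once all input parts are counted."""
    used = system_tokens + history_tokens + retrieved_tokens
    # Clamp at zero: an over-full prompt leaves no room for a response.
    return max(context_window - used, 0)


# Hypothetical call: 200k window, 2k system prompt, 50k history, 30k retrieved.
room = remaining_output_budget(200_000, 2_000, 50_000, 30_000)
```

For exact counts, use the tokenizer that matches the model in question; the word-based estimate is only good for back-of-the-envelope planning.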
Bigger windows enable longer documents, deeper agent histories, and "fit-your-whole-codebase" workflows. But recall is non-uniform across the window: most models recall content from the start and end better than content in the middle (the "lost-in-the-middle" effect). Above ~100k tokens, recall of specific facts often drops sharply, though the exact threshold varies by model and task.
Practical answer: RAG is still relevant even at 1M context. Don't replace retrieval with brute-force context-stuffing; combine them.
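One way to combine retrieval with a large window is to rank retrieved chunks and include only as many as fit a fixed token budget, rather than stuffing everything in. The sketch below assumes a retriever has already produced relevance scores and token counts; the `Chunk` type and `pack_chunks` function are hypothetical names for illustration, not part of any library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str
    score: float  # hypothetical relevance score from a retriever
    tokens: int   # token count, assumed to be precomputed


def pack_chunks(chunks: List[Chunk], budget: int) -> List[Chunk]:
    """Greedily keep the highest-scoring chunks that fit within `budget` tokens."""
    selected: List[Chunk] = []
    used = 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        # Skip any chunk that would overflow the budget; smaller ones may still fit.
        if used + chunk.tokens <= budget:
            selected.append(chunk)
            used += chunk.tokens
    return selected


candidates = [
    Chunk("a", 0.9, 400),
    Chunk("b", 0.8, 700),
    Chunk("c", 0.5, 300),
]
picked = pack_chunks(candidates, budget=1000)
```

Greedy packing is a simple heuristic, not an optimal selection. Its appeal here is that the retrieved portion of the prompt stays small and relevant even when the window could technically hold far more.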
See also
- RAG (Retrieval-Augmented Generation): Looking up relevant context from an external store (vector DB, docs, your own corpus) and stuffing it into the LLM prompt before answering. Reduces hallucination, costs less than fine-tuning, but adds a retrieval failure mode of its own.
- MoE (Mixture of Experts): A model architecture where each forward pass activates only a fraction of total parameters via a learned router. Mixtral, DeepSeek-V3, and Llama 4 use it. The result is a bigger total parameter count at a compute cost per token similar to that of a much smaller dense model.