The maximum amount of text a language model can process in a single inference call, measured in tokens — the fundamental memory limit of an LLM.
The context window (or context length) of a Large Language Model is the maximum amount of text the model can process in a single inference call. It is measured in tokens; one token is typically 0.5 to 0.75 of a word in English. Modern models have context windows ranging from 8K to 2M tokens.
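As a back-of-the-envelope check, the words-per-token ratio converts between word counts and token counts. The sketch below assumes 0.75 words per token, the upper end of the range quoted here. The function names are illustrative, and real counts depend on the tokenizer and the text.

```python
import math

# Assumption: upper end of the 0.5-0.75 words-per-token range for English.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(word_count: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Rough number of tokens needed for `word_count` English words."""
    return math.ceil(word_count / words_per_token)


def words_that_fit(context_tokens: int, words_per_token: float = WORDS_PER_TOKEN) -> int:
    """Rough number of English words that fit in `context_tokens` tokens."""
    return math.floor(context_tokens * words_per_token)


print(estimate_tokens(1_000))  # 1334 at 0.75 words per token
print(words_that_fit(8_000))   # 6000
```

Passing `words_per_token=0.5` gives the pessimistic end of the range, which is the safer figure for budgeting.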
The context window is the LLM’s working memory. Everything the model “knows” about the current task (the system prompt, conversation history, retrieved documents, the current question) must fit inside it. If your input is too long, you have to shorten it before the call, typically by truncating it, summarizing it, or retrieving only the relevant parts.
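One common way to make an over-long conversation fit is to keep the system prompt and the current question and drop the oldest history turns. The following is a minimal sketch of that idea, not a prescribed method. `fit_history` and `rough_token_count` are hypothetical names, and the word-based counter is a stand-in for a real tokenizer.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Hypothetical stand-in counter: one token per whitespace-separated word."""
    return len(text.split())


def fit_history(
    system_prompt: str,
    history: List[str],
    question: str,
    context_window: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Return the most recent turns of `history` that still fit in the window."""
    # The system prompt and the current question are always kept.
    budget = context_window - count_tokens(system_prompt) - count_tokens(question)
    if budget < 0:
        raise ValueError("system prompt and question alone exceed the context window")

    kept: List[str] = []
    # Walk backwards so the newest turns are considered first.
    for turn in reversed(history):
        cost = count_tokens(turn)
        if cost > budget:
            break  # this turn and everything older is dropped
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return kept


history = ["one two three", "four five", "six seven eight nine"]
print(fit_history("sys prompt", history, "the question", context_window=10))
# ['four five', 'six seven eight nine']
```

Dropping whole turns keeps each remaining message intact. Summarizing the dropped turns instead is an alternative when older context still matters.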
Context windows have expanded dramatically:
| Era | Typical context (tokens) | Example models |
|---|---|---|
| 2019-2020 | 512-2K | GPT-2, GPT-3 |
| 2022-2023 | 4K-8K | GPT-3.5, Llama 2 |
| Mid-2023 | 32K-100K | GPT-4, Claude 2 |
| 2024 | 200K-1M | Claude 3, Gemini 1.5 |
| Late 2024 | 1M-2M | Gemini 1.5 Pro, Magic |
Each leap has unlocked new use cases — long document analysis, code repositories as context, multi-hour conversation history, agent trajectories.
The Transformer architecture has quadratic attention cost: doubling the context length quadruples the attention compute. This is why context windows didn’t simply grow from 2K to 2M overnight; it took a series of recent advances to make long context practical.
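To make the quadratic scaling concrete, the sketch below counts the query-key pairs in one full self-attention matrix at several context lengths. It is a simplification: it covers a single attention matrix only and ignores the parts of the model whose cost grows linearly. The function name is illustrative.

```python
def attention_score_entries(context_length: int) -> int:
    """Number of query-key pairs in one full self-attention matrix (n * n)."""
    return context_length * context_length


base = 2_000
for n in (2_000, 4_000, 8_000, 2_000_000):
    ratio = attention_score_entries(n) // attention_score_entries(base)
    print(f"{n:>9} tokens -> {ratio:>9}x the attention work of {base} tokens")
```

Going from 2K to 4K tokens multiplies the attention work by 4, and going from 2K to 2M multiplies it by 1,000,000.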
A surprising empirical finding: LLMs don’t use their full context uniformly. They attend well to the beginning and end of the context, but performance degrades for information in the middle. The “Lost in the Middle” paper (Liu et al., 2023) documented this across multiple models.
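Practical implication: place the most important information at the beginning or end of the prompt rather than in the middle.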
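One way to act on this finding when a prompt contains several retrieved documents is to reorder them so the most relevant ones sit at the edges and the weakest land in the middle. The sketch below assumes the input is already sorted from most to least relevant. `reorder_for_edges` is a hypothetical helper, not something taken from the paper.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def reorder_for_edges(ranked: Sequence[T]) -> List[T]:
    """Place the best-ranked items at the start and end, the weakest in the middle.

    `ranked` must be sorted from most to least relevant.
    """
    front: List[T] = []
    back: List[T] = []
    for i, item in enumerate(ranked):
        # Alternate: ranks 1, 3, 5, ... fill the front; ranks 2, 4, 6, ... fill the back.
        if i % 2 == 0:
            front.append(item)
        else:
            back.append(item)
    # Reverse the back half so the second-best item ends up last.
    return front + back[::-1]


docs = ["rank1", "rank2", "rank3", "rank4", "rank5"]
print(reorder_for_edges(docs))
# ['rank1', 'rank3', 'rank5', 'rank4', 'rank2']
```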
Different models use different tokenizers:
- o200k_base (about 750 English words per 1000 tokens).

Tooling:

- tiktoken for OpenAI models.
- count_tokens().
- token_count for the active model.

Always count before sending: exceeding the context window either errors or silently truncates.
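A count-before-sending guard might look like the sketch below. It uses `tiktoken.get_encoding` and `Encoding.encode` when the library is usable, and otherwise falls back to a crude word-based estimate. `make_token_counter`, `ensure_fits`, and `ContextWindowExceeded` are illustrative names, and the fallback ratio of 0.75 words per token is an assumption.

```python
import math
from typing import Callable


def make_token_counter(encoding_name: str = "o200k_base") -> Callable[[str], int]:
    """Return a token counter, preferring tiktoken when it is usable."""
    try:
        import tiktoken  # third-party: pip install tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
        return lambda text: len(encoding.encode(text))
    except Exception:
        # Assumption: if tiktoken is missing or its encoding file cannot be
        # loaded, estimate with ~0.75 words per token instead.
        return lambda text: math.ceil(len(text.split()) / 0.75)


class ContextWindowExceeded(ValueError):
    """Raised when a prompt does not fit in the model's context window."""


def ensure_fits(prompt: str, context_window: int, count_tokens: Callable[[str], int]) -> int:
    """Return the prompt's token count, or raise if it exceeds `context_window`."""
    used = count_tokens(prompt)
    if used > context_window:
        raise ContextWindowExceeded(
            f"prompt uses {used} tokens, window is {context_window}"
        )
    return used
```

Calling `ensure_fits(prompt, 8_000, make_token_counter())` before each request turns a silent truncation into an explicit error. The counter should match the tokenizer of the model actually being called, since counts differ between tokenizers.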
A model’s nominal context window is the maximum, not the optimum. In practice, performance degrades before you hit the limit.
For most production systems, 8K-32K is the sweet spot. Beyond that, RAG usually beats stuffing everything in the prompt.