Context Window

The maximum amount of text a language model can process in a single inference call, measured in tokens — the fundamental memory limit of an LLM.

The context window (or context length) of a Large Language Model is the maximum amount of text the model can process in a single inference call. It is measured in tokens — typically 0.5 to 0.75 of a word in English. Modern models range from 8K to 2M tokens.

Why it matters

The context window is the LLM’s working memory. Everything the model “knows” about the current task — the system prompt, conversation history, retrieved documents, the current question — must fit inside it. If your input is too long, you have to:

How it has grown

Context windows have expanded dramatically:

EraTypical contextExample models
2019-2020512-2KGPT-2, GPT-3
2022-20234K-8KGPT-3.5, Llama 1
2023-mid32K-100KGPT-4, Claude 2, Llama 2
2024200K-1MClaude 3, Gemini 1.5
2024-late1M-2MGemini 1.5 Pro, Magic

Each leap has unlocked new use cases — long document analysis, code repositories as context, multi-hour conversation history, agent trajectories.

Why long context is hard

The Transformer architecture has quadratic attention cost: doubling the context length quadruples the compute. This is why context windows didn’t simply grow from 2K to 2M overnight. Recent advances that made long context practical include:

Lost in the middle

A surprising empirical finding: LLMs don’t use their full context uniformly. They attend well to the beginning and end of the context, but performance degrades for information in the middle. The “Lost in the Middle” paper (Liu et al., 2023) documented this across multiple models.

Practical implications:

Counting tokens

Different models use different tokenizers:

Tooling:

Always count before sending — exceeding the context window either errors or silently truncates.

Effective context

A model’s nominal context window is the maximum, not the optimal. In practice, performance degrades before you hit the limit:

For most production systems, 8K-32K is the sweet spot. Beyond that, RAG usually beats stuffing everything in the prompt.

See also