Context Window

The maximum amount of text a language model can process in a single inference call, measured in tokens — the fundamental memory limit of an LLM.

February 1, 2024 updated November 15, 2024 3 min ai

Also known as: context length, context size, token limit

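The context window (or context length) of a Large Language Model is the maximum amount of text the model can process in a single inference call. It is measured in tokens; one token is typically about 0.75 of an English word, or roughly four characters. The limit usually covers both the prompt and the tokens the model generates. Context windows of current models range from about 8K to 2M tokens.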
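As a rough illustration of the words-to-tokens arithmetic, the sketch below converts a word count into an approximate token count. The default ratio of 0.75 words per token is an assumed rule of thumb for English, and `estimate_tokens` is a hypothetical helper rather than part of any library; a real tokenizer will give different numbers.

```python
import math


def estimate_tokens(word_count: int, words_per_token: float = 0.75) -> int:
    """Rough token estimate from a word count (assumed ratio, English only)."""
    if words_per_token <= 0:
        raise ValueError("words_per_token must be positive")
    # Fewer words per token means more tokens for the same text.
    return math.ceil(word_count / words_per_token)


# A 6,000-word report at the assumed ratio:
print(estimate_tokens(6000))  # 8000
```

An estimate like this is only good for early sizing; for anything near the limit, count with the model's actual tokenizer.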

Why it matters

The context window is the LLM’s working memory. Everything the model “knows” about the current task (the system prompt, conversation history, retrieved documents, the current question) must fit inside it. If your input is too long, you have to do at least one of the following:

Truncate — drop the oldest or least relevant content.
Summarize — compress prior context to fit more.
Chunk — split the task across multiple calls, losing some coherence.
Use RAG — only inject the relevant subset.
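The truncation strategy above can be sketched as follows. This is a minimal illustration, not a reference implementation: `truncate_history` and `word_count` are hypothetical names, and the word-based counter is a stand-in for a real tokenizer so the example runs without dependencies.

```python
from typing import Callable


def truncate_history(
    system_prompt: str,
    messages: list[str],
    budget: int,
    count_tokens: Callable[[str], int],
) -> list[str]:
    """Keep the system prompt plus the newest messages that fit in `budget` tokens."""
    remaining = budget - count_tokens(system_prompt)
    if remaining < 0:
        raise ValueError("system prompt alone exceeds the token budget")
    kept: list[str] = []
    # Walk from newest to oldest, stopping at the first message that does not fit.
    for message in reversed(messages):
        cost = count_tokens(message)
        if cost > remaining:
            break
        kept.append(message)
        remaining -= cost
    kept.reverse()  # restore chronological order
    return [system_prompt] + kept


def word_count(text: str) -> int:
    # Stand-in counter for this example only: one "token" per whitespace-separated word.
    return len(text.split())


history = [
    "first question here",
    "a long first answer with many words",
    "follow up",
    "short answer",
]
print(truncate_history("be brief", history, budget=8, count_tokens=word_count))
# ['be brief', 'follow up', 'short answer']
```

Dropping the oldest turns first is the simplest policy; a summarizing variant would replace the dropped messages with a short recap instead of discarding them.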

How it has grown

Context windows have expanded dramatically:

Era	Typical context (tokens)	Example models
2019-2020	1K-2K	GPT-2, GPT-3
2022-2023	2K-8K	GPT-3.5, Llama 1, Llama 2
Mid-2023	32K-100K	GPT-4 (32K variant), Claude 2
2024	200K-1M	Claude 3, Gemini 1.5
Late 2024	1M-2M and beyond	Gemini 1.5 Pro, Magic

Each leap has unlocked new use cases — long document analysis, code repositories as context, multi-hour conversation history, agent trajectories.

Why long context is hard

Standard self-attention in the Transformer has quadratic cost in sequence length: doubling the context length roughly quadruples the attention compute and the size of the attention matrix. This is why context windows didn’t simply grow from 2K to 2M overnight. Recent advances that made long context practical include:

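Flash Attention: an I/O-aware implementation of exact attention that avoids materializing the full attention matrix, cutting memory use from quadratic to linear in sequence length and speeding up execution; the arithmetic cost remains quadratic.
Sparse / sliding-window attention: attend only to a window of nearby tokens plus a few “global” tokens.
Rotary Position Embeddings (RoPE): a relative position encoding that, combined with scaling methods such as position interpolation or YaRN, lets models be extended to sequences longer than those seen in training.
Ring attention: distribute the attention computation across many GPUs so that no single device has to hold the full sequence.
State Space Models (Mamba, etc.): alternative architectures whose cost grows linearly with sequence length.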
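To make the quadratic scaling concrete, the sketch below counts how many query-key pairs are scored under full attention versus a sliding window. Both function names are hypothetical, the counts ignore heads, layers, and constant factors, and the sliding-window figure is an upper bound that leaves out any global tokens.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Query-key pairs scored by full (non-causal) self-attention."""
    return n_tokens * n_tokens


def sliding_window_pairs(n_tokens: int, window: int) -> int:
    """Upper bound on pairs when each token attends to at most `window` tokens."""
    return n_tokens * min(window, n_tokens)


for n in (2_000, 4_000, 8_000):
    print(n, full_attention_pairs(n), sliding_window_pairs(n, window=1_000))
```

Going from 2,000 to 4,000 tokens takes the full-attention count from 4 million to 16 million pairs, while the windowed count only doubles, from 2 million to 4 million.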

Lost in the middle

A surprising empirical finding: LLMs don’t use their full context uniformly. They attend well to the beginning and end of the context, but performance degrades for information in the middle. The “Lost in the Middle” paper (Liu et al., 2023) documented this across multiple models.

Practical implications:

Place the most important information at the start and end of the prompt.
For multi-document RAG, put the most relevant chunk first, not in the middle.
For long-context summarization, structure the prompt so key facts don’t get buried.
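One way to act on these implications in a RAG pipeline is to reorder retrieved chunks so the strongest ones sit at the edges of the prompt. The sketch below assumes the input list is already ranked best-first by some retriever; `place_at_edges` is a hypothetical helper, and the interleaving scheme is one reasonable choice rather than a method taken from the paper.

```python
def place_at_edges(ranked_docs: list[str]) -> list[str]:
    """Reorder documents ranked best-first so the strongest sit at the
    start and end of the prompt and the weakest land in the middle."""
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(ranked_docs):
        # Alternate: ranks 1, 3, 5... fill the front; ranks 2, 4, 6... fill the back.
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so the second-best document ends up last.
    return front + back[::-1]


print(place_at_edges(["A", "B", "C", "D", "E"]))  # ['A', 'C', 'E', 'D', 'B']
```

Here "A" (most relevant) stays first, "B" (second most relevant) moves to the end, and "E" (least relevant) lands in the middle.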

Counting tokens

Different models use different tokenizers:

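GPT-4 / GPT-4o: GPT-4 uses cl100k_base and GPT-4o uses o200k_base (roughly 1,300 tokens per 1,000 English words, or about 750 words per 1,000 tokens).
Claude 3: a proprietary tokenizer with a roughly similar words-per-token ratio.
Llama 3: a tiktoken-based BPE tokenizer with a 128K-entry vocabulary (Llama 1 and 2 used SentencePiece).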

Tooling:

tiktoken for OpenAI models.
The Anthropic SDK’s token-counting method (messages.count_tokens in recent versions).
Many agent frameworks expose a token-counting helper for the active model.

Always count before sending — exceeding the context window either errors or silently truncates.
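A pre-flight check along these lines might look like the sketch below. It assumes the window is shared between the prompt and the reply, which holds for most chat APIs but should be confirmed for the model in use. `fits_in_context` and `count_tokens_openai` are hypothetical helper names; `tiktoken.get_encoding` and `Encoding.encode` are real tiktoken calls, and the library downloads the encoding file the first time it is used.

```python
def fits_in_context(prompt_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """True if the prompt plus the reserved output budget fits in the window."""
    return prompt_tokens + max_output_tokens <= context_window


def count_tokens_openai(text: str, encoding_name: str = "o200k_base") -> int:
    """Count tokens with tiktoken (third-party package, OpenAI encodings only)."""
    import tiktoken  # imported lazily so the budget check works without it

    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))


# Made-up numbers: 4,096 tokens reserved for the reply, assumed 128,000-token window.
print(fits_in_context(120_000, 4_096, 128_000))  # True
print(fits_in_context(126_000, 4_096, 128_000))  # False
```

Reserving the output budget up front avoids the case where the prompt fits but the reply is cut off partway through.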

Effective context

A model’s nominal context window is the maximum, not the optimal. In practice, performance degrades before you hit the limit:

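Latency: long contexts are slow. A 1M-token prompt can take minutes to process.
Cost: most APIs charge per token for both input and output, so every extra token of context is billed on every call.
Quality: as a rough rule of thumb, many models perform best when the prompt uses about 25-50% of the nominal maximum.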

For many production systems, 8K-32K tokens is a practical sweet spot. Beyond that, RAG usually beats stuffing everything into the prompt.
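The cost side of this trade-off is simple arithmetic. The sketch below compares a stuffed prompt with a retrieval-sized one; `estimate_cost` is a hypothetical helper, and the prices (3 and 15 dollars per million input and output tokens) are placeholders chosen for illustration, not any provider's actual rates.

```python
def estimate_cost(
    input_tokens: int,
    output_tokens: int,
    input_price_per_million: float,
    output_price_per_million: float,
) -> float:
    """Cost of one call in dollars, given per-million-token prices."""
    total = input_tokens * input_price_per_million + output_tokens * output_price_per_million
    return total / 1_000_000


# Placeholder prices, not real rates: $3 / 1M input tokens, $15 / 1M output tokens.
stuffed = estimate_cost(500_000, 1_000, 3, 15)
retrieved = estimate_cost(8_000, 1_000, 3, 15)
print(f"stuffed prompt: ${stuffed:.3f} per call")    # $1.515
print(f"RAG-sized prompt: ${retrieved:.3f} per call")  # $0.039
```

Under these assumed prices the stuffed prompt costs roughly 39 times more per call, before accounting for the added latency.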

See also

Large Language Model

Transformer

Retrieval-Augmented Generation

AI Agent
