Retrieval-Augmented Generation

A technique that grounds LLM outputs in external knowledge by retrieving relevant documents at query time, reducing hallucination and extending the model's knowledge.

Look it up before you answer. RAG places external, current, or private knowledge into the LLM's context so its answers rest on something real.

February 20, 2024 (updated December 1, 2024) | 3 min | ai

Also known as: RAG, retrieval augmented generation

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models by grounding their responses in external, up-to-date information retrieved at query time. It addresses two of the biggest LLM limitations: hallucination and knowledge cutoff.

The core idea

Instead of relying solely on what an LLM learned during training, RAG:

1. Converts a user query into a vector representation (an embedding).
2. Searches a vector database of pre-indexed documents for the most relevant chunks.
3. Inserts those chunks into the LLM’s context window along with the original query.
4. Lets the LLM generate a response grounded in the retrieved evidence.
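
The four steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector database" is a plain list, and `retrieve` and `build_prompt` are hypothetical helper names. The final LLM call is left out, since it depends on the model provider.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Steps 1 and 2: embed the query, then return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

def build_prompt(query: str, chunks: list[str]) -> str:
    """Step 3: place the retrieved chunks in the context next to the query."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"

# Pre-indexed documents (illustrative data, embedded once ahead of time).
CHUNKS = [
    "The Eiffel Tower is located in Paris and was completed in 1889.",
    "FAISS is a library for efficient similarity search over dense vectors.",
    "Bananas are rich in potassium.",
]
INDEX = [(c, embed(c)) for c in CHUNKS]

query = "When was the Eiffel Tower completed?"
top_chunks = retrieve(query, INDEX, k=1)
prompt = build_prompt(query, top_chunks)
# Step 4 would send `prompt` to the LLM of your choice.
```

Swapping `embed` for a real embedding model and `INDEX` for a vector store changes the quality of retrieval, not the shape of the loop.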

This pattern is sometimes summarized as “the LLM is the reasoning engine, the retrieval system is the memory”.

Why it works

RAG improves on a vanilla LLM in several ways:

Reduced hallucination: the model is anchored to retrieved facts, not just its parametric memory.
Up-to-date knowledge: change the retrieval index, change what the system “knows”.
Source attribution: retrieved documents can be cited, letting users verify claims.
Domain specialization: swap the retrieval index to make the same LLM an expert in medicine, law, your company’s docs, etc.
Smaller models can suffice: on some knowledge-intensive tasks, a small model (for example, 7B parameters) with good retrieval can match or outperform a much larger one (for example, 70B) that has none.

Architecture

A typical RAG pipeline has three components:

Indexer: splits documents into chunks (often 200 to 1,000 tokens), embeds each chunk, and stores the vectors in a vector index or database such as FAISS, Pinecone, Weaviate, or pgvector.
Retriever: given a query, returns the top-k most relevant chunks. Dense retrieval embeds the query and ranks chunks by vector similarity (often cosine similarity), sparse retrieval scores keyword overlap with BM25, and hybrid setups combine the two.
Generator: feeds the retrieved chunks plus the query into the LLM with a prompt along the lines of “answer using only the provided context, cite your sources”.
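
As a rough illustration of the indexer and generator ends of the pipeline, the sketch below splits text into overlapping chunks and assembles a source-labelled prompt. It makes several assumptions: whitespace-separated words stand in for tokens, the overlap between chunks is a common practice rather than something the pipeline description requires, and `chunk_words` and `grounded_prompt` are hypothetical names.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of up to `size` words, sharing `overlap` words
    between neighbours. Words approximate tokens here (an assumption)."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

def grounded_prompt(query: str, sources: dict[str, str]) -> str:
    """Build a generator prompt that labels each chunk with a citable source id."""
    context = "\n".join(f"[{sid}] {text}" for sid, text in sources.items())
    return (
        "Answer using only the provided context and cite your sources "
        "by their bracketed ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

# Usage with illustrative data: ten "words", chunks of four, overlap of one.
sample = " ".join(f"w{i}" for i in range(10))
pieces = chunk_words(sample, size=4, overlap=1)
prompt = grounded_prompt(
    "What is w3?",
    {f"doc1#{i}": piece for i, piece in enumerate(pieces)},
)
```

Labelling chunks with stable ids at prompt time is what lets the model's citations be traced back to documents afterwards.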

Advanced patterns

Hybrid search — combine vector similarity with keyword search (BM25) for better recall.
Re-ranking — use a cross-encoder model to re-score the top-k retrieved chunks before passing them to the LLM.
Multi-hop retrieval — break complex queries into sub-questions, retrieve for each, then synthesize.
Self-RAG — let the model decide when to retrieve and critique its own outputs.
Graph RAG — retrieve over a knowledge graph instead of (or in addition to) vector chunks.
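
One simple way to implement the hybrid search pattern is to run the vector and keyword retrievers separately and merge their ranked lists. The sketch below uses reciprocal rank fusion (RRF), which is one common fusion method and not the only option; the document ids are made up, and `k = 60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids.

    Each document scores 1 / (k + rank) per list it appears in (rank starts at 1),
    so documents ranked highly by several retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical outputs of a vector retriever and a BM25 retriever.
vector_ranking = ["doc_a", "doc_b", "doc_c"]
keyword_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

Because RRF only looks at ranks, it avoids having to put cosine similarities and BM25 scores on a shared scale. A cross-encoder re-ranker could then be applied to the fused list.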

When RAG is the wrong choice

RAG isn’t a silver bullet. Consider alternatives when:

The knowledge is small enough to fit in the prompt: put it directly in the context instead of retrieving.
The task requires precise logic over many documents: fine-tuning or agentic workflows may work better.
Retrieval quality is the bottleneck: fix chunking, embeddings, or domain-specific preprocessing before changing anything else.
Hallucination isn’t the main failure mode, for example in creative writing.
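
The first case can be checked mechanically. The sketch below is a rough heuristic, not a rule: it assumes about four characters per token (real tokenizers vary by model and language), and `estimate_tokens`, `should_retrieve`, and the `reserved_tokens` allowance are hypothetical.

```python
import math

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; use the model's own tokenizer when accuracy matters."""
    return math.ceil(len(text) / chars_per_token)

def should_retrieve(documents: list[str], context_budget_tokens: int,
                    reserved_tokens: int = 1000) -> bool:
    """Return True when the corpus will not fit in the prompt.

    `reserved_tokens` leaves room for instructions, the question, and the answer.
    """
    total = sum(estimate_tokens(d) for d in documents)
    return total > context_budget_tokens - reserved_tokens

# Illustrative corpus of about 200 estimated tokens.
docs = ["a" * 400, "b" * 400]
fits_large_window = not should_retrieve(docs, context_budget_tokens=8000)
needs_rag_small_window = should_retrieve(docs, context_budget_tokens=1100)
```

Even when everything fits, cost and latency grow with prompt length, so the budget passed in may reasonably be much smaller than the model's maximum window.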

RAG vs fine-tuning

The classic trade-off:

Aspect	RAG	Fine-tuning
Cost	Lower: maintain an index	Higher: GPU training runs
Update frequency	Near real-time: re-index	Periodic: re-train
Source attribution	Built-in	None: knowledge is baked into the weights
Capacity	Limited by context window	Limited by model size
Best for	Factual, rapidly changing knowledge	Style, format, domain jargon

In practice, strong systems often combine both: fine-tuning for style and domain familiarity, RAG for fresh facts.

See also

Large Language Model

Embedding

Knowledge Base

Context Window
