Retrieval-Augmented Generation

A technique that grounds LLM outputs in external knowledge by retrieving relevant documents at query time, reducing hallucination and extending the model's knowledge.

From: LLM Wiki. URL: llm-wiki.pages.dev/concepts/retrieval-augmented-generation. Created: February 20, 2024. Updated: December 1, 2024.

Retrieval-Augmented Generation (RAG) is a technique that enhances Large Language Models by grounding their responses in external, up-to-date information retrieved at query time. It addresses two of the biggest LLM limitations: hallucination and knowledge cutoff.

The core idea

Instead of relying solely on what an LLM learned during training, RAG:

1. Converts a user query into a vector representation (an embedding).
2. Searches a vector database of pre-indexed documents for the most relevant chunks.
3. Inserts those chunks into the LLM’s context window along with the original query.
4. Lets the LLM generate a response grounded in the retrieved evidence.

This division of labor is sometimes summarized as “the LLM is the reasoning engine, the retrieval system is the memory”.
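The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the in-memory list stands in for a vector database, and `llm` is a hypothetical callable that takes a prompt string and returns text.

```python
import math
import re
from collections import Counter
from typing import Callable


def embed(text: str) -> Counter:
    """Toy embedding (assumption): word counts instead of a learned dense vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Steps 1 and 2: embed the query, return the k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def answer(query: str, index: list[tuple[str, Counter]],
           llm: Callable[[str], str], k: int = 2) -> str:
    """Steps 3 and 4: put the chunks in the prompt and let the LLM respond."""
    chunks = retrieve(query, index, k)
    context = "\n".join(f"- {c}" for c in chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return llm(prompt)


# Pre-index a tiny document collection (done once, ahead of query time).
docs = [
    "The Eiffel Tower is in Paris.",
    "Python is a programming language.",
    "Paris is the capital of France.",
]
index = [(d, embed(d)) for d in docs]
```

Calling `answer("Where is the Eiffel Tower?", index, llm=my_model)` with any prompt-to-text callable would send the two closest chunks plus the question to the model. Swapping `embed` for a real embedding model changes retrieval quality but not the shape of the pipeline.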

Why it works

RAG improves on a vanilla LLM in several ways:

Reduced hallucination — the model is anchored to retrieved facts, not just its parametric memory.
Up-to-date knowledge — change the retrieval index, change what the system “knows”.
Source attribution — retrieved documents can be cited, letting users verify claims.
Domain specialization — swap the retrieval index to make the same LLM an expert in medicine, law, your company’s docs, etc.
Smaller models may suffice — on knowledge-intensive tasks, a 7B model with good retrieval can rival or outperform a 70B model without it, though the gain depends on the task and on retrieval quality.

Architecture

A typical RAG pipeline has three components:

Indexer — splits documents into chunks (often 200-1000 tokens), embeds each chunk, and stores the vectors in a vector index or database such as FAISS, Pinecone, Weaviate, or pgvector.
Retriever — given a query, embeds it and returns the top-k most similar chunks, typically ranked by cosine similarity between embeddings; keyword scoring such as BM25 can replace or complement the vector search (hybrid retrieval).
Generator — feeds the retrieved chunks plus the query into the LLM with a prompt that says “answer using only the provided context, cite your sources”.
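As a rough sketch of the indexer's chunking step and the generator's prompt, the following assumes whitespace-separated words as a stand-in for model tokens and a small overlap between neighboring chunks (overlap is a common practice, not something this article prescribes). The function names and the `(source, text)` pair format are illustrative choices.

```python
def chunk_words(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` words, each sharing `overlap` words with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def build_grounded_prompt(query: str, retrieved: list[tuple[str, str]]) -> str:
    """Format (source, text) pairs so the model can cite sources by label."""
    context = "\n".join(f"[{source}] {text}" for source, text in retrieved)
    return (
        "Answer using only the provided context, and cite your sources "
        "by their bracketed labels.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
```

Labeling each chunk with its source in the prompt is what makes attribution possible downstream: the model can only cite what it can name.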

Advanced patterns

Hybrid search — combine vector similarity with keyword search (BM25) for better recall.
Re-ranking — use a cross-encoder model to re-score the top-k retrieved chunks before passing them to the LLM.
Multi-hop retrieval — break complex queries into sub-questions, retrieve for each, then synthesize.
Self-RAG — let the model decide when to retrieve and critique its own outputs.
Graph RAG — retrieve over a knowledge graph instead of (or in addition to) vector chunks.
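One common way to implement hybrid search is reciprocal rank fusion (RRF), which merges the ranked lists from a vector retriever and a keyword retriever without needing their scores to be comparable. RRF is named here as one option, not as the method this article assumes. The sketch below takes rankings as lists of document IDs, and the constant `k = 60` is a conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores the sum of 1 / (k + rank) over every list it appears in,
    so items ranked well by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


vector_ranking = ["a", "b", "c"]   # hypothetical output of an embedding search
keyword_ranking = ["c", "a", "d"]  # hypothetical output of a BM25 search
fused = reciprocal_rank_fusion([vector_ranking, keyword_ranking])
```

Here `fused` places "a" and "c" ahead of "b" and "d", because both retrievers returned them. A re-ranking stage would typically run on the head of this fused list.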

When RAG is the wrong choice

RAG isn’t a silver bullet. Consider alternatives when:

The knowledge is small enough to fit in the prompt — use in-context learning instead.
The task requires precise logic over many documents — fine-tuning or agentic workflows may work better.
Retrieval quality is the bottleneck — better chunking, better embeddings, or domain-specific preprocessing is the fix.
Hallucination isn’t the main failure mode — for example, in creative writing.
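For the first case, a quick back-of-the-envelope check can tell whether a corpus might fit in the prompt at all. The sketch below is a heuristic under stated assumptions: roughly four characters per token (a loose rule of thumb for English text, which varies by tokenizer and language) and a hypothetical `reserved_tokens` budget for instructions and the answer. A real check would use the target model's tokenizer.

```python
def fits_in_prompt(documents: list[str], context_tokens: int,
                   reserved_tokens: int = 1000,
                   chars_per_token: float = 4.0) -> bool:
    """Estimate whether all documents fit in the context window with room to spare."""
    estimated_tokens = sum(len(d) for d in documents) / chars_per_token
    return estimated_tokens <= context_tokens - reserved_tokens
```

If this returns `True` with a comfortable margin, placing the documents directly in the prompt is usually simpler than building and maintaining a retrieval index.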

RAG vs fine-tuning

The classic trade-off:

	RAG	Fine-tuning
Cost	Lower — just maintain an index	Higher — GPU training runs
Update frequency	Real-time — re-index	Periodic — re-train
Source attribution	Built-in	None — knowledge baked into weights
Capacity	Limited by context window	Limited by model size
Best for	Factual, rapidly-changing knowledge	Style, format, domain jargon