Retrieval-Augmented Generation

A technique that grounds LLM outputs in external knowledge by retrieving relevant documents at query time, reducing hallucination and extending the model's knowledge.

Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by grounding their responses in external, up-to-date information retrieved at query time. It addresses two of the biggest LLM limitations: hallucination and the training knowledge cutoff.

The core idea

Instead of relying solely on what an LLM learned during training, RAG:

  1. Converts a user query into a vector representation (an embedding).
  2. Searches a vector database of pre-indexed documents for the most relevant chunks.
  3. Inserts those chunks into the LLM’s context window along with the original query.
  4. Lets the LLM generate a response grounded in the retrieved evidence.
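The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector database" is an in-memory list, and the sample chunks and function names are hypothetical. The final LLM call (step 4) is deliberately left out.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    q = embed(query)  # step 1: embed the query
    # step 2: rank the pre-indexed chunks by similarity to the query
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(query: str, chunks: list[str]) -> str:
    # step 3: place the retrieved chunks in the context alongside the query
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


# Hypothetical pre-indexed chunks (embedded once, ahead of query time).
chunks = [
    "The warranty covers parts and labor for two years.",
    "Returns are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
]
index = [(c, embed(c)) for c in chunks]

query = "How long is the warranty?"
top = retrieve(query, index, k=1)
prompt = build_prompt(query, top)
# step 4 would send `prompt` to an LLM; that call is omitted in this sketch.
```

A real system would swap `embed` for a learned embedding model and the list for a vector index, but the control flow stays the same.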

This pattern is sometimes summarized as “the LLM is the reasoning engine, the retrieval system is the memory”.

Why it works

RAG improves on a vanilla LLM in several ways:

Architecture

A typical RAG pipeline has three components:

  1. Indexer: splits documents into chunks (often 200 to 1,000 tokens), embeds each chunk, and stores the vectors in a vector index or database such as FAISS, Pinecone, Weaviate, or pgvector.
  2. Retriever: given a query, returns the top-k most relevant chunks, typically by embedding the query and ranking by cosine similarity, by lexical scoring such as BM25, or by a hybrid of the two.
  3. Generator: feeds the retrieved chunks plus the query into the LLM with a prompt along the lines of “answer using only the provided context, cite your sources”.
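Two pieces of this pipeline are easy to make concrete: the indexer's chunking and the generator's prompt. The sketch below rests on assumptions that are not from the text above: it counts whitespace-separated words rather than model tokens, it overlaps adjacent chunks (a common but optional choice), and the names `chunk_tokens` and `grounded_prompt` and the sample source ids are hypothetical.

```python
def chunk_tokens(text: str, size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into windows of `size` whitespace tokens.

    Adjacent windows share `overlap` tokens so a sentence cut at a boundary
    still appears whole in one of the two chunks. (Assumption: whitespace
    tokens approximate model tokens closely enough for illustration.)
    """
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


def grounded_prompt(query: str, retrieved: list[tuple[str, str]]) -> str:
    """Build a prompt from (source_id, chunk_text) pairs with numbered sources."""
    sources = "\n".join(
        f"[{i}] ({source_id}) {text}"
        for i, (source_id, text) in enumerate(retrieved, start=1)
    )
    return (
        "Answer using only the provided context, and cite sources by their "
        "bracketed number.\n\n"
        f"Context:\n{sources}\n\n"
        f"Question: {query}"
    )


# Usage with tiny, made-up inputs.
demo_chunks = chunk_tokens(" ".join(str(i) for i in range(10)), size=4, overlap=1)
demo_prompt = grounded_prompt(
    "What does the warranty cover?",
    [("policy.txt", "The warranty covers parts and labor.")],
)
```

Numbering the sources in the prompt gives the model something stable to cite, which is what makes source attribution checkable after generation.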

Advanced patterns

When RAG is the wrong choice

RAG isn’t a silver bullet. Consider alternatives when:

RAG vs fine-tuning

The classic trade-off:

                    RAG                                   Fine-tuning
Cost                Lower: just maintain an index         Higher: GPU training runs
Update frequency    Real-time: re-index                   Periodic: re-train
Source attribution  Built-in                              None: knowledge baked into weights
Capacity            Limited by context window             Limited by model size
Best for            Factual, rapidly-changing knowledge   Style, format, domain jargon
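The context-window limit on RAG capacity shows up in practice as a token budget for retrieved chunks. The sketch below is one simple way to respect it, under stated assumptions: chunks arrive already ranked best-first, token cost is estimated by counting whitespace-separated words, and `pack_context` is a hypothetical helper name.

```python
def pack_context(ranked_chunks: list[str], budget: int) -> list[str]:
    """Greedily keep ranked chunks whose combined estimated size fits `budget`.

    Chunks are visited best-first; one that does not fit is skipped so that
    a smaller, lower-ranked chunk can still use the remaining space.
    """
    kept: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = len(chunk.split())  # crude token estimate (assumption)
        if used + cost > budget:
            continue
        kept.append(chunk)
        used += cost
    return kept


packed = pack_context(["a b c", "d e f g", "h i"], budget=5)
```

A real pipeline would count tokens with the target model's tokenizer and reserve part of the window for the question and the answer.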

In practice, the strongest systems combine both: fine-tune for style and domain familiarity, RAG for fresh facts.