Transformer

The neural network architecture that revolutionized NLP through self-attention, enabling parallel sequence processing and forming the basis of nearly all modern LLMs.

From: LLM Wiki URL: llm-wiki.pages.dev/concepts/transformer Created: January 10, 2024 Updated: November 5, 2024 Read time: 3 min

The Transformer is a neural network architecture introduced in 2017 by Vaswani et al. in the paper “Attention Is All You Need”. It has become the dominant architecture for natural language processing and is the foundation of nearly all modern Large Language Models (LLMs).

Why it matters

Before Transformers, sequence modeling was dominated by recurrent neural networks (RNNs), including LSTMs, which process tokens one at a time. This made them:

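Slow to train: each step depends on the previous one, so there is no parallelism within a sequence.
Weak on long contexts: long-range dependencies were difficult to learn.
Memory-limited: the whole history is squeezed into a fixed-size hidden state, and gradients struggled to flow across many time steps.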

The Transformer replaced recurrence with self-attention, a mechanism that lets every position in a sequence attend to every other position in parallel. This unlocks massive parallelism on GPUs and dramatically improves training efficiency.

The key mechanism: self-attention

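For each token, the model computes three vectors by multiplying the token's embedding by learned weight matrices: a query (Q), a key (K), and a value (V). Stacking these vectors into matrices, the attention output for all tokens is computed as follows, where d_k is the dimension of the key vectors: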

attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

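In plain language: each token asks “which tokens are relevant to me?”, the model scores every token in the sequence (including the token itself), normalizes the scores with softmax, and uses them to weight a sum of the value vectors. Dividing by sqrt(d_k) keeps the scores from growing with the vector dimension and saturating the softmax. This produces a context-aware representation of each token.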
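The formula above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the toy sizes, the random embeddings `X`, and the projection matrices `W_q`, `W_k`, `W_v` are all assumptions made up for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K: (seq_len, d_k); V: (seq_len, d_v). Returns (output, weights)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq_len, seq_len) relevance scores
    weights = softmax(scores, axis=-1)   # each row is a distribution over tokens
    return weights @ V, weights          # weighted sum of value vectors

# Toy input: 4 tokens with 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = scaled_dot_product_attention(X @ W_q, X @ W_k, X @ W_v)
print(out.shape)              # (4, 8)
print(weights.sum(axis=-1))   # every row of attention weights sums to 1
```

Row i of `weights` holds how much token i attends to each token in the sequence, and row i of `out` is the resulting context-aware representation.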

Multi-head attention runs several attention computations in parallel, each with its own learned projections of Q, K, and V, then concatenates the results and applies a final linear projection. This allows the model to attend to different kinds of relationships simultaneously.
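A rough NumPy sketch of multi-head attention follows. It assumes the common layout in which one large projection per Q, K, and V is split into `num_heads` equal slices and a final output matrix `W_o` mixes the concatenated heads; the function name, shapes, and random weights are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """X: (seq_len, d_model); all weight matrices: (d_model, d_model)."""
    seq_len, d_model = X.shape
    assert d_model % num_heads == 0, "d_model must divide evenly into heads"
    d_head = d_model // num_heads

    def split_heads(M):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return M.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split_heads(X @ W_q), split_heads(X @ W_k), split_heads(X @ W_v)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)   # (heads, seq, seq)
    weights = softmax(scores, axis=-1)
    heads = weights @ V                                    # (heads, seq, d_head)
    # Concatenate the heads back into one vector per token, then mix them.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ W_o

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 5, 16, 4          # hypothetical toy sizes
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) for _ in range(4))

mha_out = multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads)
print(mha_out.shape)   # (5, 16)
```

Each head works in a smaller subspace (here 4 dimensions), so the total cost stays close to that of one full-width attention computation.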

The architecture

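A Transformer encoder block has two sub-layers (decoder blocks in the original design insert a third between them, cross-attention over the encoder output):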

Multi-head self-attention — lets tokens communicate with each other.
Position-wise feed-forward network — a small MLP applied independently to each position.

Each sub-layer is wrapped in a residual connection followed by layer normalization, which makes deep networks trainable. Many modern models apply the normalization before the sub-layer instead (pre-norm).
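The block structure can be sketched as follows. This is a simplified illustration under stated assumptions: single-head attention stands in for multi-head attention, layer normalization omits its learned scale and shift, the feed-forward network uses ReLU, and normalization is applied after each residual addition as in the original paper. The parameter dictionary `params` and all sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Normalize each token's feature vector to zero mean and unit variance.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def self_attention(x, W_q, W_k, W_v):
    Q, K, V = x @ W_q, x @ W_k, x @ W_v
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def feed_forward(x, W1, b1, W2, b2):
    # Position-wise MLP: the same weights are applied to every position.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def transformer_block(x, p):
    # Sub-layer 1: self-attention, residual connection, layer norm.
    x = layer_norm(x + self_attention(x, p["W_q"], p["W_k"], p["W_v"]))
    # Sub-layer 2: feed-forward network, residual connection, layer norm.
    x = layer_norm(x + feed_forward(x, p["W1"], p["b1"], p["W2"], p["b2"]))
    return x

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32               # hypothetical toy sizes
params = {
    "W_q": rng.normal(size=(d_model, d_model)) * 0.1,
    "W_k": rng.normal(size=(d_model, d_model)) * 0.1,
    "W_v": rng.normal(size=(d_model, d_model)) * 0.1,
    "W1": rng.normal(size=(d_model, d_ff)) * 0.1,
    "b1": np.zeros(d_ff),
    "W2": rng.normal(size=(d_ff, d_model)) * 0.1,
    "b2": np.zeros(d_model),
}
x = rng.normal(size=(seq_len, d_model))
y = transformer_block(x, params)
print(y.shape)   # (4, 8)
```

Because the output has the same shape as the input, blocks can be stacked by feeding `y` into another call with a fresh set of parameters.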

The original paper stacked 6 of these blocks for the encoder and 6 for the decoder. Modern LLMs use much deeper stacks, often dozens to more than a hundred layers (GPT-3, for example, has 96).

The encoder-decoder split

The original Transformer had two halves:

Encoder — processes the input sequence bidirectionally; used for tasks like classification and translation.
Decoder — generates the output sequence autoregressively; used for generation.
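In terms of the attention computation, the practical difference between the two halves is a mask: the encoder lets every token see the whole sequence, while the decoder hides future positions so each token can only attend to itself and earlier tokens. The sketch below illustrates this with random queries and keys; the function name and the `causal` flag are assumptions made for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(Q, K, causal=False):
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if causal:
        seq_len = scores.shape[0]
        # True above the diagonal, i.e. wherever a token would look ahead.
        future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)   # exp(-inf) becomes 0
    return softmax(scores, axis=-1)

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))

bidirectional = attention_weights(Q, K)             # encoder-style
causal = attention_weights(Q, K, causal=True)       # decoder-style

np.set_printoptions(precision=2, suppress=True)
print(bidirectional)   # every entry is non-zero
print(causal)          # upper triangle is zero; the first row is [1, 0, 0, 0]
```

The causal mask is what allows a decoder to be trained on all positions of a sequence at once while still generating one token at a time at inference.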

Modern LLMs come in three flavors:

Encoder-only (BERT family) — for understanding tasks.
Decoder-only (GPT family) — for generation. Most modern LLMs.
Encoder-decoder (T5, BART) — for sequence-to-sequence tasks.

Positional encoding

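Self-attention is permutation-equivariant: on its own it treats the input as a set, not a sequence, so shuffling the tokens merely shuffles the outputs. To recover order, the model injects position information. The original paper added fixed sinusoidal encodings to the input embeddings; modern models typically use learned position embeddings or rotary positional embeddings (RoPE), which rotate the query and key vectors inside attention.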
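The fixed sinusoidal scheme from the original paper sets PE(pos, 2i) = sin(pos / 10000^(2i/d_model)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model)). A minimal NumPy sketch is shown below; it assumes an even `d_model`, and the function name and toy sizes are made up for illustration.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a (seq_len, d_model) matrix of fixed encodings (d_model even)."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]    # the values 2i
    angles = positions / np.power(10000.0, even_dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

seq_len, d_model = 10, 16                            # hypothetical toy sizes
pe = sinusoidal_positional_encoding(seq_len, d_model)

# The encodings are added to the token embeddings before the first block.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(seq_len, d_model))
inputs = embeddings + pe
print(pe.shape)    # (10, 16)
```

Each dimension oscillates at a different frequency, so every position receives a distinct pattern and nearby positions receive similar ones.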

Impact

The Transformer has transformed not just NLP but also:

Computer vision: Vision Transformers (ViT) match or beat CNNs when pretrained on enough data.
Speech: Whisper, SeamlessM4T.
Multimodal: CLIP, GPT-4V, Gemini.
Biology: AlphaFold 2, ESM-3 for protein structure and design.
Code: Codex, Code Llama, and LLM-based editors such as Cursor.