The neural network architecture that revolutionized NLP through self-attention, enabling parallel sequence processing and forming the basis of all modern LLMs.
The Transformer is a neural network architecture introduced in 2017 by Vaswani et al. in the paper “Attention Is All You Need”. It has become the dominant architecture for natural language processing and is the foundation of all modern Large Language Models.
Before Transformers, sequence models were dominated by RNNs (Recurrent Neural Networks) and LSTMs, which processed tokens one at a time. This made them:
The Transformer replaced recurrence with self-attention, a mechanism that lets every position in a sequence attend to every other position in parallel. This unlocks massive parallelism on GPUs and dramatically improves training efficiency.
For each token, the model computes three vectors: a query (Q), a key (K), and a value (V). The attention score between two tokens is computed as:
attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
In plain language: each token asks “which other tokens are relevant to me?”, the model scores all other tokens, and uses those scores to weight a sum of their value vectors. This produces a context-aware representation of each token.
Multi-head attention runs several attention computations in parallel and concatenates the results, allowing the model to attend to different kinds of relationships simultaneously.
A Transformer block has two sub-layers:
Each sub-layer is wrapped with residual connections and layer normalization, making deep networks trainable.
The original paper stacked 6 of these blocks for the encoder and 6 for the decoder. Modern LLMs use much deeper stacks — 80 to 200+ layers.
The original Transformer had two halves:
Modern LLMs come in three flavors:
Self-attention is permutation-invariant — it treats input as a set, not a sequence. To recover order, the model adds positional encodings to the input embeddings. The original paper used fixed sinusoidal encodings; modern models typically use learned or rotary positional embeddings (RoPE).
The Transformer has transformed not just NLP but also:
The architecture has proven remarkably general, often outperforming domain-specific approaches by simply scaling up data and parameters.