Transformer

The neural network architecture that revolutionized NLP through self-attention, enabling parallel sequence processing and forming the basis of all modern LLMs.

The Transformer is a neural network architecture introduced in 2017 by Vaswani et al. in the paper “Attention Is All You Need”. It has become the dominant architecture for natural language processing and is the foundation of all modern Large Language Models.

Why it matters

Before Transformers, sequence models were dominated by RNNs (Recurrent Neural Networks) and LSTMs, which processed tokens one at a time. This made them:

The Transformer replaced recurrence with self-attention, a mechanism that lets every position in a sequence attend to every other position in parallel. This unlocks massive parallelism on GPUs and dramatically improves training efficiency.

The key mechanism: self-attention

For each token, the model computes three vectors: a query (Q), a key (K), and a value (V). The attention score between two tokens is computed as:

attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V

In plain language: each token asks “which other tokens are relevant to me?”, the model scores all other tokens, and uses those scores to weight a sum of their value vectors. This produces a context-aware representation of each token.

Multi-head attention runs several attention computations in parallel and concatenates the results, allowing the model to attend to different kinds of relationships simultaneously.

The architecture

A Transformer block has two sub-layers:

  1. Multi-head self-attention — lets tokens communicate with each other.
  2. Position-wise feed-forward network — a small MLP applied independently to each position.

Each sub-layer is wrapped with residual connections and layer normalization, making deep networks trainable.

The original paper stacked 6 of these blocks for the encoder and 6 for the decoder. Modern LLMs use much deeper stacks — 80 to 200+ layers.

The encoder-decoder split

The original Transformer had two halves:

Modern LLMs come in three flavors:

Positional encoding

Self-attention is permutation-invariant — it treats input as a set, not a sequence. To recover order, the model adds positional encodings to the input embeddings. The original paper used fixed sinusoidal encodings; modern models typically use learned or rotary positional embeddings (RoPE).

Impact

The Transformer has transformed not just NLP but also:

The architecture has proven remarkably general, often outperforming domain-specific approaches by simply scaling up data and parameters.

See also