Transformer Architecture Deep Dive: Encoder-Decoder, Multi-Head Attention, Positional Encoding Explained

Last time we covered how attention mechanisms work. Today we will look at the bigger picture: how attention mechanisms assemble into the Transformer, the architecture that powers virtually every modern large language model. By the end of this post, you will understand not only the structural components, but also the design reasoning behind each choice and how the variants differ in practice.

1. The Birth of Transformer

In 2017, Google published the landmark paper Attention Is All You Need (Vaswani et al.). The core thesis was bold:

Completely abandon RNNs and build sequence models using only attention mechanisms.

At the time, the state of the art for machine translation and language modeling relied on recurrent architectures -- LSTMs and GRUs -- which processed tokens sequentially. This sequential nature created a fundamental bottleneck: tokens could not be processed in parallel, and long-range information had to propagate through many time steps, leading to vanishing gradients.

Transformer solved both problems simultaneously. By replacing recurrence with self-attention, every token in a sequence could interact with every other token in a single layer. The result was a model that trained dramatically faster and captured dependencies across much longer distances. This single paper became the architectural foundation for GPT, BERT, LLaMA, Claude, Gemini, and virtually every large language model that followed.

2. Overall Architecture: Encoder-Decoder

The original Transformer was designed for sequence-to-sequence tasks like machine translation. Its high-level structure is an Encoder-Decoder:

Input sequence       Output sequence
      |                     |
      v                     |
 +---------+           +---------+
 | Encoder |---ctx---->| Decoder |
 +---------+           +---------+

  • Encoder: Reads the entire input sequence and generates a rich set of contextual representations (often called context vectors).
  • Decoder: Uses those context vectors to generate the target sequence one token at a time, attending to previously generated tokens.

Architectural Variants

Not every downstream task needs both halves. Three major variants emerged:

  • Encoder-only (BERT): Uses only the encoder stack. Excellent for understanding tasks like classification, named entity recognition, and semantic search, where you need rich representations of the input but do not need to generate new text.
  • Decoder-only (GPT, Claude, LLaMA): Uses only the decoder stack with causal masking; with no encoder to attend to, the cross-attention sub-layer is dropped. Ideal for text generation, dialogue, and any autoregressive task where you produce output token by token.
  • Full Encoder-Decoder (T5, BART): Retains both stacks. Still preferred for classic sequence-to-sequence tasks like translation, summarization, and question answering over structured data.

3. Encoder in Detail

The encoder consists of N identical layers (6 in the original paper). Each layer has two sub-components:

  1. Multi-Head Self-Attention + Residual Connection + LayerNorm
  2. Feed-Forward Network (FFN) + Residual Connection + LayerNorm

Let us examine each sub-component carefully.

Multi-Head Self-Attention

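Self-attention allows every token in the input to attend to every other token. The mechanism computes three vectors for each token through learned linear projections: Query (Q), Key (K), and Value (V). The raw score between token i and token j is the dot product of Qi with Kj, divided by √dk (where dk is the key dimension) so that scores do not grow with vector size. A softmax over all j turns the scores for token i into attention weights that sum to 1, and the output for token i is the sum of all V vectors weighted by them.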
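To make the computation concrete, here is a minimal sketch of scaled dot-product attention in plain Python. The helper names (softmax, attention) and the toy Q, K, V values are illustrative assumptions, and the learned projections that would normally produce Q, K, and V are left out.

```python
import math

def softmax(xs):
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention over lists of row vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Score this query against every key, divided by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # The output is the attention-weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy example: 3 tokens with d_k = d_v = 2 (arbitrary values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = attention(Q, K, V)
```

Each row of attn is one token's attention distribution over all tokens, and each row of out is the corresponding blend of value vectors.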

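Multi-head attention runs multiple self-attention operations in parallel (8 heads in the original paper), each with its own learned projection matrices. The head outputs are concatenated and passed through a final linear projection. Different heads tend to pick up different kinds of relationships: in trained models some appear to track syntactic structure, others semantic similarity, others positional proximity, although this division emerges from training rather than being designed in.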
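The project, attend, concatenate, project data flow can be sketched as follows. This is a toy illustration, not a reference implementation: the weights are random rather than learned, and the names (multi_head_attention, rand_matrix, heads, W_o) and sizes are assumptions made for the example.

```python
import math
import random

def matmul(A, B):
    # Plain-Python matrix product for lists of rows.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    out = []
    for q in Q:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K])
        out.append([sum(wj * v[c] for wj, v in zip(w, V)) for c in range(len(V[0]))])
    return out

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    # Each head projects X with its own (W_q, W_k, W_v) and attends independently.
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the head outputs along the feature dimension, token by token.
    concat = [sum((h[t] for h in head_outputs), []) for t in range(len(X))]
    # A final projection mixes information across heads.
    return matmul(concat, W_o)

rng = random.Random(0)
d_model, n_heads = 8, 2
d_head = d_model // n_heads  # each head works in a smaller subspace
heads = [tuple(rand_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(n_heads)]
W_o = rand_matrix(d_model, d_model, rng)
X = rand_matrix(4, d_model, rng)  # 4 tokens
Y = multi_head_attention(X, heads, W_o)
```

Because each head works in d_model / n_heads dimensions, the total cost stays close to that of a single full-width attention. With the original paper's numbers, 512 / 8 gives 64 dimensions per head.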

Residual Connections and Layer Normalization

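Each sub-component is wrapped with a residual connection: the input of the sub-layer is added to its output. This gives gradients a direct path through the network and greatly mitigates the vanishing gradient problem, even in very deep stacks. Layer normalization then stabilizes training by normalizing each token's activations across the feature dimension to zero mean and unit variance, followed by a learned scale and shift. The original paper applies it after the residual addition (post-LN); many later models apply it before the sub-layer instead (pre-LN).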
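A minimal sketch of the post-LN "Add and Norm" step, assuming the sub-layer is any function from a vector to a vector of the same size. The function names are illustrative, and the learned scale and shift default to 1 and 0.

```python
import math

def layer_norm(x, gamma=None, beta=None, eps=1e-5):
    # Normalize one token's features to zero mean and unit variance,
    # then apply the (normally learned) scale gamma and shift beta.
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    normed = [(v - mean) / math.sqrt(var + eps) for v in x]
    gamma = gamma or [1.0] * len(x)
    beta = beta or [0.0] * len(x)
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]

def add_and_norm(x, sublayer):
    # Post-LN ordering, as in the original paper: LayerNorm(x + Sublayer(x)).
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

x = [1.0, 2.0, 3.0, 4.0]
# Toy stand-in for attention or the FFN: halve every feature.
y = add_and_norm(x, lambda v: [0.5 * t for t in v])
```

A pre-LN variant would instead compute x + Sublayer(LayerNorm(x)), which leaves the residual path completely untouched by normalization.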

Feed-Forward Network

The FFN is a two-layer fully connected network with a ReLU activation applied independently to each token's representation. It operates on each position separately and identically, transforming the attention output into a richer representation. The FFN's inner dimension is typically 4× the model dimension (2048 when dmodel = 512, for instance).
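The "separately and identically" property is easy to see in code. The sketch below uses small random weights and toy sizes (d_model = 4, d_ff = 16) purely for illustration. The same ffn function is applied to every token, so identical inputs give identical outputs wherever they sit in the sequence.

```python
import random

def linear(x, W, b):
    # W has shape (len(x), len(b)); returns x @ W + b.
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j]
            for j in range(len(b))]

def ffn(x, W1, b1, W2, b2):
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)

rng = random.Random(0)
d_model, d_ff = 4, 16  # inner dimension is 4x the model dimension
W1 = [[rng.uniform(-1, 1) for _ in range(d_ff)] for _ in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[rng.uniform(-1, 1) for _ in range(d_model)] for _ in range(d_ff)]
b2 = [0.0] * d_model

# Three tokens; the first and third have the same representation.
X = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.3, 0.4, 0.5], [1.0, 0.0, -1.0, 0.5]]
Y = [ffn(x, W1, b1, W2, b2) for x in X]  # applied per position, shared weights
```

Tokens never interact inside the FFN; all mixing between positions happens in the attention sub-layer.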

4. Decoder in Detail

The decoder also has N layers, but each layer contains three sub-components instead of two:

  1. Masked Multi-Head Self-Attention
  2. Cross-Attention (Encoder-Decoder Attention)
  3. Feed-Forward Network

Masked Self-Attention

During training, the decoder sees the entire target sequence at once, so it must be prevented from peeking at future tokens. Masked attention sets the attention scores for future positions to negative infinity before softmax, forcing each position to attend only to past positions and itself. This causal mask is what lets GPT-style models be trained as autoregressive predictors: each position predicts the next token using only what comes before it, which matches left-to-right generation at inference time.
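A small sketch of the masking step, assuming a square matrix of raw attention scores (all zeros here, so the effect of the mask is easy to read off). The function name causal_softmax is illustrative.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions get -inf, which softmax turns into exactly 0.
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)  # always finite, because position i can see itself
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0] * 4 for _ in range(4)]  # 4 tokens, uniform raw scores
weights = causal_softmax(scores)
```

With uniform scores, the first token attends only to itself, the second splits its attention evenly over two positions, and so on, giving a lower-triangular weight matrix.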

Cross-Attention

This is where the decoder connects to the encoder. Cross-attention uses the decoder's current representation as the Query, while Key and Value come from the encoder's output. This mechanism lets the decoder selectively focus on relevant parts of the input sequence at each generation step. In translation, for example, cross-attention learns soft alignment between source and target words.
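The only structural difference from self-attention is where Q, K, and V come from. In the hedged sketch below, the learned projections are omitted and the vectors are toy values, so the encoder outputs serve directly as both keys and values.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(decoder_states, encoder_outputs):
    """Queries come from the decoder; keys and values from the encoder."""
    d_k = len(encoder_outputs[0])
    outputs, weights = [], []
    for q in decoder_states:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k)
                  for k in encoder_outputs]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wj * v[c] for wj, v in zip(w, encoder_outputs))
                        for c in range(d_k)])
    return outputs, weights

encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]  # 4 source tokens
decoder_states = [[2.0, 0.0], [0.0, 2.0]]                           # 2 target positions
ctx, align = cross_attention(decoder_states, encoder_outputs)
```

The weight matrix align has one row per target position and one column per source token, which is why it can be read as a soft alignment between the two sequences.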

Autoregressive Generation

Because each position depends only on earlier positions, the decoder can generate sequences in a loop: it takes the previously generated tokens as input, produces a probability distribution over the next token via a final linear + softmax layer, picks or samples a token, appends it to the sequence, and repeats. This is exactly how GPT produces coherent paragraphs one token at a time.
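The loop itself is short. In this sketch, toy_logits is a hypothetical stand-in for the whole decoder stack plus the final linear layer (it simply favors the token after the last one), and greedy selection is just one possible decoding strategy.

```python
import math

VOCAB = ["<bos>", "the", "cat", "sat", "<eos>"]

def toy_logits(prefix_ids):
    # Hypothetical model: puts a high logit on the token following the last one.
    target = min(prefix_ids[-1] + 1, len(VOCAB) - 1)
    return [5.0 if i == target else 0.0 for i in range(len(VOCAB))]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_decode(logits_fn, bos_id, eos_id, max_len=10):
    ids = [bos_id]
    while len(ids) < max_len:
        probs = softmax(logits_fn(ids))           # distribution over the next token
        next_id = max(range(len(probs)), key=probs.__getitem__)  # greedy pick
        ids.append(next_id)                       # feed it back in on the next step
        if next_id == eos_id:
            break
    return ids

generated = greedy_decode(toy_logits, bos_id=0, eos_id=4)
tokens = [VOCAB[i] for i in generated]
```

Real systems usually sample from the distribution (with temperature, top-k, or top-p) instead of always taking the most likely token, but the feed-back loop is the same.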

5. Positional Encoding

A critical insight: pure self-attention is permutation-equivariant, meaning it treats the input as an unordered set of tokens. Without positional information, shuffling "the cat sat on the mat" into "mat the on sat cat the" would produce exactly the same per-token outputs, merely shuffled in the same way, so the model could not tell the two sentences apart. The Transformer injects positional information via positional encoding.
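Equivariance can be checked numerically: permuting the input rows of unmasked self-attention permutes the output rows in the same way and changes nothing else. The sketch below assumes identity projections (Q = K = V = X) and toy values to keep it short.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X):
    # Simplification: identity projections, so Q = K = V = X.
    d_k = len(X[0])
    out = []
    for q in X:
        w = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in X])
        out.append([sum(wj * v[c] for wj, v in zip(w, X)) for c in range(d_k)])
    return out

X = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]  # three toy token vectors
perm = [2, 0, 1]                          # an arbitrary reordering
X_perm = [X[p] for p in perm]

Y = self_attention(X)
Y_perm = self_attention(X_perm)

# Each output row follows its token to the new position (up to rounding).
max_diff = max(abs(a - b)
               for i, p in enumerate(perm)
               for a, b in zip(Y_perm[i], Y[p]))
```

Here max_diff is zero up to floating-point rounding, which is why order information has to be supplied separately.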

The original paper used deterministic sine and cosine functions of varying frequencies:

PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

These encodings are added directly to the input embeddings before the first layer. Each dimension corresponds to a sinusoid at a different frequency, creating a unique positional fingerprint for every position.
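The two formulas translate almost directly into code. In this sketch the loop variable i steps over even dimension indices, so it plays the role of 2i in the formulas; the function name and the small sizes are illustrative.

```python
import math

def positional_encoding(max_len, d_model):
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):       # i is the even index "2i"
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)      # even dimensions use sine
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd dimensions use cosine
    return pe

pe = positional_encoding(max_len=50, d_model=8)

# Adding the encoding to a toy embedding for the token at position 3.
embedding = [0.1] * 8
x = [e + p for e, p in zip(embedding, pe[3])]
```

Position 0 always encodes to alternating 0 and 1 (sin 0 and cos 0), and the later dimensions vary more and more slowly as the position increases.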

Modern Alternatives

Many newer models replace sinusoidal encoding with learned positional embeddings (GPT, BERT) or more advanced schemes like Rotary Position Embedding (RoPE, used in LLaMA), ALiBi (used in BLOOM), or Relative Position Bias (used in T5). RoPE in particular has been widely adopted because it encodes relative distances directly in the attention dot product and can be extended to longer contexts with modest fine-tuning.
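The relative-distance property of RoPE can be demonstrated in a few lines. This is a simplified sketch that rotates adjacent pairs of dimensions (one common convention; real implementations differ in layout and are vectorized), with toy query and key vectors chosen arbitrarily.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each adjacent pair of dimensions by an angle proportional to pos."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos / (base ** (i / d))
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same relative offset (2), very different absolute positions.
score_near = dot(rope(q, 3), rope(k, 5))
score_far = dot(rope(q, 10), rope(k, 12))
```

The two scores agree up to rounding, because rotating q by position m and k by position n leaves a dot product that depends only on n - m.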

6. Complete Architecture Diagram

graph TB
subgraph Encoder
E1["Input Embedding + Positional Encoding"]
E2["Multi-Head Self-Attention"]
E3["Add and Norm"]
E4["Feed-Forward Network"]
E5["Add and Norm"]
E1 --> E2 --> E3 --> E4 --> E5
end
subgraph Decoder
D1["Output Embedding + Positional Encoding"]
D2["Masked Multi-Head Self-Attention"]
D3["Add and Norm"]
D4["Multi-Head Cross-Attention"]
D5["Add and Norm"]
D6["Feed-Forward Network"]
D7["Add and Norm"]
D1 --> D2 --> D3 --> D4 --> D5 --> D6 --> D7
end
subgraph Output
O1["Linear"]
O2["Softmax"]
O1 --> O2
end
E5 -->|"K, V"| D4
D7 --> O1

Notice how the encoder's output is fed as K and V into the decoder's cross-attention block. This is the bridge that connects understanding (encoding) to generation (decoding).

7. The Three Variants Compared

Variant           Models                   Use Case                         Parameters
Encoder-only      BERT, RoBERTa            Classification, NER, embedding   ~110M–340M
Decoder-only      GPT-4, Claude, LLaMA 3   Generation, dialogue, coding     8B–70B+ (LLaMA 3; GPT-4 and Claude sizes are undisclosed)
Encoder-Decoder   T5, FLAN-T5, BART        Translation, summarization       ~60M–11B

Each variant optimizes for different computational patterns. Decoder-only models dominate today because they scale well for generation and can be adapted to understanding tasks through instruction tuning, blurring the original distinction.
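Parameter counts like these can be roughly reproduced from the architecture alone. The sketch below is a back-of-the-envelope estimate, not an exact count: it assumes an FFN inner dimension of 4 × d_model, ignores biases and LayerNorm weights, and plugs in the published BERT-base configuration (12 layers, d_model = 768, 30,522-token vocabulary, 512 positions).

```python
def approx_params(n_layers, d_model, vocab_size, max_positions):
    # Attention: Q, K, V, and output projections, each d_model x d_model.
    attention = 4 * d_model * d_model
    # FFN: d_model -> 4*d_model -> d_model, i.e. two matrices of 4 * d_model^2.
    ffn = 2 * d_model * (4 * d_model)
    per_layer = attention + ffn              # about 12 * d_model^2
    embeddings = (vocab_size + max_positions) * d_model
    return n_layers * per_layer + embeddings

bert_base = approx_params(n_layers=12, d_model=768,
                          vocab_size=30522, max_positions=512)
print(f"{bert_base / 1e6:.1f}M parameters (approximate)")
```

The estimate lands at roughly 109M, close to the ~110M usually quoted for BERT-base, and shows that each layer costs about 12 × d_model² parameters.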

8. Why Transformer Is So Successful

  • Parallel Training: Unlike RNNs, all tokens are processed simultaneously during training, fully exploiting GPU parallelism.
  • Long-Range Dependencies: In self-attention, any two positions are connected within a single layer (path length O(1)), whereas an RNN needs O(n) sequential steps, so distant tokens can influence each other directly.
  • Scalability: Performance improves predictably with larger models, more data, and more compute. This empirical pattern, known as scaling laws, underpins modern LLM development.
  • Versatility: The same architecture handles translation, generation, search, coding, vision (ViT), speech, and more.
  • Inductive Bias: Transformer makes minimal assumptions about data structure, letting it learn task-specific patterns purely from data.

9. Understanding Transformer in Practice: A Mental Model

One of the most helpful ways to understand the Transformer is to think of it as a sophisticated information routing system. Imagine you are at a large conference with thousands of attendees (tokens). In an RNN, information passes down a line from one person to the next, one at a time. In a Transformer, every person simultaneously announces what topics they can speak to (Key), states what they are looking for (Query), and holds the actual content they would share (Value). The attention mechanism determines how much each person listens to every other person based on how well their own Query matches the other person's Key, and what they hear is a weighted blend of the Values.

The "multi-head" part means this conference happens multiple times in parallel, each with different discussion topics. Some groups focus on grammar, others on meaning, others on context. The combination of all these parallel discussions produces a remarkably rich understanding of the entire room.

10. Common Misconceptions About Transformer

"Attention is all you need" means positional encoding is optional. Absolutely not. Without positional encoding, the model has no way to distinguish word order. The sinusoidal or learned positional embeddings are essential, not optional.

"Self-attention is quadratic, so it doesn't scale." The cost of standard self-attention is indeed O(n²) in sequence length, but several techniques make this manageable for practical sequence lengths. FlashAttention keeps the computation exact while avoiding storage of the full n × n score matrix in slow GPU memory, and sparse attention patterns reduce the number of token pairs that are compared at all. Research into linear attention mechanisms is also actively addressing this limitation.
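A quick calculation shows why the quadratic term matters. The sketch below assumes the score matrix is fully materialized with 2 bytes per value (16-bit floats) and 8 heads; these numbers are illustrative and describe one layer and one sequence only.

```python
def score_matrix_bytes(seq_len, n_heads, bytes_per_value=2):
    # One seq_len x seq_len matrix of attention scores per head.
    return seq_len * seq_len * n_heads * bytes_per_value

for n in (1024, 4096, 16384):
    mib = score_matrix_bytes(n, n_heads=8) / (1024 ** 2)
    print(f"{n:>6} tokens -> {mib:>8.0f} MiB of attention scores per layer")
```

Every 4× increase in sequence length multiplies this memory by 16, which is the cost that techniques avoiding the full matrix are designed to sidestep.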

"More attention heads always mean better performance." Not necessarily. Beyond a certain point, additional heads yield diminishing returns while increasing computational cost. The optimal number depends on the model size and task complexity. The original Transformer used 8 heads, while GPT-3 uses 96 heads -- but the scaling follows careful empirical tuning.

"Transformer can only process text." The architecture has been successfully adapted to images (Vision Transformer), audio (Whisper), video (ViViT), protein structures (AlphaFold 2), and even tabular data. The core idea of self-attention over sequences generalizes remarkably well across modalities once the input is expressed as a sequence of tokens, such as image patches or audio frames.

11. The Building Blocks Summarized

Transformer = Attention + Positional Encoding + Residual Connections + Layer Normalization + Feed-Forward Networks.

Every modern LLM is a composition of these five ingredients, stacked dozens of times, trained on trillions of tokens. Understanding these components gives you the foundation to grasp how models like GPT-4 and Claude work under the hood.

Part 2 of the From Attention to LLM series. Next: we will explore how pre-training and fine-tuning adapt the Transformer to real-world tasks.