Transformer Anatomy: Attention + FFN Demystified
A deep dive into the Transformer architecture — how attention connects tokens and why the Feed-Forward Network is the real brain of the model. Plus the key to understanding Mixture of Experts (MoE).
Table of Contents
- The Transformer Block: Big Picture
- Self-Attention: The Model’s Eyes 👀
- The Intuition
- The Q/K/V Mechanism
- Multi-Head: Looking from Multiple Angles
- What Attention Does NOT Do
- Feed-Forward Network: The Model’s Brain 🧠
- The Intuition
- The Architecture: Expansion → Activation → Compression
- Why Expand Then Compress?
- The FFN as Associative Memory
- Gated FFN: The Modern Version
- The Duo in Action: Layer by Layer
- Attention vs FFN: The Numbers
- Why Bigger Models Are Better (and Why That’s a Problem)
- More Parameters = More Knowledge
- The Scaling Problem
- And That’s Why MoE Replaces the FFN
- The Observation
- The MoE Idea
- The Company Analogy
- Specialization Emerges Naturally
- Why Attention Stays Shared (and Not the Other Way Around)
- Practical Implications for Engineers
- Fine-Tuning with LoRA
- Pruning and Distillation
- Interpretability
- Summary
When people talk about Transformers, everyone cites attention. “Attention is All You Need,” the 2017 paper, cemented the idea. But in reality, attention is only half the story — and honestly, not the more interesting half.
The real brain of the Transformer is the Feed-Forward Network (FFN). And understanding the Attention + FFN duo is understanding why these models work — and why modern architectures like Mixture of Experts (MoE) make the design choices they do.
This article takes apart a Transformer block piece by piece. No intimidating formulas, no unnecessary jargon. If you know what an LLM is, you’ll follow everything.
The Transformer Block: Big Picture
A model like Llama 3 70B stacks 80 identical blocks. Each block does the same thing:
Input token (8192-dim vector)
│
▼
┌─────────────────────┐
│ Self-Attention │ ← "Who is related to what?"
│ (Multi-Head) │
└─────────┬───────────┘
│ + residual
▼
┌─────────────────────┐
│ Feed-Forward │ ← "What does it mean?"
│ Network (FFN) │
└─────────┬───────────┘
│ + residual
▼
Output token (8192-dim vector)
That’s it. 80 times. Each pass enriches the token’s representation a little more, layer by layer, until the model has enough understanding to predict the next token.
But what do these two components actually do?
Self-Attention: The Model’s Eyes 👀
The Intuition
Take this sentence:
“The developer who uses vLLM deployed his model on AWS”
To understand this sentence, you need to resolve relationships:
- “his” → refers to “the developer” (not vLLM, not AWS)
- “who uses vLLM” → qualifies “the developer”
- “his model” → the developer’s model, deployed on AWS
That’s exactly what attention does: for each token, it looks at all other tokens and computes how relevant each one is. It’s relational pattern matching.
The Q/K/V Mechanism
Under the hood, each token is transformed into three vectors:
- Query (Q) — “What am I looking for?”
- Key (K) — “What do I contain?”
- Value (V) — “If you find me relevant, here’s my content”
Token "his"
Q = "I'm looking for the possessor of this pronoun"
Token "developer"
K = "I'm a human subject, agent of an action"
V = [rich representation of the concept "developer"]
Score("his" → "developer") = Q_his · K_developer = HIGH ✓
Score("his" → "AWS") = Q_his · K_AWS = LOW ✗
The attention score is a simple dot product between Q and K, scaled by the square root of the dimension to keep values stable. The more two vectors point in the same direction, the higher the score. After a softmax normalizes the scores into weights that sum to 1, each token gets a weighted average of the Values from all other tokens.
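The whole computation fits in a few lines of NumPy. Here is a toy single-head sketch with random vectors, just to show the shapes and the softmax-weighted average (names and sizes are illustrative, not from any real model):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over a toy sequence.

    Q, K, V: (seq_len, d) arrays. Returns one new vector per token:
    a weighted average of all Values, weighted by softmax(QK^T / sqrt(d)).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (seq, seq): every token scores every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                              # weighted average of Values

rng = np.random.default_rng(0)
seq, d = 5, 16                                      # 5 tokens, 16 dims (toy sizes)
Q, K, V = (rng.standard_normal((seq, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)                                    # (5, 16): same shape in, same shape out
```

Note that the output has the same shape as the input: attention remixes information across tokens, it doesn't change the representation space.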
Multi-Head: Looking from Multiple Angles
A Transformer doesn’t run a single attention mechanism but several in parallel (the “heads”). Llama 3 70B has 64 per layer.
Why? Because a word can be relevant to another in multiple ways:
Head 1: syntactic relations → "his" ← "developer" (subject)
Head 2: semantic relations → "model" ← "ML/AI" (domain)
Head 3: positional relations → nearby tokens to each other
Head 4: co-reference → "his" ← "developer" (same entity)
...
Each head has its own Q, K, V matrices. They independently learn to capture different types of relationships. The results are concatenated and projected into the output.
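The split-attend-concatenate mechanics can be sketched in NumPy, assuming the common implementation where the model dimension is sliced into equal per-head chunks (the weights here are random stand-ins, not a real model’s):

```python
import numpy as np

def multi_head(x, Wq, Wk, Wv, Wo, n_heads):
    """Toy multi-head attention: split d_model into n_heads slices,
    run scaled dot-product attention per head, concatenate, project out."""
    seq, d = x.shape
    hd = d // n_heads                                  # per-head dimension
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                   # (seq, d) each
    # reshape to (n_heads, seq, hd): each head sees its own slice
    Qh = Q.reshape(seq, n_heads, hd).transpose(1, 0, 2)
    Kh = K.reshape(seq, n_heads, hd).transpose(1, 0, 2)
    Vh = V.reshape(seq, n_heads, hd).transpose(1, 0, 2)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(hd)  # (heads, seq, seq)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                      # per-head softmax
    heads = w @ Vh                                     # (heads, seq, hd)
    concat = heads.transpose(1, 0, 2).reshape(seq, d)  # concatenate heads
    return concat @ Wo                                 # output projection

rng = np.random.default_rng(1)
seq, d, H = 4, 32, 8
x = rng.standard_normal((seq, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
y = multi_head(x, Wq, Wk, Wv, Wo, H)
print(y.shape)                                         # (4, 32)
```

Each head computes its own (seq × seq) score matrix, which is exactly what lets different heads attend to different relationships independently.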
What Attention Does NOT Do
Here’s the crucial point: attention connects tokens to each other, but it doesn’t deeply transform the information. Aside from the softmax, it is built from linear operations — matrix products and a weighted average. It answers “who looks at whom,” not “what does it mean.”
For that, you need the FFN.
Feed-Forward Network: The Model’s Brain 🧠
The Intuition
Let’s go back to our sentence. Attention has understood the relationships:
- “his” → “developer”
- “vLLM” ← “uses”
- “model” → “deployed” → “AWS”
Now the FFN thinks about this information:
“Ok, a developer + vLLM + deployment + AWS → MLOps context / production inference. The next likely word relates to infrastructure, serving, latency…”
The FFN transforms relational understanding into reasoning. This is where the model “thinks.”
The Architecture: Expansion → Activation → Compression
The FFN is surprisingly simple. Two linear layers with an activation in between:
input (4096 dims)
│
▼
[Projection UP] 4096 → 14336 ← expansion ×3.5
│
▼
[SiLU Activation] ← non-linearity
│
▼
[Projection DOWN] 14336 → 4096 ← compression
│
▼
output (4096 dims)
Dimensions vary by model (here Llama 3 8B), but the ratio is typically between ×2.7 and ×4.
Why Expand Then Compress?
It’s an inverted bottleneck, and there’s a deep reason for this design.
The expansion (4096 → 14336) projects the token into a much larger space. Intuitively, it gives the model “more room to think.” In this expanded space, individual dimensions can encode very specific concepts — one neuron for Python code, another for French grammar, another for historical dates.
The SiLU activation is the secret ingredient. Without non-linearity, stacking linear layers is equivalent to a single matrix multiplication — no reasoning possible. The non-linearity allows the network to learn complex transformations:
```python
import torch

# SiLU (Sigmoid Linear Unit) = x * sigmoid(x)
def silu(x):
    return x * torch.sigmoid(x)

# Smooth, allows negative gradients
# Better than ReLU for Transformers in practice
```
The compression (14336 → 4096) forces the network to keep only what matters. It’s a bottleneck that acts as a filter: out of the 14336 activated dimensions, only the most relevant information survives the projection back.
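The whole pipeline — expand, activate, compress — is a two-matrix sketch. Here it is in NumPy with Llama 3 8B’s dimensions but random weights standing in for trained ones:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))           # x * sigmoid(x)

def ffn(x, W_up, W_down):
    """Classic Transformer FFN: expand, activate, compress.
    Shapes follow Llama 3 8B (4096 -> 14336 -> 4096); the weights
    here are random stand-ins, not trained parameters."""
    h = x @ W_up                            # (4096,) -> (14336,): expansion
    h = silu(h)                             # non-linearity between the two projections
    return h @ W_down                       # (14336,) -> (4096,): compression

d, d_ff = 4096, 14336
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
W_up = rng.standard_normal((d, d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d)) * 0.02
y = ffn(x, W_up, W_down)
print(y.shape)                              # (4096,)
```

Unlike attention, nothing here looks at other tokens: the same transformation is applied to each position independently.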
The FFN as Associative Memory
Here’s where it gets fascinating: the FFN weights aren’t just “parameters.” They’re an associative memory encoded during training.
Take an example:
Input: representation of "Paris is the capital of..."
│
▼
FFN activates specific neurons that "know" that:
• Paris → France
• Capital → main city
• European geographical context
│
▼
Output: enriched representation → prediction "France"
Research (Geva et al., 2021) has shown that you can locate specific facts in FFN neurons. Certain neurons fire only for very precise patterns:
- Neuron 3847, layer 24: fires for “the Eiffel Tower is in…” → boosts “Paris”
- Neuron 12903, layer 31: fires for Python code patterns
- Neuron 7291, layer 18: fires for French grammatical structures
That’s why model editing techniques (like ROME — Meng et al., 2022) can modify facts by changing a few FFN weights. You can literally make the model believe “the Eiffel Tower is in London” by editing the right matrices.
Gated FFN: The Modern Version
Modern Transformers (Llama, Mistral, Gemma) use a variant called SwiGLU (Gated Linear Unit with SiLU activation). Instead of two matrices, there are three:
input (4096)
│
├──→ [Gate] 4096 → 14336 ──→ SiLU(gate)
│ │
└──→ [Up] 4096 → 14336 ─────── × (element-wise multiplication)
│
▼
[Down] 14336 → 4096
│
▼
output (4096)
The gate learns to select which dimensions are relevant. It’s like an internal attention mechanism within the FFN itself: “for this token, activate the code neurons; for that one, activate the natural language neurons.”
In practice, SwiGLU consistently outperforms the classic FFN, at a slightly higher parameter cost (~50% more in the FFN, but the quality gain more than compensates).
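The gated variant adds one matrix and one element-wise multiply. A minimal sketch, with tiny stand-in dimensions and random weights:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))           # x * sigmoid(x)

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN (SwiGLU-style) as used by Llama-family models.
    Three matrices instead of two; weights are random, for shapes only."""
    gate = silu(x @ W_gate)                 # decides which dimensions pass through
    up = x @ W_up                           # the "content" path
    return (gate * up) @ W_down             # gate, then compress back down

d, d_ff = 8, 28                             # tiny stand-ins for 4096 / 14336
rng = np.random.default_rng(0)
W_gate, W_up = rng.standard_normal((2, d, d_ff))
W_down = rng.standard_normal((d_ff, d))
x = rng.standard_normal(d)
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)   # (8,)
```

If the gate saturates near zero for a dimension, that dimension’s content is suppressed entirely — which is exactly the “internal selection” behavior described above.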
The Duo in Action: Layer by Layer
Let’s see how the 80 layers of a Transformer process a full request:
"Write a Python function that sorts a list"
Layers 1-10 (low):
Attention → identifies syntactic structure
FFN → encodes tokens into basic concepts
Layers 11-40 (middle):
Attention → connects "Python function" + "sorts" + "list"
FFN → activates knowledge: sorting algorithms, Python syntax,
coding conventions, data types
Layers 41-70 (high):
Attention → organizes the output sequence
FFN → refines the choice: sorted() vs .sort() vs manual implementation,
complexity level appropriate to context
Layers 71-80 (final):
Attention → resolves final dependencies
FFN → projects to vocabulary → next token prediction
Each layer refines a bit more. Low layers capture syntax, middle layers encode semantics, and high layers handle generation. It’s hierarchical processing — from concrete to abstract, then back to concrete.
Attention vs FFN: The Numbers
Here’s the actual breakdown in common models:
| Model | Attention | FFN | Total |
|---|---|---|---|
| Llama 3 8B | ~2.1B (26%) | ~5.6B (70%) | 8B |
| Llama 3 70B | ~17B (24%) | ~49B (70%) | 70B |
| Mistral 7B | ~1.7B (25%) | ~4.8B (71%) | 6.7B |
The FFN accounts for roughly 2/3 of the model’s parameters. That makes sense: it’s where knowledge is stored. Attention is a relatively lightweight mechanism — it “only” needs to know how to connect tokens.
| | Attention | FFN |
|---|---|---|
| Role | Connect tokens to each other | Transform/enrich each token |
| Operation | Token ↔ Token (inter-token) | Token → Token (intra-token) |
| Analogy | Eyes reading the context | Brain thinking |
| Parameters | ~1/3 of the model | ~2/3 of the model |
| Stores | Relational patterns | Facts, knowledge, transformations |
| Nature | Primarily linear | Non-linear (this is where it “thinks”) |
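You can roughly reproduce the Llama 3 8B split with back-of-the-envelope arithmetic — assuming full multi-head attention (the real model uses grouped-query attention, which shrinks the K/V matrices) and a SwiGLU FFN:

```python
# Approximate parameter count for Llama 3 8B-class dimensions.
# Assumes full multi-head attention and a 3-matrix SwiGLU FFN;
# embeddings and norms are ignored, so numbers are ballpark only.
d, d_ff, n_layers = 4096, 14336, 32

attn_per_layer = 4 * d * d            # Q, K, V and output projections
ffn_per_layer = 3 * d * d_ff          # gate, up, down matrices

attn_total = n_layers * attn_per_layer
ffn_total = n_layers * ffn_per_layer

print(f"attention: {attn_total / 1e9:.1f}B")   # attention: 2.1B
print(f"ffn:       {ffn_total / 1e9:.1f}B")    # ffn:       5.6B
```

Per layer, the FFN is already ~2.6× the attention block, which is where the roughly one-third / two-thirds split comes from.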
Why Bigger Models Are Better (and Why That’s a Problem)
Before we get to MoE, we need to understand the fundamental tension that MoE solves.
More Parameters = More Knowledge
A model’s size — its parameter count — is decided before training begins. It’s an architecture choice, not something that grows during training. Think of it like building a warehouse:
- More parameters = a bigger warehouse (more shelves)
- Training = filling the shelves with knowledge
- You don’t add shelves while stocking — you decide the warehouse size upfront, then fill it
"We want a better model"
│
▼
Design a bigger network
- more layers (32 → 80)
- wider dimensions (4096 → 8192)
- larger FFN (14336 → 28672)
│
▼
Initialize weights randomly
(billions of random numbers)
│
▼
Train for weeks on massive data
(weights adjust to encode knowledge)
│
▼
Final model with frozen weights
Each parameter is a number (a weight) in a matrix. More of them means:
- More storage capacity for facts (in the FFN)
- More nuance in relationships (in the attention)
- More room to encode complex patterns
This is remarkably predictable. The scaling laws (Kaplan et al., 2020) showed that model quality improves as a smooth power law with parameter count. Double the parameters, get a predictable quality bump. It barely plateaus — hence the race to build bigger models.
Llama 8B → 8 billion weights → good at general tasks
Llama 70B → 70 billion weights → better reasoning, languages, code
Llama 405B → 405 billion weights → even better, more nuanced
The Scaling Problem
Here’s the catch: in a dense model, every parameter is activated for every single token. Bigger model = proportionally more compute per token.
Dense 13B → 13B FLOPs/token → decent quality
Dense 70B → 70B FLOPs/token → great quality, but 5× slower and costlier
Dense 405B → 405B FLOPs/token → best quality, but 30× slower
You want the knowledge of a 70B model but the speed of a 13B. That sounds impossible — unless you find a way to have a big warehouse but only turn on the lights in the aisles you need. That’s exactly what MoE does.
And That’s Why MoE Replaces the FFN
Now that we understand each component’s role and the scaling problem, the Mixture of Experts (MoE) architecture becomes crystal clear.
The Observation
- Attention must stay shared → all tokens need to see each other. No shortcut possible.
- The FFN is memory / reasoning → it can be specialized.
The MoE Idea
Instead of one big FFN, use several (the “experts”) and a small routing network chooses which ones to activate:
Token
│
▼
┌── Router ───┐
│ (gating) │
▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│Expert 0 │ │Expert 1 │ │Expert 2 │ │ ... N │
│ (FFN) │ │ (FFN) │ │ (FFN) │ │ (FFN) │
└────┬────┘ └────┬────┘ └────┬────┘ └─────────┘
│ │
▼ ▼
Weighted sum (top-K experts)
│
▼
Enriched token
Mixtral 8x7B has 8 experts per layer and activates 2 per token. Result: 46.7B total parameters (the knowledge of 8 experts), but only ~12.9B activated per token (2 out of 8). Best of both worlds: the capacity of a large model, the inference cost of a small one.
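The routing logic itself is small. Here is a toy single-token version in the Mixtral style (softmax taken over the top-k router logits only); the experts are random stand-in FFNs, not trained ones:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def moe_layer(x, experts, W_router, top_k=2):
    """Toy MoE layer: a router scores all experts, only the top-k run,
    and their outputs are combined with renormalized router weights."""
    logits = x @ W_router                         # one score per expert
    top = np.argsort(logits)[-top_k:]             # indices of the top-k experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                  # softmax over the chosen k only
    out = np.zeros_like(x)
    for weight, i in zip(w, top):                 # only k experts do any compute
        W_up, W_down = experts[i]
        out += weight * (silu(x @ W_up) @ W_down)
    return out

d, d_ff, n_experts = 8, 16, 8
rng = np.random.default_rng(0)
experts = [(rng.standard_normal((d, d_ff)), rng.standard_normal((d_ff, d)))
           for _ in range(n_experts)]
W_router = rng.standard_normal((d, n_experts))
x = rng.standard_normal(d)
print(moe_layer(x, experts, W_router).shape)      # (8,)
```

All 8 experts exist in memory, but only 2 matrix pipelines execute per token — that is the entire capacity-vs-compute trick.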
The Company Analogy
Think of it like a consulting firm:
- Dense model = every employee works on every project. 100 employees = 100× the labor cost per project. Thorough, but absurdly expensive.
- MoE model = you have 100 specialists but only call in the 10 most relevant ones per project. 100 people worth of knowledge, 10 people worth of cost.
The warehouse version: MoE builds a massive warehouse (47B parameters of shelf space) but only turns on the lights in the 2 aisles each customer actually needs (12.9B active). The knowledge is all there — you just don’t pay to illuminate the whole building for every query.
Dense 13B → 13B compute → quality: ★★★☆☆
Dense 70B → 70B compute → quality: ★★★★★
MoE 8×7B → 13B compute → quality: ★★★★☆ ← the sweet spot
The real win isn’t raw speed — it’s more intelligence per dollar. Google showed with Switch Transformer that MoE reaches the same quality 4-7× faster during training. At inference, you get near-70B quality for near-13B cost. That’s why GPT-4 is widely believed to be a MoE architecture.
Specialization Emerges Naturally
During training, experts specialize without explicit supervision:
"def train_model(epochs=10):"
│
▼ Attention (understands the structure)
▼ Router → Expert 3 (code) + Expert 7 (ML)
▼ The experts "know" this is Python ML
"The cat sleeps peacefully on the couch"
│
▼ Attention (understands subject/verb/place relations)
▼ Router → Expert 1 (language) + Expert 5 (general)
▼ The experts "know" this is everyday natural language
"SELECT u.name FROM users u JOIN orders o ON u.id = o.user_id"
│
▼ Attention (understands SQL relations)
▼ Router → Expert 3 (code) + Expert 6 (data)
▼ The experts "know" this is relational SQL
Analyses of Mixtral show that certain experts do specialize by domain (code, math, specific languages), even though the specialization isn’t as clean as a human-designed partition — it’s more of a statistical distribution than a clean separation.
Why Attention Stays Shared (and Not the Other Way Around)
You could imagine “attention experts” and a shared FFN. But it doesn’t work well, and the reason is architectural:
- Attention is relational — it needs to see all tokens to establish the right connections. Specializing attention would make the model blind to certain relationships.
- The FFN is independent per token — each token passes through the FFN separately. It’s naturally parallelizable and specializable. The router can send each token to the most relevant expert without affecting other tokens.
- The parameter ratio — the FFN is ~2/3 of the model. That’s where you gain the most by multiplying experts. Multiplying attention (1/3 of the model) would be less efficient.
Practical Implications for Engineers
Understanding this architecture has direct consequences:
Fine-Tuning with LoRA
LoRA (Low-Rank Adaptation) injects adaptation matrices into the model. Now you know where to put them:
- LoRA on attention (Q, K, V, O) → adapts how the model connects tokens. Useful for tasks requiring new relational structures (summarization, translation).
- LoRA on the FFN (gate, up, down) → adapts knowledge and reasoning. Useful for injecting domain knowledge (medical terminology, internal code conventions).
- Both → the most common approach in practice. Adapt both the gaze AND the thinking.
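The LoRA mechanics themselves fit in a few lines. A pure-NumPy sketch of a single adapted matrix, with illustrative names and sizes (not a real library’s API):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Frozen weight W plus a low-rank update: y = xW + (alpha/r) * xAB.
    A is (d, r), B is (r, d); only A and B would be trained."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A @ B)

d, r = 64, 8                                  # rank 8: 2*64*8 = 1024 trainable
rng = np.random.default_rng(0)                # numbers vs 64*64 = 4096 frozen
W = rng.standard_normal((d, d))               # frozen pretrained weight
A = rng.standard_normal((d, r)) * 0.01        # trainable down-projection
B = np.zeros((r, d))                          # trainable up-projection, starts at 0
x = rng.standard_normal(d)
# With B = 0 the adapter is a no-op: the model starts at exactly its
# pretrained behavior, and fine-tuning only learns the low-rank delta.
print(np.allclose(lora_forward(x, W, A, B), x @ W))   # True
```

The same adapter shape works whether W is an attention projection (Q, K, V, O) or an FFN matrix (gate, up, down) — only the choice of target modules changes.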
Pruning and Distillation
If you need to compress a model:
- Reducing attention heads (from 32 to 16) → loses relational nuance but keeps knowledge
- Reducing FFN dimension (from 14336 to 8192) → loses stored knowledge but keeps structural understanding
Both hurt, but differently. The choice depends on your task.
Interpretability
To understand what a model knows → analyze the FFN (which neurons fire for which concepts).
To understand how it reasons on an input → analyze the attention patterns (who looks at whom, in which layers).
Summary
The Transformer is a duo:
- Attention is the eyes 👀 — it reads, connects, and contextualizes. “Who is related to what in this sequence?”
- The FFN is the brain 🧠 — it thinks, transforms, and generates. “What does it mean and what comes next?”
And it’s precisely this separation that makes MoE possible: keep the eyes shared (everyone needs to see the same context) and specialize the brains (each domain gets its own expert).
Next time someone tells you “Attention is All You Need,” you can reply: “Attention is one-third of the model. The FFN does the real work.” 🧠
Further reading:
- Transformer Feed-Forward Layers Are Key-Value Memories — Geva et al., 2021
- Locating and Editing Factual Associations in GPT — Meng et al., 2022 (ROME)
- Attention Is All You Need — Vaswani et al., 2017
- GLU Variants Improve Transformer — Shazeer, 2020 (SwiGLU)
- Scaling Laws for Neural Language Models — Kaplan et al., 2020
- Switch Transformers: Scaling to Trillion Parameter Models — Fedus et al., 2021
- Mixtral of Experts — Jiang et al., 2024