Transformer Anatomy: Attention + FFN Demystified
A deep dive into the Transformer architecture — how attention connects tokens and why the Feed-Forward Network is the real brain of the model. Plus the key to understanding Mixture of Experts (MoE).
Table of Contents
- The Transformer Block: Big Picture
- Self-Attention: The Model’s Eyes 👀
- The Intuition
- The Q/K/V Mechanism
- Multi-Head: Looking from Multiple Angles
- What Attention Does NOT Do
- Feed-Forward Network: The Model’s Brain 🧠
- The Intuition
- The Architecture: Expansion → Activation → Compression
- Why Expand Then Compress?
- The FFN as Associative Memory
- Gated FFN: The Modern Version
- The Duo in Action: Layer by Layer
- Attention vs FFN: The Numbers
- Why Bigger Models Are Better (and Why That’s a Problem)
- More Parameters = More Knowledge
- The Scaling Problem
- And That’s Why MoE Replaces the FFN
- The Observation
- The MoE Idea
- The Company Analogy
- Specialization Emerges Naturally
- Why Attention Stays Shared (and Not the Other Way Around)
- Practical Implications for Engineers
- Fine-Tuning with LoRA
- Pruning and Distillation
- Interpretability
- Summary
When people talk about Transformers, everyone cites attention. “Attention is All You Need,” the 2017 paper, cemented the idea. But in reality, attention is only half the story — and honestly, not the more interesting half.
The real brain of the Transformer is the Feed-Forward Network (FFN). And understanding the Attention + FFN duo is understanding why these models work — and why modern architectures like Mixture of Experts (MoE) make the design choices they do.
This article takes apart a Transformer block piece by piece. No intimidating formulas, no unnecessary jargon. If you know what an LLM is, you’ll follow everything.
The Transformer Block: Big Picture
A model like Llama 3 70B stacks 80 identical blocks. Each block does the same thing:
Input token (8192-dim vector)
│
▼
┌─────────────────────┐
│ Self-Attention │ ← "Who is related to what?"
│ (Multi-Head) │
└─────────┬───────────┘
│ + residual
▼
┌─────────────────────┐
│ Feed-Forward │ ← "What does it mean?"
│ Network (FFN) │
└─────────┬───────────┘
│ + residual
▼
Output token (8192-dim vector)
That’s it. 80 times. Each pass enriches the token’s representation a little more, layer by layer, until the model has enough understanding to predict the next token.
But what do these two components actually do?
Self-Attention: The Model’s Eyes 👀
The Intuition
Take this sentence:
“The developer who uses vLLM deployed his model on AWS”
To understand this sentence, you need to resolve relationships:
- “his” → refers to “the developer” (not vLLM, not AWS)
- “who uses vLLM” → qualifies “the developer”
- “his model” → the developer’s model, deployed on AWS
That’s exactly what attention does: for each token, it looks at all other tokens and computes how relevant each one is. It’s relational pattern matching.
The Q/K/V Mechanism
Under the hood, each token is transformed into three vectors:
- Query (Q) — “What am I looking for?”
- Key (K) — “What do I contain?”
- Value (V) — “If you find me relevant, here’s my content”
Token "his"
Q = "I'm looking for the possessor of this pronoun"
Token "developer"
K = "I'm a human subject, agent of an action"
V = [rich representation of the concept "developer"]
Score("his" → "developer") = Q_his · K_developer = HIGH ✓
Score("his" → "AWS") = Q_his · K_AWS = LOW ✗
The attention score is a simple dot product between Q and K, scaled by the square root of the dimension to keep values stable. The more two vectors point in the same direction, the higher the score. After a softmax normalizes the scores into weights that sum to 1, each token gets a weighted average of the Values from all other tokens.
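The whole computation fits in a few lines of NumPy. Here is a toy single-head sketch with random vectors, just to show the shapes and the softmax-weighted average (names and sizes are illustrative, not from any real model):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention over a toy sequence.

    Q, K, V: (seq_len, d) arrays. Returns one new vector per token:
    a weighted average of all Values, weighted by softmax(QK^T / sqrt(d)).
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (seq, seq): every token scores every token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V                              # weighted average of Values

rng = np.random.default_rng(0)
seq, d = 5, 16                                      # 5 tokens, 16 dims (toy sizes)
Q, K, V = (rng.standard_normal((seq, d)) for _ in range(3))
out = attention(Q, K, V)
print(out.shape)                                    # (5, 16): same shape in, same shape out
```

Note that the output has the same shape as the input: attention remixes information across tokens, it doesn't change the representation space.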
Multi-Head: Looking from Multiple Angles
A Transformer doesn’t run a single attention mechanism but several in parallel (the “heads”). Llama 3 70B has 64 per layer.
Why? Because a word can be relevant to another in multiple ways:
Head 1: syntactic relations → "his" ← "developer" (subject)
Head 2: semantic relations → "model" ← "ML/AI" (domain)
Head 3: positional relations → nearby tokens to each other
Head 4: co-reference → "his" ← "developer" (same entity)
...
Each head has its own Q, K, V matrices. They independently learn to capture different types of relationships. The results are concatenated and projected into the output.
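The split-attend-concatenate mechanics can be sketched in NumPy, assuming the common implementation where the model dimension is sliced into equal per-head chunks (the weights here are random stand-ins, not a real model’s):

```python
import numpy as np

def multi_head(x, Wq, Wk, Wv, Wo, n_heads):
    """Toy multi-head attention: split d_model into n_heads slices,
    run scaled dot-product attention per head, concatenate, project out."""
    seq, d = x.shape
    hd = d // n_heads                                  # per-head dimension
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                   # (seq, d) each
    # reshape to (n_heads, seq, hd): each head sees its own slice
    Qh = Q.reshape(seq, n_heads, hd).transpose(1, 0, 2)
    Kh = K.reshape(seq, n_heads, hd).transpose(1, 0, 2)
    Vh = V.reshape(seq, n_heads, hd).transpose(1, 0, 2)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(hd)  # (heads, seq, seq)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                      # per-head softmax
    heads = w @ Vh                                     # (heads, seq, hd)
    concat = heads.transpose(1, 0, 2).reshape(seq, d)  # concatenate heads
    return concat @ Wo                                 # output projection

rng = np.random.default_rng(1)
seq, d, H = 4, 32, 8
x = rng.standard_normal((seq, d))
Wq, Wk, Wv, Wo = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))
y = multi_head(x, Wq, Wk, Wv, Wo, H)
print(y.shape)                                         # (4, 32)
```

Each head computes its own (seq × seq) score matrix, which is exactly what lets different heads attend to different relationships independently.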
What Attention Does NOT Do
Here’s the crucial point: attention connects tokens to each other, but it doesn’t deeply transform the information. Aside from the softmax, it is built from linear operations — matrix products and a weighted average. It answers “who looks at whom,” not “what does it mean.”
For that, you need the FFN.
Feed-Forward Network: The Model’s Brain 🧠
The Intuition
Let’s go back to our sentence. Attention has understood the relationships:
- “his” → “developer”
- “vLLM” ← “uses”
- “model” → “deployed” → “AWS”
Now the FFN thinks about this information:
“Ok, a developer + vLLM + deployment + AWS → MLOps context / production inference. The next likely word relates to infrastructure, serving, latency…”
The FFN transforms relational understanding into reasoning. This is where the model “thinks.”
The Architecture: Expansion → Activation → Compression
The FFN is surprisingly simple. Two linear layers with an activation in between:
input (4096 dims)
│
▼
[Projection UP] 4096 → 14336 ← expansion ×3.5
│
▼
[SiLU Activation] ← non-linearity
│
▼
[Projection DOWN] 14336 → 4096 ← compression
│
▼
output (4096 dims)
Dimensions vary by model (here Llama 3 8B), but the ratio is typically between ×2.7 and ×4.
Why Expand Then Compress?
It’s an inverted bottleneck, and there’s a deep reason for this design.
The expansion (4096 → 14336) projects the token into a much larger space. Intuitively, it gives the model “more room to think.” In this expanded space, individual dimensions can encode very specific concepts — one neuron for Python code, another for French grammar, another for historical dates.
The SiLU activation is the secret ingredient. Without non-linearity, stacking linear layers is equivalent to a single matrix multiplication — no reasoning possible. The non-linearity allows the network to learn complex transformations:
```python
import torch

# SiLU (Sigmoid Linear Unit) = x * sigmoid(x)
def silu(x):
    return x * torch.sigmoid(x)

# Smooth, allows negative gradients
# Better than ReLU for Transformers in practice
```
The compression (14336 → 4096) forces the network to keep only what matters. It’s a bottleneck that acts as a filter: out of the 14336 activated dimensions, only the most relevant information survives the projection back.
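The whole pipeline — expand, activate, compress — is a two-matrix sketch. Here it is in NumPy with Llama 3 8B’s dimensions but random weights standing in for trained ones:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))           # x * sigmoid(x)

def ffn(x, W_up, W_down):
    """Classic Transformer FFN: expand, activate, compress.
    Shapes follow Llama 3 8B (4096 -> 14336 -> 4096); the weights
    here are random stand-ins, not trained parameters."""
    h = x @ W_up                            # (4096,) -> (14336,): expansion
    h = silu(h)                             # non-linearity between the two projections
    return h @ W_down                       # (14336,) -> (4096,): compression

d, d_ff = 4096, 14336
rng = np.random.default_rng(0)
x = rng.standard_normal(d)
W_up = rng.standard_normal((d, d_ff)) * 0.02
W_down = rng.standard_normal((d_ff, d)) * 0.02
y = ffn(x, W_up, W_down)
print(y.shape)                              # (4096,)
```

Unlike attention, nothing here looks at other tokens: the same transformation is applied to each position independently.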
The FFN as Associative Memory
Here’s where it gets fascinating: the FFN weights aren’t just “parameters.” They’re an associative memory encoded during training.
Take an example:
Input: representation of "Paris is the capital of..."
│
▼
FFN activates specific neurons that "know" that:
• Paris → France
• Capital → main city
• European geographical context
│
▼
Output: enriched representation → prediction "France"
Research (Geva et al., 2021) has shown that you can locate specific facts in FFN neurons. Certain neurons fire only for very precise patterns:
- Neuron 3847, layer 24: fires for “the Eiffel Tower is in…” → boosts “Paris”
- Neuron 12903, layer 31: fires for Python code patterns
- Neuron 7291, layer 18: fires for French grammatical structures
That’s why model editing techniques (like ROME — Meng et al., 2022) can modify facts by changing a few FFN weights. You can literally make the model believe “the Eiffel Tower is in London” by editing the right matrices.
Gated FFN: The Modern Version
Modern Transformers (Llama, Mistral, Gemma) use a variant called SwiGLU (Gated Linear Unit with SiLU activation). Instead of two matrices, there are three:
input (4096)
│
├──→ [Gate] 4096 → 14336 ──→ SiLU(gate)
│ │
└──→ [Up] 4096 → 14336 ─────── × (element-wise multiplication)
│
▼
[Down] 14336 → 4096
│
▼
output (4096)
The gate learns to select which dimensions are relevant. It’s like an internal attention mechanism within the FFN itself: “for this token, activate the code neurons; for that one, activate the natural language neurons.”
In practice, SwiGLU consistently outperforms the classic FFN, at a slightly higher parameter cost (~50% more in the FFN, but the quality gain more than compensates).
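The gated variant adds one matrix and one element-wise multiply. A minimal sketch, with tiny stand-in dimensions and random weights:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))           # x * sigmoid(x)

def swiglu_ffn(x, W_gate, W_up, W_down):
    """Gated FFN (SwiGLU-style) as used by Llama-family models.
    Three matrices instead of two; weights are random, for shapes only."""
    gate = silu(x @ W_gate)                 # decides which dimensions pass through
    up = x @ W_up                           # the "content" path
    return (gate * up) @ W_down             # gate, then compress back down

d, d_ff = 8, 28                             # tiny stand-ins for 4096 / 14336
rng = np.random.default_rng(0)
W_gate, W_up = rng.standard_normal((2, d, d_ff))
W_down = rng.standard_normal((d_ff, d))
x = rng.standard_normal(d)
print(swiglu_ffn(x, W_gate, W_up, W_down).shape)   # (8,)
```

If the gate saturates near zero for a dimension, that dimension’s content is suppressed entirely — which is exactly the “internal selection” behavior described above.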
The Duo in Action: Layer by Layer
Let’s see how the 80 layers of a Transformer process a full request:
"Write a Python function that sorts a list"
Layers 1-10 (low):
Attention → identifies syntactic structure
FFN → encodes tokens into basic concepts
Layers 11-40 (middle):
Attention → connects "Python function" + "sorts" + "list"
FFN → activates knowledge: sorting algorithms, Python syntax,
coding conventions, data types
Layers 41-70 (high):
Attention → organizes the output sequence
FFN → refines the choice: sorted() vs .sort() vs manual implementation,
complexity level appropriate to context
Layers 71-80 (final):
Attention → resolves final dependencies
FFN → projects to vocabulary → next token prediction
Each layer refines a bit more. Low layers capture syntax, middle layers encode semantics, and high layers handle generation. It’s hierarchical processing — from concrete to abstract, then back to concrete.
Attention vs FFN: The Numbers
Here’s the actual breakdown in common models:
| Model | Attention | FFN | Total |
|---|---|---|---|
| Llama 3 8B | ~2.1B (26%) | ~5.6B (70%) | 8B |
| Llama 3 70B | ~17B (24%) | ~49B (70%) | 70B |
| Mistral 7B | ~1.7B (25%) | ~4.8B (71%) | 6.7B |
The FFN accounts for roughly 2/3 of the model’s parameters. That makes sense: it’s where knowledge is stored. Attention is a relatively lightweight mechanism — it “only” needs to know how to connect tokens.
| | Attention | FFN |
|---|---|---|
| Role | Connect tokens to each other | Transform/enrich each token |
| Operation | Token ↔ Token (inter-token) | Token → Token (intra-token) |
| Analogy | Eyes reading the context | Brain thinking |
| Parameters | ~1/3 of the model | ~2/3 of the model |
| Stores | Relational patterns | Facts, knowledge, transformations |
| Nature | Primarily linear | Non-linear (this is where it “thinks”) |
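You can roughly reproduce the Llama 3 8B split with back-of-the-envelope arithmetic — assuming full multi-head attention (the real model uses grouped-query attention, which shrinks the K/V matrices) and a SwiGLU FFN:

```python
# Approximate parameter count for Llama 3 8B-class dimensions.
# Assumes full multi-head attention and a 3-matrix SwiGLU FFN;
# embeddings and norms are ignored, so numbers are ballpark only.
d, d_ff, n_layers = 4096, 14336, 32

attn_per_layer = 4 * d * d            # Q, K, V and output projections
ffn_per_layer = 3 * d * d_ff          # gate, up, down matrices

attn_total = n_layers * attn_per_layer
ffn_total = n_layers * ffn_per_layer

print(f"attention: {attn_total / 1e9:.1f}B")   # attention: 2.1B
print(f"ffn:       {ffn_total / 1e9:.1f}B")    # ffn:       5.6B
```

Per layer, the FFN is already ~2.6× the attention block, which is where the roughly one-third / two-thirds split comes from.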
Why Bigger Models Are Better (and Why That’s a Problem)
Before we get to MoE, we need to understand the fundamental tension that MoE solves.
More Parameters = More Knowledge
A model’s size — its parameter count — is decided before training begins. It’s an architecture choice, not something that grows during training. Think of it like building a warehouse:
- More parameters = a bigger warehouse (more shelves)
- Training = filling the shelves with knowledge
- You don’t add shelves while stocking — you decide the warehouse size upfront, then fill it
"We want a better model"
│
▼
Design a bigger network
- more layers (32 → 80)
- wider dimensions (4096 → 8192)
- larger FFN (14336 → 28672)
│
▼
Initialize weights randomly
(billions of random numbers)
│
▼
Train for weeks on massive data
(weights adjust to encode knowledge)
│
▼
Final model with frozen weights
Each parameter is a number (a weight) in a matrix. More of them means:
- More storage capacity for facts (in the FFN)
- More nuance in relationships (in the attention)
- More room to encode complex patterns
This is remarkably predictable. The scaling laws (Kaplan et al., 2020) showed that model quality improves as a smooth power law with parameter count. Double the parameters, get a predictable quality bump. It barely plateaus — hence the race to build bigger models.
Llama 8B → 8 billion weights → good at general tasks
Llama 70B → 70 billion weights → better reasoning, languages, code
Llama 405B → 405 billion weights → even better, more nuanced
The Scaling Problem
Here’s the catch: in a dense model, every parameter is activated for every single token. Bigger model = proportionally more compute per token.
Dense 13B → 13B FLOPs/token → decent quality
Dense 70B → 70B FLOPs/token → great quality, but 5× slower and costlier
Dense 405B → 405B FLOPs/token → best quality, but 30× slower
You want the knowledge of a 70B model but the speed of a 13B. That sounds impossible — unless you find a way to have a big warehouse but only turn on the lights in the aisles you need. That’s exactly what MoE does.
And That’s Why MoE Replaces the FFN
Now that we understand each component’s role and the scaling problem, the Mixture of Experts (MoE) architecture becomes crystal clear.
The Observation
- Attention must stay shared → all tokens need to see each other. No shortcut possible.
- The FFN is memory / reasoning → it can be specialized.
The MoE Idea
Instead of one big FFN, use several (the “experts”) and a small routing network chooses which ones to activate:
Token
│
▼
┌── Router ───┐
│ (gating) │
▼ ▼
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐
│Expert 0 │ │Expert 1 │ │Expert 2 │ │ ... N │
│ (FFN) │ │ (FFN) │ │ (FFN) │ │ (FFN) │
└────┬────┘ └────┬────┘ └────┬────┘ └─────────┘
│ │
▼ ▼
Weighted sum (top-K experts)
│
▼
Enriched token
Mixtral 8x7B has 8 experts per layer and activates 2 per token. Result: 46.7B total parameters (the knowledge of 8 experts), but only ~12.9B activated per token (2 out of 8). Best of both worlds: the capacity of a large model, the inference cost of a small one.
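The routing logic itself is small. Here is a toy single-token version in the Mixtral style (softmax taken over the top-k router logits only); the experts are random stand-in FFNs, not trained ones:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))

def moe_layer(x, experts, W_router, top_k=2):
    """Toy MoE layer: a router scores all experts, only the top-k run,
    and their outputs are combined with renormalized router weights."""
    logits = x @ W_router                         # one score per expert
    top = np.argsort(logits)[-top_k:]             # indices of the top-k experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                  # softmax over the chosen k only
    out = np.zeros_like(x)
    for weight, i in zip(w, top):                 # only k experts do any compute
        W_up, W_down = experts[i]
        out += weight * (silu(x @ W_up) @ W_down)
    return out

d, d_ff, n_experts = 8, 16, 8
rng = np.random.default_rng(0)
experts = [(rng.standard_normal((d, d_ff)), rng.standard_normal((d_ff, d)))
           for _ in range(n_experts)]
W_router = rng.standard_normal((d, n_experts))
x = rng.standard_normal(d)
print(moe_layer(x, experts, W_router).shape)      # (8,)
```

All 8 experts exist in memory, but only 2 matrix pipelines execute per token — that is the entire capacity-vs-compute trick.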
The Company Analogy
Think of it like a consulting firm:
- Dense model = every employee works on every project. 100 employees = 100× the labor cost per project. Thorough, but absurdly expensive.
- MoE model = you have 100 specialists but only call in the 10 most relevant ones per project. 100 people worth of knowledge, 10 people worth of cost.
The warehouse version: MoE builds a massive warehouse (47B parameters of shelf space) but only turns on the lights in the 2 aisles each customer actually needs (12.9B active). The knowledge is all there — you just don’t pay to illuminate the whole building for every query.
Dense 13B → 13B compute → quality: ★★★☆☆
Dense 70B → 70B compute → quality: ★★★★★
MoE 8×7B → 13B compute → quality: ★★★★☆ ← the sweet spot
The real win isn’t raw speed — it’s more intelligence per dollar. Google showed with Switch Transformer that MoE reaches the same quality 4-7× faster during training. At inference, you get near-70B quality for near-13B cost. That’s why GPT-4 is widely believed to be a MoE architecture.
Specialization Emerges Naturally
During training, experts specialize without explicit supervision:
"def train_model(epochs=10):"
│
▼ Attention (understands the structure)
▼ Router → Expert 3 (code) + Expert 7 (ML)
▼ The experts "know" this is Python ML
"The cat sleeps peacefully on the couch"
│
▼ Attention (understands subject/verb/place relations)
▼ Router → Expert 1 (language) + Expert 5 (general)
▼ The experts "know" this is everyday natural language
"SELECT u.name FROM users u JOIN orders o ON u.id = o.user_id"
│
▼ Attention (understands SQL relations)
▼ Router → Expert 3 (code) + Expert 6 (data)
▼ The experts "know" this is relational SQL
Analyses of Mixtral show that certain experts do specialize by domain (code, math, specific languages), even though the specialization isn’t as clean as a human-designed partition — it’s more of a statistical distribution than a clean separation.
Why Attention Stays Shared (and Not the Other Way Around)
You could imagine “attention experts” and a shared FFN. But it doesn’t work well, and the reason is architectural:
- Attention is relational — it needs to see all tokens to establish the right connections. Specializing attention would make the model blind to certain relationships.
- The FFN is independent per token — each token passes through the FFN separately. It’s naturally parallelizable and specializable. The router can send each token to the most relevant expert without affecting other tokens.
- The parameter ratio — the FFN is ~2/3 of the model. That’s where you gain the most by multiplying experts. Multiplying attention (1/3 of the model) would be less efficient.
Practical Implications for Engineers
Understanding this architecture has direct consequences:
Fine-Tuning with LoRA
LoRA (Low-Rank Adaptation) injects adaptation matrices into the model. Now you know where to put them:
- LoRA on attention (Q, K, V, O) → adapts how the model connects tokens. Useful for tasks requiring new relational structures (summarization, translation).
- LoRA on the FFN (gate, up, down) → adapts knowledge and reasoning. Useful for injecting domain knowledge (medical terminology, internal code conventions).
- Both → the most common approach in practice. Adapt both the gaze AND the thinking.
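The LoRA mechanics themselves fit in a few lines. A pure-NumPy sketch of a single adapted matrix, with illustrative names and sizes (not a real library’s API):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=16):
    """Frozen weight W plus a low-rank update: y = xW + (alpha/r) * xAB.
    A is (d, r), B is (r, d); only A and B would be trained."""
    r = A.shape[1]
    return x @ W + (alpha / r) * (x @ A @ B)

d, r = 64, 8                                  # rank 8: 2*64*8 = 1024 trainable
rng = np.random.default_rng(0)                # numbers vs 64*64 = 4096 frozen
W = rng.standard_normal((d, d))               # frozen pretrained weight
A = rng.standard_normal((d, r)) * 0.01        # trainable down-projection
B = np.zeros((r, d))                          # trainable up-projection, starts at 0
x = rng.standard_normal(d)
# With B = 0 the adapter is a no-op: the model starts at exactly its
# pretrained behavior, and fine-tuning only learns the low-rank delta.
print(np.allclose(lora_forward(x, W, A, B), x @ W))   # True
```

The same adapter shape works whether W is an attention projection (Q, K, V, O) or an FFN matrix (gate, up, down) — only the choice of target modules changes.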
Pruning and Distillation
If you need to compress a model:
- Reducing attention heads (from 32 to 16) → loses relational nuance but keeps knowledge
- Reducing FFN dimension (from 14336 to 8192) → loses stored knowledge but keeps structural understanding
Both hurt, but differently. The choice depends on your task.
Interpretability
To understand what a model knows → analyze the FFN (which neurons fire for which concepts).
To understand how it reasons on an input → analyze the attention patterns (who looks at whom, in which layers).
Summary
The Transformer is a duo:
- Attention is the eyes 👀 — it reads, connects, and contextualizes. “Who is related to what in this sequence?”
- The FFN is the brain 🧠 — it thinks, transforms, and generates. “What does it mean and what comes next?”
And it’s precisely this separation that makes MoE possible: keep the eyes shared (everyone needs to see the same context) and specialize the brains (each domain gets its own expert).
Next time someone tells you “Attention is All You Need,” you can reply: “Attention is one-third of the model. The FFN does the real work.” 🧠
Further reading:
- Transformer Feed-Forward Layers Are Key-Value Memories — Geva et al., 2021
- Locating and Editing Factual Associations in GPT — Meng et al., 2022 (ROME)
- Attention Is All You Need — Vaswani et al., 2017
- GLU Variants Improve Transformer — Shazeer, 2020 (SwiGLU)
- Scaling Laws for Neural Language Models — Kaplan et al., 2020
- Switch Transformers: Scaling to Trillion Parameter Models — Fedus et al., 2021
- Mixtral of Experts — Jiang et al., 2024