LLM Architecture Explained Simply: 10 Questions From Prompt to Token

A beginner-friendly walkthrough of how an LLM actually works end-to-end: from typing a prompt to receiving a response — covering tokenization, embeddings, Transformer layers, KV cache, the training loop, embeddings for search, and why decoder-only models won.

Alexandre Agius

AWS Solutions Architect

16 min read

You type a prompt, hit enter, and tokens start streaming back. But what actually happens between your text and the model’s output? This post answers 10 simple questions that trace the complete path — from raw text to generated token — with no math prerequisites.

The Problem

Every explanation of LLMs either starts with the attention formula (losing 90% of readers) or stays so high-level it’s useless for practical work. You end up knowing that “Transformers use attention” without understanding what that means in practice.

Here are 10 questions that, once answered, give you a solid mental model of how LLMs work:

  1. How does an LLM work?
  2. How does inference work?
  3. How does training work?
  4. What do Attention and FFN actually do?
  5. How does the model know when to stop?
  6. What’s a token? How does vocabulary work?
  7. What’s the KV cache?
  8. What are embeddings? How are they used for search?
  9. What’s the difference between encoder and decoder?
  10. What’s decided before training vs learned during training?

If you deploy, fine-tune, or evaluate LLMs, these aren’t academic questions — they directly affect latency, memory usage, cost, and output quality.

The Solution

The entire LLM pipeline is a sequence of six stages. Every model — GPT-4, Claude, Mistral, Llama — follows this exact flow:

Prompt → Tokenizer → Embeddings → Transformer Stack → LM Head → Output Token

                                                              (loop until EOS)

*End-to-end LLM pipeline: from user prompt through tokenization, embedding, Transformer layers with KV cache, and LM head sampling, to an output token with an autoregressive loop.*

Each stage transforms the data into a different representation. Understanding these representations is understanding LLMs.

How It Works

Stage 1: Tokenization — Text to Numbers

LLMs don’t see text. They see sequences of integers. The tokenizer converts text into token IDs using a pre-built vocabulary.

Most modern models use Byte-Pair Encoding (BPE): start with individual characters, then iteratively merge the most frequent pairs until you reach your target vocabulary size (typically 32K-128K entries).

Input:  "The capital of France is"


Tokenize: ["The", " capital", " of", " France", " is"]


Lookup:   [464, 4891, 315, 6629, 374]

Key properties of tokenization:

| Property | Detail |
| --- | --- |
| Vocabulary size | 32K (Mistral), 128K (Llama 3), 100K (GPT-4) |
| Algorithm | BPE — byte-level, merge frequent pairs |
| Granularity | Subword — common words are one token, rare words split into pieces |
| Language efficiency | English ≈ 1 token per word. French ≈ 1.3-1.5x more tokens. Chinese ≈ 2x. |
| Reversible | Always — you can decode token IDs back to text |

Why this matters practically:

  • Pricing — API providers charge per token, not per word. French prompts cost ~30-50% more than equivalent English ones.
  • Context window — a “128K context” means 128K tokens, not characters. English gets ~100K words, while code gets significantly less (variable names, punctuation all consume tokens).
  • Vocabulary trade-off — bigger vocab = fewer tokens per text (cheaper, faster) but larger embedding matrix (more memory). Llama 3 quadrupled vocab from 32K to 128K specifically to improve multilingual and code efficiency.
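The merge loop at the heart of BPE is easy to see with a toy example. This is an illustrative sketch in plain Python with a made-up corpus and merge count; real tokenizers (tiktoken, SentencePiece) work at the byte level, handle whitespace markers, and are heavily optimized:

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy byte-pair encoding: repeatedly merge the most frequent
    adjacent symbol pair. Illustrative only -- real tokenizers
    operate on bytes and train on billions of words."""
    # Represent each word as a tuple of symbols (characters to start).
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair across the corpus, weighted by frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        # Rewrite the corpus with the winning pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges, corpus

merges, corpus = bpe_merges(["low", "low", "lower", "lowest"], num_merges=3)
print(merges)  # first merge is ('l', 'o') -- it appears in every word
```

After three merges, "low" has become a single token while the rarer suffixes remain split, which is exactly the common-word/rare-word behavior described above.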

Stage 2: Embeddings — Numbers to Meaning

Token IDs are just integers — they carry no semantic information. The embedding layer converts each ID into a dense vector that captures meaning.

Token ID 6629 ("France")


Embedding matrix lookup (128K × 4096)


[0.023, -0.451, 0.892, ..., 0.117]  ← 4096-dimensional vector

This vector is learned during training. Similar concepts end up near each other in this space — “France” and “Germany” have similar vectors, while “France” and “function” are far apart.

Positional encoding is added to give the model a sense of order. The model needs to know that “dog bites man” means something different from “man bites dog.” Modern models use RoPE (Rotary Position Embedding) — a rotation applied to the vector that encodes relative position information.

After this stage, the input is a matrix: [sequence_length × model_dimension]. For our 5-token prompt on a 4096-dim model, that’s [5 × 4096].
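The lookup itself is just indexing into a table. A minimal sketch with toy dimensions (the IDs and sizes below are invented; real models use matrices on the order of 128K × 4096):

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 10, 8  # toy sizes; real models use e.g. 128K x 4096

# The embedding "matrix" is a lookup table: one learned vector per token ID.
embedding = [[random.uniform(-1, 1) for _ in range(DIM)] for _ in range(VOCAB_SIZE)]

token_ids = [4, 8, 3, 6, 3]  # hypothetical IDs for a 5-token prompt
hidden = [embedding[t] for t in token_ids]  # shape: [seq_len x dim]

print(len(hidden), len(hidden[0]))  # 5 8 -- the [5 x 8] input matrix
```

Note that the two occurrences of token ID 3 map to the identical vector; it is the positional encoding (applied afterwards) that lets the model tell them apart.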

Embeddings beyond LLMs: vectors for search

The same concept of “text as vectors” powers embedding models — specialized models trained not to predict the next token, but to map text into vectors where similar meanings are close together. Unlike LLM embeddings (which are an internal layer), embedding models produce a single vector per input text, optimized for similarity search.

"The cat sat on the mat"          → [0.82, 0.15, -0.33, ...]
"A feline rested on a rug"        → [0.79, 0.18, -0.30, ...]  ← close (similar meaning)
"Stock prices rose sharply today"  → [-0.45, 0.91, 0.12, ...]  ← far (different meaning)

These vectors enable RAG (Retrieval-Augmented Generation): embed a query, search a vector database (FAISS, Pinecone, OpenSearch) for the closest document vectors using cosine similarity, then pass the retrieved text (not the vectors) to the LLM as context. The embedding is only used for search — the LLM reads plain text.

1. Query: "How do cats behave?"     → embed → vector → search FAISS
2. FAISS returns: document IDs [0, 2]  (not text, not vectors — just IDs)
3. You look up: texts[0] → "Felines are independent animals..."
4. Prompt to LLM: "Context: Felines are independent animals...
                    Question: How do cats behave?"
5. LLM generates answer from the context
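The retrieval step above can be sketched in a few lines. The 3-dimensional vectors are hypothetical stand-ins for real embedding-model output (typically 384-3072 dimensions), and the brute-force loop stands in for what FAISS does with approximate indexes at scale:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Hypothetical document store: ID -> (text, embedding vector).
docs = {
    0: ("Felines are independent animals...", [0.82, 0.15, -0.33]),
    1: ("Stock prices rose sharply today", [-0.45, 0.91, 0.12]),
    2: ("Cats sleep most of the day", [0.79, 0.18, -0.30]),
}
query_vec = [0.80, 0.17, -0.31]  # pretend output of embed("How do cats behave?")

# Rank by similarity, keep the top-2 IDs -- then pass the TEXT to the LLM.
ranked = sorted(docs, key=lambda i: cosine(query_vec, docs[i][1]), reverse=True)[:2]
context = " ".join(docs[i][0] for i in ranked)
prompt = f"Context: {context}\nQuestion: How do cats behave?"
print(ranked)  # the two cat documents win; the stock document is filtered out
```

The design point: the vectors never reach the LLM. They exist only to decide which plain-text chunks end up in the prompt.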

Stage 3: The Transformer Stack — Where Thinking Happens

This is the core of the model. The embedding vectors pass through a stack of identical Transformer blocks — 32 layers for Mistral 7B, 80 for Llama 70B. Each block does two things:

  1. Self-Attention — “Who is related to what?” Looks at all tokens in the sequence and computes how much each token should influence every other token. This is where “France” connects to “capital” and “is.”

  2. Feed-Forward Network (FFN) — “What does it mean?” Transforms each token’s representation through learned knowledge. This is where the model “remembers” that the capital of France is Paris. The FFN holds ~2/3 of the model’s parameters and acts as an associative memory.

Each layer refines the representation a bit more:

Layers 1-10:   Syntactic structure (grammar, word order)
Layers 11-50:  Semantic understanding (meaning, facts, relationships)
Layers 51-80:  Generation preparation (what token comes next)

For a deep dive into the Attention + FFN duo, see Transformer Anatomy: Attention + FFN Demystified.

Stage 4: The LM Head — Vectors to Probabilities

After the final Transformer layer, each token position has a 4096-dimensional vector that encodes “everything the model understands about what should come next.” The LM head converts this into a probability over the entire vocabulary:

Final hidden state: [4096-dim vector]


Linear projection: 4096 → 128,000 (vocab size)


Softmax: convert raw scores to probabilities


{ "Paris": 0.87, "Lyon": 0.03, "the": 0.02, "Berlin": 0.01, ... }

This gives a probability for every token in the vocabulary. The model doesn’t “pick a word” — it produces a distribution over all possible next tokens.
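The projection-plus-softmax step is easy to sketch. The logits below are invented for a toy 4-token vocabulary; only the softmax mechanics are real:

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores from the linear projection, one per vocab entry.
vocab = ["Paris", "Lyon", "the", "Berlin"]
logits = [9.2, 5.8, 5.4, 4.7]

probs = softmax(logits)
dist = dict(zip(vocab, (round(p, 2) for p in probs)))
print(dist)  # "Paris" dominates, but every token gets some probability
```

Even wildly implausible tokens get a nonzero probability, which is why sampling settings (next stage) matter so much.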

Stage 5: Sampling — Choosing the Next Token

The sampling strategy controls how the model selects from that probability distribution. This is where temperature, top-k, and top-p come in:

| Parameter | What It Does | Effect |
| --- | --- | --- |
| Temperature | Scales the logits before softmax. T < 1 sharpens (more deterministic), T > 1 flattens (more creative). | T=0: always picks highest probability. T=1: natural distribution. T=2: very random. |
| Top-k | Only consider the top K most probable tokens. | k=1: greedy. k=50: considers 50 options. |
| Top-p (nucleus) | Only consider tokens whose cumulative probability reaches P. | p=0.9: considers the smallest set of tokens covering 90% probability mass. |

In practice, most serving setups use temperature=0.7 with top_p=0.9 for a balance of coherence and variety. Coding tasks use lower temperature (0.1-0.3) for more deterministic output.
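Temperature and top-p compose naturally. This is a toy sketch with invented logits; production samplers also apply top-k, repetition penalties, and run batched on the GPU:

```python
import math
import random

def sample(logits, temperature=0.7, top_p=0.9, rng=random):
    """Temperature + nucleus (top-p) sampling over a logit vector."""
    scaled = [x / temperature for x in logits]  # temperature scaling
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the smallest set of tokens whose cumulative probability >= top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the nucleus and draw one token index.
    z = sum(probs[i] for i in kept)
    r, acc = rng.random() * z, 0.0
    for i in kept:
        acc += probs[i]
        if r <= acc:
            return i
    return kept[-1]

random.seed(42)
logits = [9.2, 5.8, 5.4, 4.7]  # toy scores for "Paris", "Lyon", "the", "Berlin"
picks = [sample(logits) for _ in range(100)]
print(picks.count(0))  # the highest-probability token dominates
```

With this sharp a distribution the nucleus collapses to a single token; flatter logits (or higher temperature) would let the runners-up appear.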

The selected token is appended to the sequence, and we loop back to Stage 3.

Stage 6: The Autoregressive Loop — Token by Token

LLMs generate text one token at a time. The selected token is fed back as input, and the entire Transformer stack runs again to produce the next token. This continues until either:

  • The model outputs an EOS (End of Sequence) token — a special vocabulary entry that signals “I’m done”
  • The sequence hits the maximum length limit
  • The user/system stops generation

Step 1: "The capital of France is" → model predicts "Paris"
Step 2: "The capital of France is Paris" → model predicts "."
Step 3: "The capital of France is Paris." → model predicts EOS
→ Done. Return "Paris."

This is why LLMs can’t “plan ahead” — each token is generated based solely on what came before. The model doesn’t see the full response, draft it, then output it. It commits to each token as it goes.
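The loop itself is trivial once the model is abstracted away. In this sketch the "model" is a hard-coded lookup table standing in for the full forward pass, just to make the stop conditions concrete:

```python
# Hypothetical next-token "model": context -> predicted token.
# A real model computes this with the full Transformer stack.
NEXT = {
    ("The", "capital", "of", "France", "is"): "Paris",
    ("The", "capital", "of", "France", "is", "Paris"): ".",
    ("The", "capital", "of", "France", "is", "Paris", "."): "<EOS>",
}
MAX_TOKENS = 32

tokens = ["The", "capital", "of", "France", "is"]
while True:
    nxt = NEXT[tuple(tokens)]
    if nxt == "<EOS>" or len(tokens) >= MAX_TOKENS:  # EOS or length limit
        break
    tokens.append(nxt)  # feed the chosen token back as input

print(" ".join(tokens[5:]))  # the generated continuation
```

Each iteration commits one token and feeds it back in; there is no buffer where a draft answer could be revised.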

Prefill vs Decode: Why the First Token Is Slow

Inference has two distinct phases:

Prefill — Process the entire input prompt in one forward pass. All input tokens are processed in parallel (they’re all known). This is compute-heavy but efficient because of parallelism. The KV cache is populated for all input tokens.

Decode — Generate output tokens one at a time. Each step processes only the new token, reading the KV cache for all previous tokens. This is memory-bandwidth-bound — the GPU spends most of its time reading cached values.

Prefill:  [████████████████████]  ← all input tokens, one pass
                                     Time: 200ms (for 1000 input tokens)

Decode:   [█][█][█][█][█][█]...   ← one token per step
                                     Time: 30ms per token

This is why TTFT (Time to First Token) is dominated by prompt length. A 100-token prompt has fast TTFT. A 10,000-token RAG prompt has much higher TTFT, regardless of output length. After prefill, decode speed is roughly constant.

The KV Cache: Memory That Makes Decode Fast

During prefill, the Attention mechanism computes keys and values for every input token. These are stored in the KV cache — a per-request memory buffer that lives in GPU memory. During decode, each new token only needs to compute its own key/value pair and attend to the cached ones, avoiding recomputation of the entire sequence.

Prefill: "What is the capital of France?"
         → Compute K,V for all 7 tokens → Store in KV cache

Decode step 1: "The"
         → Compute K,V for "The" only → Attend to 7 cached + 1 new → Predict next

Decode step 2: "capital"
         → Compute K,V for "capital" only → Attend to 8 cached + 1 new → Predict next

Without the KV cache, every decode step would reprocess the entire sequence from scratch — making generation quadratically slower.
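A toy single-head decode step shows what the cache saves. The "projections" below are identity stand-ins for the learned W_k/W_v matrices, and the vectors are invented; the bookkeeping is the point:

```python
import math

def attend(q, keys, values):
    """Scaled dot-product attention of one query over cached keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [x / z for x in w]  # softmax over attention scores
    dim = len(values[0])
    return [sum(w[i] * values[i][d] for i in range(len(values))) for d in range(dim)]

k_cache, v_cache = [], []

# Prefill: compute K,V for every prompt token in one pass and cache them.
prompt_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for x in prompt_vectors:
    k_cache.append(x)  # real models apply learned W_k / W_v projections here
    v_cache.append(x)

# Decode: each new token adds exactly ONE cache entry; nothing is recomputed.
new_token_vec = [0.5, 0.5]
k_cache.append(new_token_vec)
v_cache.append(new_token_vec)
out = attend(new_token_vec, k_cache, v_cache)
print(len(k_cache), [round(v, 3) for v in out])  # cache grew from 3 to 4 entries
```

Without the cache, this decode step would have to recompute K and V for all four positions instead of one, and that cost would repeat at every step.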

KV cache lifecycle:

| Phase | What Happens |
| --- | --- |
| Prefill | KV cache is built (all input tokens) |
| Decode | KV cache grows by one entry per step |
| End of request | KV cache is discarded |

The cache is per request — 100 concurrent users means 100 separate KV caches in GPU memory. This is the primary memory bottleneck in production serving, and it’s why optimizations matter:

  • Sliding Window Attention (SWA) — Mistral’s approach. Only cache the last W tokens (e.g., 4096). Older entries are dropped as the window slides. Memory stays constant regardless of sequence length.
  • Grouped-Query Attention (GQA) — Multiple query heads share the same KV heads, reducing cache size at the architecture level.
  • PagedAttention — Used by vLLM. Stores the KV cache in non-contiguous memory pages (like OS virtual memory) instead of one big contiguous block. Pages are allocated on demand and freed immediately when a request finishes. This dramatically increases the number of concurrent requests a GPU can serve.

The combination of SWA + GQA + PagedAttention is how Mistral handles 256K context windows without exhausting GPU memory.
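At its core, Sliding Window Attention reduces the cache to a bounded buffer. A sketch with a tiny window (Mistral 7B's actual window is 4096 tokens; the key strings below are placeholders):

```python
from collections import deque

WINDOW = 4  # toy window; Mistral 7B uses 4096

# A bounded deque drops the oldest entry automatically, so cache memory
# stays constant no matter how long generation runs.
k_cache = deque(maxlen=WINDOW)
for step in range(10):
    k_cache.append(f"K_{step}")  # one new key per decode step

print(list(k_cache))  # only the 4 most recent entries survive
```

The trade-off is that tokens older than the window can only influence the output indirectly, through representations propagated across layers.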

Why Decoder-Only Won

The original 2017 Transformer had two halves:

  • Encoder — processes input bidirectionally (each token sees all other tokens, including future ones)
  • Decoder — generates output autoregressively (each token only sees past tokens)

Early models used both: the encoder processes the input, the decoder generates the output (the original Transformer, T5, BART). But the field converged on decoder-only architectures (GPT, Llama, Mistral, Claude). Why?

| Factor | Encoder-Decoder | Decoder-Only |
| --- | --- | --- |
| Simplicity | Two separate components, cross-attention between them | One unified stack |
| Scaling | Must decide how to split parameters between encoder and decoder | All parameters in one stack |
| Pre-training | Often trained with different objectives per component | Single next-token prediction objective |
| Flexibility | Best for seq-to-seq (translation, summarization) | Works for everything (chat, code, reasoning, translation) |
| In-context learning | Weaker — encoder can’t easily learn from examples in the prompt | Strong — examples in the prompt are just more context |

The key insight: decoder-only models treat the input as “just more context.” There’s no architectural distinction between the prompt and the response — it’s all one sequence. This simplicity turned out to scale better and generalize more broadly.

Training vs Inference: Two Different Worlds

| Aspect | Training | Inference |
| --- | --- | --- |
| Goal | Adjust weights to minimize prediction error | Use frozen weights to generate text |
| Data | Trillions of tokens, all at once | One prompt at a time |
| Compute | Thousands of GPUs for weeks | 1-8 GPUs per request |
| Cost | $10M-$100M+ | $0.001-$0.01 per request |
| Gradient | Yes — backpropagation updates every weight | No — weights are frozen |
| Parallelism | Full sequence parallelism (teacher forcing) | Prefill parallel, decode sequential |

During training, the model sees the correct next token (this is called teacher forcing). It processes the entire sequence in parallel and compares its predictions against the ground truth at every position. The error signal flows backward through all layers to update weights.

The training loop in 5 steps:

1. Forward pass:    tokens → Attention → FFN → predicted next token
2. Loss:            compare prediction vs actual next token (cross-entropy)
3. Backpropagation: compute gradient for every weight ("how to adjust?")
4. Optimizer:       adjust weights in the direction that reduces loss
5. Repeat:          billions of times across the training data

The same Attention + FFN layers used during inference are the ones being trained. Training runs both a forward pass (same as inference) and a backward pass (compute gradients and update weights). Inference only runs the forward pass — weights are frozen.
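Step 2 of the loop, the loss, is worth seeing with numbers. Cross-entropy at one position is just the negative log of the probability the model assigned to the correct token; the distribution below is invented:

```python
import math

# One training position: the model predicted this distribution over a
# toy 4-token vocabulary, and the ground-truth next token is "Paris".
vocab = ["Paris", "Lyon", "the", "Berlin"]
predicted = [0.87, 0.05, 0.05, 0.03]
target = "Paris"

# Cross-entropy loss = -log(probability assigned to the correct token).
loss = -math.log(predicted[vocab.index(target)])
print(round(loss, 3))  # 0.139 -- low loss: the model was confident and right

# Had the model assigned "Paris" only 0.10, the loss would be much higher:
print(round(-math.log(0.10), 3))  # 2.303
```

Backpropagation then asks, for every weight, how a small change would have moved this number down, and the optimizer nudges each weight accordingly, billions of times.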

Hyperparameters vs learned weights:

  • Hyperparameters — decided by humans before training: number of layers, model dimension, vocabulary size, learning rate, number of attention heads. These define the architecture. You can’t change them after training — choose wrong and you start over.
  • Learned weights — adjusted during training: all the matrices in attention (Q, K, V, O), FFN (gate, up, down), and embeddings. These encode knowledge.

Think of it like building a library: hyperparameters decide how many shelves and rooms (architecture). Training fills the shelves with books (knowledge). Inference is a visitor looking up information in the frozen library.

Fine-tuning: updating weights after pre-training

Pre-training produces a general-purpose model. Fine-tuning adapts it to a specific domain or task by continuing the training loop on a smaller, curated dataset. There are two approaches:

| Method | What changes | Cost | Quality |
| --- | --- | --- | --- |
| Full fine-tuning | All weights updated | Expensive (full GPU memory) | Best adaptation |
| LoRA | Frozen weights + small trainable adapters | ~5-10% of full cost | Near-equivalent quality |

LoRA (Low-Rank Adaptation) is the clever shortcut: instead of updating billions of FFN weights, you freeze them and train tiny matrices (~1-5% of original parameters) alongside them. The model learns the delta — what to change — without rewriting its entire knowledge base.
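The parameter math is easy to check with toy shapes (the dimensions below are made up; real models use d on the order of 4096 and ranks of 8-64). Note that B starts at zero, so the adapter begins as an exact no-op:

```python
import random

random.seed(0)

def matmul(A, B):
    """Plain-Python matrix multiply, enough for a toy check."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

d, r = 64, 2  # toy: 64-dim weight, rank-2 adapters

# Frozen pre-trained weight W (d x d) is never touched during LoRA training.
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]
# Trainable low-rank adapters: A (d x r) and B (r x d).
A = [[random.gauss(0, 0.01) for _ in range(r)] for _ in range(d)]
B = [[0.0] * d for _ in range(r)]  # zero init: the delta starts at zero

delta = matmul(A, B)  # the learned update; effectively W' = W + A @ B

full_params = d * d
lora_params = d * r + r * d
print(lora_params, full_params, f"{lora_params / full_params:.0%}")
```

Even at this toy scale the adapters are a small fraction of the frozen matrix, and the ratio improves as d grows while r stays small.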

What I Learned

  • Tokenization is the most overlooked bottleneck. Vocabulary size directly affects cost (tokens per word), context efficiency (tokens per document), and multilingual performance. Llama 3’s jump from 32K to 128K vocabulary wasn’t cosmetic — it was a ~30% efficiency gain for non-English languages and code.

  • Prefill vs decode explains most latency confusion. When someone says “the model is slow,” they usually mean TTFT (prefill) is high because the prompt is long. Decode speed is nearly constant regardless of prompt size. Understanding this split is essential for debugging latency and right-sizing infrastructure.

  • The KV cache is the production bottleneck most people miss. It’s per-request, grows with every decode step, and 100 concurrent users means 100 caches in GPU memory. Understanding its lifecycle (built during prefill, used during decode, discarded after) explains why PagedAttention and Sliding Window Attention exist — they’re not optional optimizations, they’re what makes long-context serving feasible.

  • Embeddings for search and embeddings inside LLMs are related but different. Inside an LLM, embeddings are a layer that converts token IDs to vectors. Embedding models are separately trained to produce one vector per text chunk, optimized for similarity search. The key RAG insight: embeddings find the right documents, but you pass the text (not the vectors) to the LLM.

  • Training is inference plus three extra steps. The forward pass is identical — same Attention, same FFN. Training adds loss computation, backpropagation, and weight updates. Understanding this makes fine-tuning and LoRA intuitive: full fine-tuning reruns all three extra steps on all weights, LoRA only runs them on tiny adapter matrices.

  • The model doesn’t plan — it commits. Each token is chosen based only on what came before. There’s no internal draft, no look-ahead, no revision. This is why chain-of-thought prompting works: it forces the model to “show its work” in the token stream, giving later tokens more context to work with. The model can’t think silently — its thoughts must be tokens.

What’s Next

  • Deep dive into attention variants: Multi-Head (MHA), Grouped-Query (GQA), Multi-Query (MQA) — trade-offs between quality and KV cache size
  • Explore how fine-tuning changes the weights: what LoRA actually modifies and where in the pipeline
  • Compare tokenizers across model families: BPE vs SentencePiece vs Unigram, with real token counts for the same text
  • Write a practical guide to prompt engineering grounded in architecture: why system prompts work, why token order matters, why examples help

Alexandre Agius

AWS Solutions Architect

Passionate about AI & Security. Building scalable cloud solutions and helping organizations leverage AWS services to innovate faster. Specialized in Generative AI, serverless architectures, and security best practices.
