LLM Inference Demystified: PagedAttention, KV Cache, MoE & Continuous Batching
The 5 key concepts every cloud architect should know about LLM serving: PagedAttention, KV cache mechanics, continuous batching, MoE trade-offs, and real production numbers.
Deploying LLMs in production requires understanding five concepts that rarely get explained together: KV cache, PagedAttention, continuous batching, Mixture of Experts, and the metrics that tell you if your serving setup is actually performing. This post covers all five with real numbers and practical intuition.
The Problem
You have a model, say Mixtral 8x7B or Mistral Large, and you need to serve it to users. You spin up a GPU instance, load the model, and send a request. It works. Then you send 10 concurrent requests and everything grinds to a halt. Or worse, you run out of GPU memory on a single long conversation.
The issues:
- Memory waste: The KV cache for each request pre-allocates a massive contiguous memory block, most of which goes unused. With naive allocation, a single H100 (80GB) might only serve 4-8 concurrent sequences.
- Throughput bottleneck: Static batching waits for the slowest sequence in the batch to finish before starting new requests. A 500-token response blocks a 20-token response.
- Hidden model size: Mixtral 8x7B sounds smaller than a 70B model, but it loads all ~47B total parameters into memory (the experts share attention layers, so the total is less than a naive 8 × 7B = 56B). Only ~13B are active per token. The name is misleading if you're sizing GPUs.
- Wrong metrics: You measure latency in milliseconds, but your users experience time-to-first-token (TTFT) and tokens-per-second (TPS). A system with great average latency can still feel slow.
These aren't edge cases. They're the default failure modes when deploying LLMs beyond prototyping.
The Solution
Five concepts, each solving a specific piece of the inference puzzle:
- KV Cache: avoid recomputing attention over the full sequence at every token
- PagedAttention: manage KV cache memory like an OS manages virtual memory
- Continuous Batching: let requests join and leave the batch dynamically
- Mixture of Experts (MoE): activate only a fraction of the model per token
- Production Metrics: measure what actually matters for the user experience
Together, they form the inference pipeline that production serving engines like vLLM implement.
How It Works
KV Cache: Why Inference Is Memory-Bound
The transformer's self-attention mechanism computes three vectors for each token: Query (Q), Key (K), and Value (V). During generation, every new token needs to attend to all previous tokens in the sequence. Without caching, you'd recompute K and V for the entire history at every step.
The KV cache stores previously computed K and V vectors so theyβre only calculated once:
Step 1: "The"         → compute K₁, V₁, store in cache
Step 2: "The cat"     → compute K₂, V₂, read K₁, V₁ from cache
Step 3: "The cat sat" → compute K₃, V₃, read K₁, V₁ and K₂, V₂ from cache
...
Step N: compute Kₙ, Vₙ, read all previous K, V from cache
This turns a quadratic recomputation into a linear one. The trade-off: memory. For a model like Llama 2 70B with a 4096-token context, the KV cache for a single sequence runs to roughly 1-2GB of GPU memory. Scale to 32 concurrent sequences and you need 40-60GB just for KV cache, a large share of an H100's 80GB.
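The caching loop can be sketched in a few lines of NumPy. This is a single attention head with illustrative shapes, not any framework's real API; `attend` and the `q = k = v = x` line stand in for the actual projection and attention kernels.

```python
import numpy as np

def attend(q, K, V):
    # scaled dot-product attention for one query over the cached history
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

d = 64
rng = np.random.default_rng(0)
K_cache, V_cache = [], []

for step in range(5):                  # one decode step per generated token
    x = rng.standard_normal(d)         # hidden state of the newest token
    q = k = v = x                      # stand-ins for the Q/K/V projections
    K_cache.append(k)                  # K and V are computed once...
    V_cache.append(v)
    out = attend(q, np.stack(K_cache), np.stack(V_cache))  # ...and reread here

print(len(K_cache))                    # 5 cached entries, nothing recomputed
```

Each step computes K and V only for the new token; the history comes from the cache, which is exactly why the cache, not the compute, becomes the scaling limit.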
The KV cache size per token:
KV size = 2 × num_layers × num_kv_heads × head_dim × precision_bytes
The leading 2 covers the separate K and V tensors. Models with grouped-query attention (GQA) have num_kv_heads smaller than the attention head count, which is what keeps their caches compact.
| Model | KV Cache per Token | KV Cache at 4K Context | At 32K Context |
|---|---|---|---|
| Mistral 7B (32 layers, GQA) | ~0.13 MB | ~0.5 GB | ~4 GB |
| Llama 2 70B (80 layers, GQA) | ~0.31 MB | ~1.3 GB | ~10 GB |
| Mixtral 8x7B (32 layers, GQA) | ~0.13 MB | ~0.5 GB | ~4 GB |
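Plugging the formula into code makes these numbers easy to reproduce. The layer and head counts below are the published Mistral 7B configuration; `num_kv_heads` is the quantity that GQA shrinks.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, precision_bytes=2):
    # the leading 2 covers the separate K and V tensors
    return 2 * num_layers * num_kv_heads * head_dim * precision_bytes

# Mistral 7B: 32 layers, 8 KV heads (GQA), head_dim 128, FP16
per_token = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)
print(per_token / 2**20)             # 0.125 MB per token
print(per_token * 4096 / 2**30)      # 0.5 GB at a 4K context
```

Swap `num_kv_heads=8` for the full 32 attention heads and the per-token cost quadruples, which is the whole point of GQA for serving.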
The problem isn't the cache itself, it's how it's allocated.
PagedAttention: Virtual Memory for GPU
Traditional KV cache allocation works like old-school C malloc: each sequence gets a pre-allocated contiguous block sized for the maximum possible sequence length. If your max length is 4096 tokens but the sequence only uses 200, you waste 95% of that allocation. And since the blocks must be contiguous, you get memory fragmentation: free space exists but isn't usable.
PagedAttention (introduced by the vLLM paper) applies the same insight that revolutionized operating systems: virtual memory with paging.
Instead of one contiguous block per sequence:
- GPU memory is divided into fixed-size blocks (e.g., 16 tokens each)
- Each sequence gets a block table mapping logical positions to physical blocks
- Blocks are allocated on demand: a new block is added only when the previous one fills up
- Blocks donβt need to be contiguous in physical memory
- Completed sequences free their blocks immediately for reuse
Traditional allocation:
Sequence A: [████████████████░░░░░░░░░░░░░░░░] (50% wasted)
Sequence B: [████████░░░░░░░░░░░░░░░░░░░░░░░░] (75% wasted)
Sequence C: [cannot allocate: no contiguous block available]
PagedAttention:
Block pool: [A₀][A₁][B₀][A₂][B₁][free][free][free]...
Sequence A: table → blocks 0,1,3 (exact fit)
Sequence B: table → blocks 2,4 (exact fit)
Sequence C: table → blocks 5,6 (fits in free blocks!)
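The block-table bookkeeping can be sketched in plain Python. This is a toy allocator to show the mechanism, not vLLM's implementation; the names `BlockAllocator` and `Sequence` are invented for illustration.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block

class BlockAllocator:
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))    # physical block ids
    def alloc(self):
        return self.free.pop(0)                # any free block will do
    def release(self, blocks):
        self.free.extend(blocks)               # reusable immediately

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []                  # logical position -> physical block
        self.num_tokens = 0
    def append_token(self):
        if self.num_tokens % BLOCK_SIZE == 0:  # previous block just filled up
            self.block_table.append(self.allocator.alloc())
        self.num_tokens += 1

alloc = BlockAllocator(num_blocks=8)
a, b = Sequence(alloc), Sequence(alloc)
for i in range(40):                            # interleave two growing sequences
    a.append_token()
    if i < 20:
        b.append_token()

print(a.block_table, b.block_table)            # [0, 2, 4] [1, 3]
alloc.release(b.block_table)                   # b finishes; blocks return to pool
```

Note that sequence A ends up on physical blocks 0, 2, 4: non-contiguous, exact fit, and freed blocks go straight back into the pool for the next request.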
The results are dramatic:
| Metric | Traditional | PagedAttention |
|---|---|---|
| Memory waste | 60-80% | <4% |
| Concurrent sequences (H100) | 4-8 | 16-32+ |
| Throughput | Baseline | 2-4x higher |
This is why vLLM became the de facto serving engine. PagedAttention isn't an optimization; it's a paradigm shift in how GPU memory is managed for inference.
Continuous Batching: No More Waiting
Static batching groups N requests together and processes them as a batch. The problem: all requests in the batch must wait for the longest sequence to complete before any new request can start.
Static batching:
Time →
Req A (20 tokens): [████████]░░░░░░░░░░░░░░░░ ← done but waiting
Req B (50 tokens): [█████████████████████████] ← batch completes here
Req C (queued):    ░░░░░░░░░░░░░░░░░░░░░░░░░░ [starts after B finishes]
Continuous batching:
Time →
Req A (20 tokens): [████████]
Req B (50 tokens): [█████████████████████████]
Req C:                       [███████████████] ← joins immediately when A exits
Continuous batching (also called iteration-level scheduling) checks after every decoding step: has any sequence finished? If yes, evict it and pull the next request from the queue. This means:
- Short responses don't wait for long ones
- The GPU stays saturated, with no idle cycles between batches
- TTFT for queued requests drops dramatically
In practice, continuous batching improves throughput by 2-3x over static batching with the same hardware, because the GPU never sits idle waiting for a slow sequence to finish.
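A toy scheduler makes the mechanism concrete: admission happens at every decode step, not between batches. This sketch assumes every running sequence emits exactly one token per step, which glosses over prefill but captures the scheduling logic.

```python
from collections import deque

def continuous_batching(requests, max_batch=2):
    """requests: (name, output_tokens) pairs; returns (finish_order, steps)."""
    queue, running, done, step = deque(requests), [], [], 0
    while queue or running:
        while queue and len(running) < max_batch:   # admit work at every step,
            name, need = queue.popleft()            # not only between batches
            running.append([name, need, 0])
        step += 1
        for seq in running:                         # one decode step for everyone
            seq[2] += 1
        for seq in [s for s in running if s[2] >= s[1]]:
            running.remove(seq)                     # evict the moment it finishes
            done.append(seq[0])
    return done, step

order, steps = continuous_batching([("A", 20), ("B", 50), ("C", 30)])
print(order, steps)   # ['A', 'B', 'C'] 50
```

Request C slips into A's vacated slot at step 21 and the whole workload finishes in 50 steps; static batching would hold C until B completed and need 80.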
Mixture of Experts: Sparse but Heavy
MoE models like Mixtral 8x7B use a clever trick: instead of one massive feed-forward network (FFN), they have multiple smaller "expert" FFNs. A learned router selects the top-K experts (usually 2) for each token.
Standard transformer (dense):
Token → Attention → FFN (all parameters active) → Output
MoE transformer (sparse):
Token → Attention → Router → Expert 3 + Expert 7 (2 of 8 active) → Output
The appeal: Mixtral 8x7B has the quality of a ~40-50B dense model but only activates ~13B parameters per token (two ~7B experts, minus shared layers). That means lower inference compute cost per token.
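The router-plus-top-2 path can be sketched with NumPy. Shapes, the `tanh` stand-in for the expert FFN, and all names are illustrative; real MoE layers use gated MLP experts and load-balancing losses, which this sketch omits.

```python
import numpy as np

def moe_ffn(x, router_w, experts, top_k=2):
    logits = router_w @ x                         # router: one score per expert
    top = np.argsort(logits)[-top_k:]             # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                          # softmax over selected experts
    # only the chosen expert FFNs execute; the other six sit idle in VRAM
    out = sum(g * np.tanh(experts[i] @ x) for g, i in zip(gates, top))
    return out, top

rng = np.random.default_rng(0)
d_model, n_experts = 16, 8
router_w = rng.standard_normal((n_experts, d_model))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

out, chosen = moe_ffn(rng.standard_normal(d_model), router_w, experts)
print(len(chosen), "of", n_experts, "experts ran for this token")
```

The key line for infrastructure planning is the comment: all eight weight matrices exist in memory, and which two are multiplied is only known at runtime, token by token.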
The catch that trips up every architect sizing GPUs:
| What You Might Expect | Reality |
|---|---|
| "8x7B = smaller than 70B" | ~47B total parameters loaded in GPU memory |
| "Only 2 experts active = low memory" | All 8 experts must be resident; the router decides at runtime |
| "Cheaper than dense 70B" | Cheaper per token (less compute), but similar memory footprint |
MoE models are compute-efficient (fewer FLOPs per token) but not memory-efficient (full model in VRAM). When sizing infrastructure:
- GPU memory: plan for the full parameter count (~47B for Mixtral 8x7B, so ~94GB in FP16, ~47GB in INT8)
- Compute: only ~25-30% of parameters are active per token, so inference is faster than an equivalently sized dense model
- Bandwidth: expert selection requires reading different weights per token, which can bottleneck on memory bandwidth
For serving, this means MoE models benefit heavily from quantization (INT8 or INT4) to fit in a single GPU, and from tensor parallelism across multiple GPUs when they donβt fit.
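The sizing arithmetic is simple enough to keep in a helper. This assumes Mixtral's published ~47B total parameter count and counts weights only; real deployments need headroom on top for KV cache, activations, and the runtime itself.

```python
def weights_vram_gb(params_billions, bytes_per_param):
    # weights only: add headroom for KV cache, activations, and the runtime
    return params_billions * bytes_per_param

mixtral_total = 47      # every expert resident in VRAM, active or not
for precision, nbytes in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(precision, weights_vram_gb(mixtral_total, nbytes), "GB")
```

FP16 lands well beyond a single 80GB card, which is why INT8/INT4 quantization or tensor parallelism is the default deployment story for MoE models.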
Production Metrics: Measuring What Matters
The standard metrics, latency and throughput, don't capture the user experience of LLM inference. Here's what to actually measure:
| Metric | What It Measures | Why It Matters |
|---|---|---|
| TTFT (Time to First Token) | Delay before the first token appears | User-perceived responsiveness. Includes queue wait + prefill time. |
| TPS (Tokens per Second) | Output generation speed | Reading speed: too slow and users notice the "drip" |
| Throughput (Requests per Second) | System-level capacity | How many concurrent users the system handles |
| P50 / P95 / P99 latency | Latency distribution | P50 is typical, P99 is the tail that causes complaints |
| TPOT (Time per Output Token) | Time between consecutive tokens | Inverse of TPS: the generation cadence |
The relationship between these:
Total latency = TTFT + (output_tokens Γ TPOT)
TPS = 1 / TPOT
TTFT is dominated by the prefill phase: processing the entire input prompt in one forward pass. Long prompts (RAG with 10K context) have higher TTFT regardless of output length. This is why chunked prefill exists: split the input processing across multiple steps to avoid one massive initial delay.
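Given per-token timestamps from a streaming response, all three metrics fall out directly. A minimal sketch over a hypothetical trace (first token at 400ms, then one token every 25ms):

```python
def stream_metrics(request_start, token_times):
    """token_times: wall-clock arrival time of each streamed token."""
    ttft = token_times[0] - request_start
    gaps = [b - a for a, b in zip(token_times, token_times[1:])]
    tpot = sum(gaps) / len(gaps)          # mean time per output token
    return {"ttft_s": ttft, "tpot_s": tpot, "tps": 1.0 / tpot}

times = [0.4 + 0.025 * i for i in range(11)]   # hypothetical token trace
m = stream_metrics(0.0, times)
print(round(m["ttft_s"], 3), round(m["tps"], 1))   # 0.4 40.0
```

That trace would clear the chatbot targets below (TTFT under 500ms, over 30 TPS) even though the full response takes several seconds to stream.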
Real-world targets:
| Use Case | TTFT Target | TPS Target |
|---|---|---|
| Chatbot (interactive) | < 500ms | > 30 TPS |
| Code completion | < 200ms | > 50 TPS |
| Batch processing | Donβt care | Maximize throughput |
| Streaming API | < 1s | > 15 TPS |
API vs Self-Hosted: The Break-Even
When does self-hosting make financial sense over API calls?
Using Mistral Large as an example:
| | API (Mistral) | Self-Hosted (H100) |
|---|---|---|
| Cost per 1M input tokens | $0.50 | ~$0.15 (amortized) |
| Cost per 1M output tokens | $1.50 | ~$0.45 (amortized) |
| Monthly cost at 1M tokens/month | ~$2 | ~$3,500 (H100 instance) |
| Monthly cost at 30M tokens/month | ~$45 | ~$3,500 |
| Monthly cost at 100M tokens/month | ~$150 | ~$3,500 |
At these prices, the break-even is simply the fixed monthly GPU cost divided by the blended API rate: $3,500 ÷ ~$2 per 1M tokens ≈ 1.75B tokens/month for Mistral Large. Below that, API wins on cost. Above that, self-hosted wins, and the gap widens with volume.
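The break-even arithmetic is a one-liner: a fixed monthly GPU cost against a per-token API rate. A sketch using the numbers from the table above, ignoring utilization limits and ops costs:

```python
def break_even_millions(gpu_monthly_usd, api_usd_per_1m, self_usd_per_1m=0.0):
    # volume (millions of tokens/month) where self-hosting matches the API bill
    return gpu_monthly_usd / (api_usd_per_1m - self_usd_per_1m)

# ~$3,500/month H100 instance vs a blended ~$2 per 1M API tokens
print(break_even_millions(3500, 2.0))   # 1750.0, i.e. ~1.75B tokens/month
```

Feeding in your own blended rate (weighted by your input/output mix) gives a threshold you can defend in a capacity review, rather than a rule of thumb.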
But cost isnβt the only factor:
| Factor | API | Self-Hosted |
|---|---|---|
| Data sovereignty | Data leaves your infra | Full control |
| Latency | Network hop + queue | Direct GPU access |
| Customization | Limited (temperature, top-p) | Full control (quantization, batching config) |
| Ops burden | Zero | Significant (GPU monitoring, model updates, scaling) |
| Scaling | Instant | Provision + deploy |
The practical rule: Start with API. Move to self-hosted when your monthly API bill consistently approaches the cost of the GPU infrastructure (on the order of a billion-plus tokens/month at these prices) AND you have a team that can operate it. Data sovereignty requirements can override the cost threshold.
What I Learned
- KV cache is the memory bottleneck, not model weights. For long-context inference, KV cache can consume more GPU memory than the model itself. PagedAttention doesn't just optimize; it fundamentally changes how many concurrent users a single GPU can serve (4-8 to 16-32+). This is why vLLM dominates production serving.
- MoE model names are misleading for infrastructure planning. Mixtral "8x7B" loads ~47B parameters into GPU memory, not 7B. The sparsity saves compute per token but doesn't save memory. Every architect I've seen sizes MoE GPUs wrong on the first attempt. Plan for the full parameter count, then appreciate the compute savings.
- Continuous batching is the single highest-impact optimization. PagedAttention gets the attention (no pun intended), but continuous batching's 2-3x throughput improvement over static batching is what makes the economics work. It's the difference between needing 3 GPUs and 1 GPU for the same workload.
- TTFT and TPS matter more than average latency. Users perceive a 200ms TTFT with 30 TPS streaming as fast, even if total latency is 5 seconds for a 150-token response. A system with 500ms average latency but 3-second P99 TTFT feels broken. Measure what the user experiences, not what the system reports.
- The API vs self-hosted break-even is about sustained volume, not per-token prices. Self-hosting only wins once the monthly API bill exceeds the fixed GPU cost, and the operational burden is real: GPU monitoring, model versioning, quantization tuning, scaling policies. The right move for most teams is API until proven otherwise.
What's Next
- Benchmark vLLM vs TGI vs Triton on the same model (Mistral 7B) with identical hardware: measure TTFT, TPS, and max concurrent sequences
- Deep dive into quantization techniques (INT8, INT4, GPTQ, AWQ): how much quality do you actually lose?
- Explore speculative decoding: using a small draft model to predict multiple tokens, then verifying with the large model
- Write a practical guide to GPU selection for LLM inference: H100 vs A100 vs L4 vs Inferentia2, with cost/performance matrices
Related Posts
TFLOPS: The GPU Metric Every AI Engineer Should Understand
What TFLOPS actually measures, why FP16 matters for LLMs, and why the most important GPU bottleneck for inference isn't compute at all.
Getting Hands-On with Mistral AI: From API to Self-Hosted in One Afternoon
A practical walkthrough of two paths to working with Mistral: the managed API for fast prototyping and self-hosted deployment for full control, with real code covering prompting, model selection, function calling, RAG, and INT8 quantization.
Transformer Anatomy: Attention + FFN Demystified
A deep dive into the Transformer architecture: how attention connects tokens and why the Feed-Forward Network is the real brain of the model. Plus the key to understanding Mixture of Experts (MoE).
