TFLOPS: The GPU Metric Every AI Engineer Should Understand
What TFLOPS actually measures, why FP16 matters for LLMs, and why the most important GPU bottleneck for inference isn't compute at all.
Everyone talks about TFLOPS when comparing GPUs for AI workloads. But most people use the number wrong — they compare raw compute without understanding what actually limits their model’s speed. This post explains what TFLOPS measures, when it matters, and when it doesn’t.
The Problem
You’re choosing a GPU for LLM inference. You look at specs: the T4 does 65 TFLOPS FP16, the A100 does 312 TFLOPS FP16. Five times more compute, so the A100 should be five times faster at generating tokens, right?
Wrong. For single-user LLM inference, the A100 is faster — but not 5x faster. The reason is that TFLOPS measures the wrong bottleneck for most inference workloads. Without understanding what’s actually limiting performance, you’ll either overspend on compute you can’t use, or under-provision memory bandwidth that you desperately need.
The Solution
Think of TFLOPS as one half of the performance equation. The other half — memory bandwidth — is what actually determines token generation speed for most LLM inference scenarios.
The key insight: during inference, the GPU must load every model weight from VRAM for every single token it generates. If the weights can’t be moved fast enough, it doesn’t matter how many TFLOPS you have — the compute cores sit idle waiting for data.
How It Works
What TFLOPS Actually Measures
FLOPS stands for Floating-Point Operations Per Second. It counts how many arithmetic operations (additions and multiplications on floating-point numbers) a processor can perform per second.
The scale runs from billions to quintillions:
| Unit | Value | Typical context |
|---|---|---|
| GFLOPS | 10^9 | CPU-level |
| TFLOPS | 10^12 | Single GPU |
| PFLOPS | 10^15 | GPU cluster |
| EFLOPS | 10^18 | Supercomputer |
When NVIDIA says the A100 does 312 TFLOPS, they mean 312 trillion floating-point operations per second — but only at FP16 precision.
Precision Changes Everything
GPUs have wildly different TFLOPS ratings depending on the numerical precision you use. Lower precision means smaller data types, so more values fit through the hardware's datapaths and tensor cores simultaneously.
| GPU | FP32 (32-bit) | FP16 (16-bit) | INT8 (8-bit) |
|---|---|---|---|
| T4 | 8.1 TFLOPS | 65 TFLOPS | 130 TOPS |
| A10G | 31.2 TFLOPS | 125 TFLOPS | 250 TOPS |
| A100 | 19.5 TFLOPS | 312 TFLOPS | 624 TOPS |
| H100 | 67 TFLOPS | 990 TFLOPS | 1,979 TOPS |
The T4 jumps from 8.1 to 65 TFLOPS — an 8x increase — just by switching from FP32 to FP16. This is why running Mistral-7B in `torch.float16` isn’t just about saving memory. It’s about unlocking 8x more compute throughput on the same hardware.
For LLM inference, FP16 is the standard. FP32 is wasteful — the quality difference is negligible, but you’re leaving 80-90% of your GPU’s capability on the table.
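The memory side of that trade-off is easy to quantify. A quick back-of-the-envelope sketch in plain Python, using the 7B parameter count from this post (the helper name is my own):

```python
# Weight storage for a 7B-parameter model at different precisions.
# Bytes per parameter: FP32=4, FP16=2, INT8=1, INT4=0.5.
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}

def model_size_gb(n_params: float, precision: str) -> float:
    """Approximate weight footprint in GB (decimal) for one precision."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for p in ("fp32", "fp16", "int8", "int4"):
    print(f"{p}: {model_size_gb(7e9, p):.1f} GB")
# fp32: 28.0 GB, fp16: 14.0 GB, int8: 7.0 GB, int4: 3.5 GB
```

Halving the precision halves both the VRAM needed to hold the weights and the bytes that must cross the memory bus per token.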
FLOPS for Training vs Inference
The same acronym means different things in different contexts, which causes confusion.
Training talks about total FLOPs (no “per second”) — the total compute budget required to train a model. The rough formula:
Total FLOPs ≈ 6 × N × D
Where N = number of parameters and D = number of training tokens. The factor of 6 breaks down as roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass — computing gradients costs about twice as much as the forward pass.
So Mistral-7B trained on 8 trillion tokens required approximately:
6 × (7 × 10^9) × (8 × 10^12) ≈ 3.4 × 10^23 FLOPs
That’s 340 zettaFLOPs. On a single A100, that would take over 30 years. This is why training happens on clusters of thousands of GPUs.
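That arithmetic is easy to reproduce; a minimal sketch (function names are mine, A100 FP16 peak from the table above, optimistically assuming 100% utilization):

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Total training compute via the 6·N·D rule of thumb."""
    return 6.0 * n_params * n_tokens

def single_gpu_years(total_flops: float, peak_tflops: float) -> float:
    """Wall-clock years to burn through total_flops at sustained peak throughput."""
    seconds = total_flops / (peak_tflops * 1e12)
    return seconds / (365 * 24 * 3600)

total = training_flops(7e9, 8e12)     # ≈ 3.36e23 FLOPs
years = single_gpu_years(total, 312)  # A100 FP16 peak
print(f"{total:.2e} FLOPs ≈ {years:.0f} A100-years")
```

Real training runs sustain well below peak TFLOPS, so the actual wall-clock figure would be even worse.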
Inference cares about FLOPS (per second) — throughput. How fast can the GPU crunch through the matrix multiplications needed to produce the next token? But here’s where the plot twist comes in.
The Memory Bandwidth Bottleneck
During inference, for each token generated, the GPU must:
- Load all model weights from VRAM into compute cores
- Compute the matrix multiplications (attention + FFN)
- Write the result back
For a 7B parameter model in FP16, that’s 14 GB of weights loaded per token. On a T4 with 300 GB/s memory bandwidth, just moving the weights takes:
14 GB / 300 GB/s ≈ 47 ms per token → ~21 tokens/second
But the T4 has 65 TFLOPS of compute. The actual math for one forward pass of a 7B model is roughly 14 billion FLOPs. At 65 TFLOPS, the compute takes:
14 × 10^9 FLOPs / (65 × 10^12 FLOPS) ≈ 0.2 ms
The compute takes 0.2 ms. The memory transfer takes 47 ms. The GPU spends 99.6% of its time waiting for data and 0.4% actually computing. This is what “memory-bandwidth bound” means.
This ratio is captured by the arithmetic intensity — FLOPs per byte transferred. For single-batch LLM inference, the arithmetic intensity is approximately 1 (one operation per byte loaded). Modern GPUs are designed for arithmetic intensities of 100+. The mismatch is severe.
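The two timings above generalize into a tiny roofline-style check. A sketch under the same assumptions (peak specs, FP16 weights, ~2 FLOPs per weight per token; the defaults are the T4 numbers from the text):

```python
def per_token_times_ms(n_params=7e9, bytes_per_param=2,
                       bandwidth_gb_s=300, peak_tflops=65):
    """Batch=1 decoding: time to stream the weights vs. time to do the math."""
    memory_ms = n_params * bytes_per_param / (bandwidth_gb_s * 1e9) * 1e3
    compute_ms = 2 * n_params / (peak_tflops * 1e12) * 1e3  # ~2 FLOPs per weight
    return memory_ms, compute_ms

mem_ms, comp_ms = per_token_times_ms()
print(f"memory {mem_ms:.1f} ms vs compute {comp_ms:.2f} ms "
      f"(ratio ~{mem_ms / comp_ms:.0f}x)")
```

Swapping in another GPU's bandwidth and TFLOPS shows immediately whether batch=1 decoding would be memory- or compute-bound on that card.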
When TFLOPS Actually Matters
TFLOPS becomes the limiting factor when you increase the arithmetic intensity:
| Scenario | Arithmetic intensity | Bottleneck |
|---|---|---|
| Single-user inference (batch=1) | ~1 | Memory bandwidth |
| Batched inference (batch=32+) | ~32 | Shifting toward compute |
| Training (large batches) | 100+ | Compute (TFLOPS) |
| Prefill phase (prompt processing) | High | Compute |
During training, you process large batches — the weights are loaded once and reused across many samples. The arithmetic intensity is high, and TFLOPS becomes the limiting factor.
During batched inference (serving many users simultaneously), you load weights once and compute for multiple requests. More TFLOPS = more throughput.
During single-user inference (generating tokens one at a time), memory bandwidth is king. This is why quantization (INT8, INT4) helps so much — not because INT8 compute is faster, but because smaller weights transfer faster.
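Under the simplification that weights stream once per decoding step while compute scales with the batch (KV-cache traffic ignored), you can estimate where the bottleneck flips. A sketch with the T4 numbers used in this post — on paper the crossover lands near batch ≈ 200, though real kernels run far below peak, so the shift shows up at much smaller batches in practice:

```python
def bottleneck(batch_size, n_params=7e9, bytes_per_param=2,
               bandwidth_gb_s=300, peak_tflops=65):
    """Naive model: weights load once per step, FLOPs scale with the batch."""
    memory_s = n_params * bytes_per_param / (bandwidth_gb_s * 1e9)
    compute_s = batch_size * 2 * n_params / (peak_tflops * 1e12)
    return "compute-bound" if compute_s > memory_s else "memory-bound"

for b in (1, 32, 256):
    print(f"batch={b}: {bottleneck(b)}")
# batch=1: memory-bound ... batch=256: compute-bound
```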
Quantization: Attacking the Real Bottleneck
Quantizing Mistral-7B from FP16 to INT8 cuts the model size from 14 GB to 7 GB. On the same T4:
7 GB / 300 GB/s ≈ 23 ms per token → ~43 tokens/second
You just doubled your inference speed — not by buying a faster GPU, but by reducing the amount of data that needs to move through the memory bus. The compute was never the bottleneck.
Going further to INT4 (3.5 GB): ~86 tokens/second. Each halving of precision roughly doubles single-user throughput, until you hit a quality floor where the model’s outputs degrade.
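The pattern generalizes: the bandwidth-imposed ceiling on tokens/second is just bandwidth divided by model size. A sketch with this post's numbers (300 GB/s T4, 7B parameters):

```python
def max_tokens_per_second(n_params, bytes_per_param, bandwidth_gb_s):
    """Upper bound set by weight streaming alone (ignores compute and KV cache)."""
    return bandwidth_gb_s * 1e9 / (n_params * bytes_per_param)

for name, bpp in (("fp16", 2.0), ("int8", 1.0), ("int4", 0.5)):
    print(f"{name}: ~{max_tokens_per_second(7e9, bpp, 300):.0f} tokens/s")
# fp16: ~21, int8: ~43, int4: ~86
```

Measured throughput will sit below these ceilings, but the ratios between precisions hold.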
Choosing the Right GPU
Given all this, here’s a practical framework:
| Use case | What to optimize | GPU recommendation |
|---|---|---|
| Dev/prototyping | Cost | T4 (16 GB, cheap) |
| Single-user inference | Memory bandwidth + VRAM | L4 or A10G |
| High-throughput serving | TFLOPS + VRAM | A100 or H100 |
| Training | TFLOPS + interconnect | H100 cluster |
| Large models (70B+) | VRAM capacity | Multi-GPU (A100 80GB) |
Don’t just compare TFLOPS. Compare the memory bandwidth, VRAM capacity, and cost per token for your specific workload.
What I Learned
- TFLOPS is necessary but not sufficient — for single-user LLM inference, memory bandwidth is the actual bottleneck, not raw compute. A GPU with higher bandwidth but lower TFLOPS can outperform one with more compute.
- Precision is a performance lever, not just a memory optimization — running in FP16 vs FP32 isn’t about saving VRAM. It’s about unlocking 8x more TFLOPS on the same silicon, and it halves the memory bandwidth pressure too.
- Quantization attacks the right bottleneck — INT8 and INT4 quantization work for inference not because integer math is faster, but because smaller weights move through the memory bus faster. The compute cores were never the limiting factor.
Do It Yourself
Key takeaways:
- For single-user inference, memory bandwidth > compute. The GPU spends 99%+ of its time waiting for weights to transfer, not computing. This is why, at batch=1, a GPU’s token rate tracks its memory bandwidth rather than its TFLOPS: a T4 (300 GB/s, 65 TFLOPS) and an A100 (~2,000 GB/s, 312 TFLOPS) differ by roughly their bandwidth ratio, and the A100’s extra TFLOPS sit idle.
- Quantization attacks the right bottleneck. INT8 doubles throughput not because integer math is 2x faster, but because 7GB of weights transfer twice as fast as 14GB through the same 300 GB/s pipe. The compute cores were never the limiting factor.
- Precision is a performance dial, not just memory. Running in FP16 vs FP32 unlocks 8x more TFLOPS on the same silicon AND halves memory bandwidth pressure. If you’re running FP32 for LLM inference, you’re leaving 80-90% of your GPU capability unused.
Try it now:
- Profile memory bandwidth on your GPU: load Mistral-7B in FP16 and run `nvidia-smi dmon -s u` during inference. The `sm` column shows compute utilization and the `mem` column shows memory-controller utilization; if `sm` is low while `mem` is high, you’re memory-bound. The PyTorch profiler (see the PyTorch docs) gives a finer-grained view.
- Benchmark quantization impact: use the HuggingFace `bitsandbytes` integration to load Mistral-7B in FP16, INT8, and INT4. Measure tokens/second for each precision level, then calculate the theoretical bandwidth limit (model size / bandwidth) and compare it to actual throughput. Code examples are in the bitsandbytes docs.
- Test different batch sizes: run the same model with batch sizes 1, 4, 16, and 32 using HuggingFace’s `generate()` function, and watch the bottleneck shift from memory bandwidth (low `sm` utilization at batch=1) to compute (high utilization at batch=32+). The `nvidia-smi dmon` output shows the transition clearly.