TFLOPS: The GPU Metric Every AI Engineer Should Understand
What TFLOPS actually measures, why FP16 matters for LLMs, and why the most important GPU bottleneck for inference isn't compute at all.
Everyone talks about TFLOPS when comparing GPUs for AI workloads. But most people use the number wrong — they compare raw compute without understanding what actually limits their model’s speed. This post explains what TFLOPS measures, when it matters, and when it doesn’t.
The Problem
You’re choosing a GPU for LLM inference. You look at specs: the T4 does 65 TFLOPS FP16, the A100 does 312 TFLOPS FP16. Five times more compute, so the A100 should be five times faster at generating tokens, right?
Wrong. For single-user LLM inference, the A100 is faster — but not 5x faster. The reason is that TFLOPS measures the wrong bottleneck for most inference workloads. Without understanding what’s actually limiting performance, you’ll either overspend on compute you can’t use, or under-provision memory bandwidth that you desperately need.
The Solution
Think of TFLOPS as one half of the performance equation. The other half — memory bandwidth — is what actually determines token generation speed for most LLM inference scenarios.
The key insight: during inference, the GPU must load every model weight from VRAM for every single token it generates. If the weights can’t be moved fast enough, it doesn’t matter how many TFLOPS you have — the compute cores sit idle waiting for data.
How It Works
What TFLOPS Actually Measures
FLOPS stands for Floating-Point Operations Per Second. It counts how many math operations (additions, multiplications on decimal numbers) a processor can perform per second.
The scale runs from billions to quintillions:
| Unit | Value | Typical context |
|---|---|---|
| GFLOPS | 10^9 | CPU-level |
| TFLOPS | 10^12 | Single GPU |
| PFLOPS | 10^15 | GPU cluster |
| EFLOPS | 10^18 | Supercomputer |
When NVIDIA says the A100 does 312 TFLOPS, they mean 312 trillion floating-point operations per second — but only at FP16 precision.
Precision Changes Everything
GPUs have wildly different TFLOPS ratings depending on the numerical precision you use. Lower precision means fewer bits per number, so the hardware can push more operations through its datapaths simultaneously.
| GPU | FP32 (32-bit) | FP16 (16-bit) | INT8 (8-bit) |
|---|---|---|---|
| T4 | 8.1 TFLOPS | 65 TFLOPS | 130 TOPS |
| A10G | 31.2 TFLOPS | 125 TFLOPS | 250 TOPS |
| A100 | 19.5 TFLOPS | 312 TFLOPS | 624 TOPS |
| H100 | 67 TFLOPS | 990 TFLOPS | 1,979 TOPS |
The T4 jumps from 8.1 to 65 TFLOPS — an 8x increase — just by switching from FP32 to FP16. This is why running Mistral-7B in `torch.float16` isn’t just about saving memory. It’s about unlocking 8x more compute throughput on the same hardware.
For LLM inference, FP16 is the standard. FP32 is wasteful — the quality difference is negligible, but you’re leaving 80-90% of your GPU’s capability on the table.
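To make the tradeoff concrete, here’s a back-of-envelope sketch in plain Python using the T4 figures from the table above (parameter count and throughput numbers are the ones quoted in this post):

```python
# Back-of-envelope: what switching FP32 -> FP16 buys on a T4.
params = 7e9                       # Mistral-7B parameter count

fp32_bytes = params * 4            # 4 bytes per weight -> 28 GB
fp16_bytes = params * 2            # 2 bytes per weight -> 14 GB

t4_fp32_tflops = 8.1               # T4 peak FP32 throughput
t4_fp16_tflops = 65.0              # T4 peak FP16 (tensor core) throughput

print(f"FP32 weights: {fp32_bytes / 1e9:.0f} GB")                  # 28 GB
print(f"FP16 weights: {fp16_bytes / 1e9:.0f} GB")                  # 14 GB
print(f"Compute speedup: {t4_fp16_tflops / t4_fp32_tflops:.1f}x")  # ~8.0x
```

Half the bytes to move and 8x the peak compute, from a one-line dtype change.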
FLOPS for Training vs Inference
The same acronym means different things in different contexts, which causes confusion.
Training cares about total FLOPs (no “per second”) — the total compute budget required to train a model. The rough formula:
Total FLOPs ≈ 6 × N × D
Where N = number of parameters and D = number of training tokens. The factor of 6 breaks down as 2 FLOPs per parameter per token for the forward pass plus 4 for the backward pass — computing gradients costs roughly twice the forward pass.
So Mistral-7B trained on 8 trillion tokens required approximately:
6 × 7 × 10^9 × 8 × 10^12 ≈ 3.4 × 10^23 FLOPs
That’s 340 zettaFLOPs. On a single A100, that would take over 30 years. This is why training happens on clusters of thousands of GPUs.
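This arithmetic is worth scripting so you can swap in any model and GPU. A minimal sketch using the same 6 × N × D rule of thumb and the A100 peak from the table:

```python
# Training compute budget via the 6 * N * D rule of thumb.
N = 7e9                            # parameters (Mistral-7B)
D = 8e12                           # training tokens

total_flops = 6 * N * D            # ~3.36e23 FLOPs

a100_flops = 312e12                # A100 peak FP16 throughput
seconds = total_flops / a100_flops
years = seconds / (365 * 24 * 3600)

print(f"{total_flops:.2e} FLOPs -> ~{years:.0f} years on one A100 at peak")
```

Roughly 34 years at sustained peak throughput — and real utilization is well below peak, so clusters are the only option.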
Inference cares about FLOPS (per second) — throughput. How fast can the GPU crunch through the matrix multiplications needed to produce the next token? But here’s where the plot twist comes in.
The Memory Bandwidth Bottleneck
During inference, for each token generated, the GPU must:
- Load all model weights from VRAM into compute cores
- Compute the matrix multiplications (attention + FFN)
- Write the result back
For a 7B parameter model in FP16, that’s 14 GB of weights loaded per token. On a T4 with 300 GB/s memory bandwidth, just moving the weights takes:
14 GB / 300 GB/s ≈ 47 ms per token → ~21 tokens/second
But the T4 has 65 TFLOPS of compute. The actual math for one forward pass of a 7B model is roughly 14 billion FLOPs. At 65 TFLOPS, the compute takes:
(14 × 10^9) / (65 × 10^12) ≈ 0.2 ms
The compute takes 0.2 ms. The memory transfer takes 47 ms. The GPU spends 99.6% of its time waiting for data and 0.4% actually computing. This is what “memory-bandwidth bound” means.
This ratio is captured by the arithmetic intensity — FLOPs per byte transferred. For single-batch LLM inference, the arithmetic intensity is approximately 1 (one operation per byte loaded). Modern GPUs are designed for arithmetic intensities of 100+. The mismatch is severe.
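This roofline-style comparison is worth running for any GPU/model pair you’re evaluating. A sketch using the T4 numbers above:

```python
# Memory time vs compute time for one decoded token (batch=1, T4).
weights_gb = 14.0                  # 7B params in FP16
bandwidth_gbs = 300.0              # T4 memory bandwidth
flops_per_token = 14e9             # ~2 FLOPs per parameter per token
peak_flops = 65e12                 # T4 peak FP16 throughput

t_memory = weights_gb / bandwidth_gbs       # ~47 ms
t_compute = flops_per_token / peak_flops    # ~0.2 ms

print(f"memory: {t_memory * 1e3:.1f} ms, compute: {t_compute * 1e3:.2f} ms")
print(f"bandwidth-bound ceiling: ~{1 / t_memory:.0f} tokens/s")
```

Whichever time is larger is your bottleneck — here memory wins by two orders of magnitude.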
When TFLOPS Actually Matters
TFLOPS becomes the limiting factor when you increase the arithmetic intensity:
| Scenario | Arithmetic intensity | Bottleneck |
|---|---|---|
| Single-user inference (batch=1) | ~1 | Memory bandwidth |
| Batched inference (batch=32+) | ~32 | Shifting toward compute |
| Training (large batches) | 100+ | Compute (TFLOPS) |
| Prefill phase (prompt processing) | High | Compute |
During training, you process large batches — the weights are loaded once and reused across many samples. The arithmetic intensity is high, and TFLOPS becomes the limiting factor.
During batched inference (serving many users simultaneously), you load weights once and compute for multiple requests. More TFLOPS = more throughput.
During single-user inference (generating tokens one at a time), memory bandwidth is king. This is why quantization (INT8, INT4) helps so much — not because INT8 compute is faster, but because smaller weights transfer faster.
Quantization: Attacking the Real Bottleneck
Quantizing Mistral-7B from FP16 to INT8 cuts the model size from 14 GB to 7 GB. On the same T4:
7 GB / 300 GB/s ≈ 23 ms per token → ~43 tokens/second
You just doubled your inference speed — not by buying a faster GPU, but by reducing the amount of data that needs to move through the memory bus. The compute was never the bottleneck.
Going further to INT4 (3.5 GB): ~86 tokens/second. Each halving of precision roughly doubles single-user throughput, until you hit a quality floor where the model’s outputs degrade.
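The same bandwidth-bound estimate, swept across precisions (same T4, 300 GB/s — a theoretical ceiling, ignoring quantization overhead and the KV cache):

```python
# Tokens/s ceiling at each precision, assuming memory bandwidth is the
# only limit (it is, for batch-1 decoding on a T4).
bandwidth_gbs = 300.0
params = 7e9

ceilings = {}
for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    weights_gb = params * bytes_per_param / 1e9
    ceilings[name] = bandwidth_gbs / weights_gb
    print(f"{name}: {weights_gb:4.1f} GB -> ~{ceilings[name]:.0f} tokens/s")
```

Real-world numbers land below these ceilings (dequantization work, KV cache traffic), but the scaling trend holds.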
Choosing the Right GPU
Given all this, here’s a practical framework:
| Use case | What to optimize | GPU recommendation |
|---|---|---|
| Dev/prototyping | Cost | T4 (16 GB, cheap) |
| Single-user inference | Memory bandwidth + VRAM | L4 or A10G |
| High-throughput serving | TFLOPS + VRAM | A100 or H100 |
| Training | TFLOPS + interconnect | H100 cluster |
| Large models (70B+) | VRAM capacity | Multi-GPU (A100 80GB) |
Don’t just compare TFLOPS. Compare the memory bandwidth, VRAM capacity, and cost per token for your specific workload.
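One way to operationalize that comparison is cost per million generated tokens. The sketch below uses placeholder throughput and pricing — none of these figures are real quotes; measure your own tokens/s and plug in your provider’s actual hourly rates:

```python
# Cost per million generated tokens. All figures are illustrative
# placeholders -- substitute your measured tokens/s and real pricing.
gpus = {
    # name: (tokens/s on YOUR workload, $/hour from YOUR provider)
    "T4":   (21, 0.35),
    "A10G": (43, 1.00),
}

cost_per_mtok = {}
for name, (tok_s, usd_hr) in gpus.items():
    cost_per_mtok[name] = usd_hr / (tok_s * 3600) * 1e6
    print(f"{name}: ${cost_per_mtok[name]:.2f} per million tokens")
```

A cheaper, slower GPU can win on cost per token even when it loses on every raw spec.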
What I Learned
- TFLOPS is necessary but not sufficient — for single-user LLM inference, memory bandwidth is the actual bottleneck, not raw compute. A GPU with higher bandwidth but lower TFLOPS can outperform one with more compute.
- Precision is a performance lever, not just a memory optimization — running in FP16 vs FP32 isn’t about saving VRAM. It’s about unlocking 8x more TFLOPS on the same silicon, and it halves the memory bandwidth pressure too.
- Quantization attacks the right bottleneck — INT8 and INT4 quantization work for inference not because integer math is faster, but because smaller weights move through the memory bus faster. The compute cores were never the limiting factor.
What’s Next
- Benchmark Mistral-7B inference on T4 at FP16, INT8, and INT4 to validate the theoretical throughput numbers
- Profile actual memory bandwidth utilization during inference using `nvidia-smi` and the PyTorch profiler
- Compare vLLM’s continuous batching against naive HuggingFace `generate()` to measure how batching shifts the bottleneck from bandwidth to compute