TFLOPS: The GPU Metric Every AI Engineer Should Understand

What TFLOPS actually measures, why FP16 matters for LLMs, and why the most important GPU bottleneck for inference isn't compute at all.

Alexandre Agius

AWS Solutions Architect

7 min read

Everyone talks about TFLOPS when comparing GPUs for AI workloads. But most people use the number wrong — they compare raw compute without understanding what actually limits their model’s speed. This post explains what TFLOPS measures, when it matters, and when it doesn’t.

The Problem

You’re choosing a GPU for LLM inference. You look at the specs: the T4 does 65 TFLOPS at FP16, the A100 does 312 TFLOPS at FP16. Almost five times the compute, so the A100 should be almost five times faster at generating tokens, right?

Wrong. For single-user LLM inference, the A100 is faster — but not 5x faster. The reason is that TFLOPS measures the wrong bottleneck for most inference workloads. Without understanding what’s actually limiting performance, you’ll either overspend on compute you can’t use, or under-provision memory bandwidth that you desperately need.

The Solution

Think of TFLOPS as one half of the performance equation. The other half — memory bandwidth — is what actually determines token generation speed for most LLM inference scenarios.

GPU Inference: Compute vs Memory Bandwidth

The key insight: during inference, the GPU must load every model weight from VRAM for every single token it generates. If the weights can’t be moved fast enough, it doesn’t matter how many TFLOPS you have — the compute cores sit idle waiting for data.

How It Works

What TFLOPS Actually Measures

FLOPS stands for Floating-Point Operations Per Second. It counts how many arithmetic operations (additions and multiplications on floating-point numbers) a processor can perform per second.

The scale runs from billions to quintillions:

| Unit | Value | Typical context |
| --- | --- | --- |
| GFLOPS | 10^9 | CPU-level |
| TFLOPS | 10^12 | Single GPU |
| PFLOPS | 10^15 | GPU cluster |
| EFLOPS | 10^18 | Supercomputer |

When NVIDIA says the A100 does 312 TFLOPS, they mean 312 trillion floating-point operations per second — but only at FP16 precision.

Precision Changes Everything

GPUs have wildly different TFLOPS ratings depending on the numerical precision you use. Lower precision means fewer bits per value, so more values move across the memory bus and more operations fit through the hardware's execution units each cycle.

| GPU | FP32 (32-bit) | FP16 (16-bit) | INT8 (8-bit) |
| --- | --- | --- | --- |
| T4 | 8.1 TFLOPS | 65 TFLOPS | 130 TOPS |
| A10G | 31.2 TFLOPS | 125 TFLOPS | 250 TOPS |
| A100 | 19.5 TFLOPS | 312 TFLOPS | 624 TOPS |
| H100 | 67 TFLOPS | 990 TFLOPS | 1,979 TOPS |

The T4 jumps from 8.1 to 65 TFLOPS — an 8x increase — by switching from FP32 on the CUDA cores to FP16 on the tensor cores. This is why running Mistral-7B in torch.float16 isn’t just about saving memory. It’s about unlocking 8x more compute throughput on the same hardware.

For LLM inference, FP16 is the standard. FP32 is wasteful — the quality difference is negligible, but you’re leaving 80-90% of your GPU’s capability on the table.
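
To make the memory side of the trade-off concrete, here’s a back-of-the-envelope sketch (plain Python, no GPU needed) of a 7B-parameter model’s weight footprint at each precision; the parameter count and byte widths are the ones used throughout this post:

```python
# Weight footprint of a 7B-parameter model at common precisions.
# Bytes per parameter: FP32 = 4, FP16 = 2, INT8 = 1, INT4 = 0.5.
PARAMS = 7e9

BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

for precision, width in BYTES_PER_PARAM.items():
    gigabytes = PARAMS * width / 1e9  # decimal GB, matching the units above
    print(f"{precision}: {gigabytes:.1f} GB of weights")
```

Halving precision halves both the VRAM needed to hold the model and the bytes that must cross the memory bus per token — which is exactly the lever the rest of this post pulls on.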

FLOPS for Training vs Inference

The same acronym means different things in different contexts, which causes confusion: FLOPs (lowercase s) is a count of operations, while FLOPS is a rate — operations per second.

Training is budgeted in total FLOPs — the total compute required to train a model. The rough formula:

Total FLOPs ≈ 6 × N × D

Where N = number of parameters and D = number of training tokens. The factor of 6 breaks down as roughly 2 FLOPs per parameter per token for the forward pass and 4 for the backward pass (computing gradients costs about twice the forward pass).

So Mistral-7B trained on 8 trillion tokens required approximately:

6 × 7 × 10^9 × 8 × 10^12 ≈ 3.4 × 10^23 FLOPs

That’s 340 zettaFLOPs. On a single A100, that would take over 30 years. This is why training happens on clusters of thousands of GPUs.
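
The 6 × N × D rule of thumb is easy to sanity-check in code. This is a rough sketch of the approximation above — it ignores attention FLOPs and assumes 100% GPU utilization, which no real cluster achieves:

```python
def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate total training compute: forward (2*N*D) + backward (4*N*D)."""
    return 6 * n_params * n_tokens

# Mistral-7B-scale example from the text: 7B parameters, 8T training tokens.
total = training_flops(7e9, 8e12)
print(f"Total training compute: {total:.2e} FLOPs")  # ~3.4e23

# Idealized time on a single A100 at 312 TFLOPS FP16.
A100_PEAK = 312e12
seconds = total / A100_PEAK
print(f"Single A100: {seconds / (3600 * 24 * 365):.0f} years")  # ~34 years
```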

Inference cares about FLOPS (per second) — throughput. How fast can the GPU crunch through the matrix multiplications needed to produce the next token? But here’s where the plot twist comes in.

The Memory Bandwidth Bottleneck

During inference, for each token generated, the GPU must:

  1. Load all model weights from VRAM into compute cores
  2. Compute the matrix multiplications (attention + FFN)
  3. Write the result back

For a 7B parameter model in FP16, that’s 14 GB of weights loaded per token. On a T4 with 300 GB/s memory bandwidth, just moving the weights takes:

14 GB / 300 GB/s ≈ 47 ms per token → ~21 tokens/second

But the T4 has 65 TFLOPS of compute. The actual math for one forward pass of a 7B model is roughly 14 billion FLOPs per token (about 2 per parameter). At 65 TFLOPS, the compute takes:

(14 × 10^9 FLOPs) / (65 × 10^12 FLOPS) ≈ 0.2 ms

The compute takes 0.2 ms. The memory transfer takes 47 ms. The GPU spends 99.6% of its time waiting for data and 0.4% actually computing. This is what “memory-bandwidth bound” means.

This ratio is captured by the arithmetic intensity — FLOPs per byte transferred. For single-batch LLM inference, the arithmetic intensity is approximately 1 (one operation per byte loaded). Modern GPUs are designed for arithmetic intensities of 100+. The mismatch is severe.
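
The whole argument compresses into a small roofline-style model: a token can’t finish before the weights have been moved and the math has been done, so per-token latency is bounded by the larger of the two. A minimal sketch using the T4 figures from this section:

```python
def token_time(model_bytes: float, flops_per_token: float,
               bandwidth: float, peak_flops: float):
    """Roofline-style lower bound on per-token latency (seconds)."""
    memory_s = model_bytes / bandwidth        # time to stream the weights
    compute_s = flops_per_token / peak_flops  # time to do the math
    bound = "memory bandwidth" if memory_s > compute_s else "compute"
    return max(memory_s, compute_s), bound

# 7B model in FP16 on a T4: 14 GB of weights, ~14 GFLOPs/token,
# 300 GB/s bandwidth, 65 TFLOPS FP16 peak.
t, bound = token_time(14e9, 14e9, 300e9, 65e12)
print(f"{t * 1e3:.1f} ms/token (~{1 / t:.0f} tokens/s), {bound}-bound")
```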

When TFLOPS Actually Matters

TFLOPS becomes the limiting factor when you increase the arithmetic intensity:

| Scenario | Arithmetic intensity | Bottleneck |
| --- | --- | --- |
| Single-user inference (batch=1) | ~1 | Memory bandwidth |
| Batched inference (batch=32+) | ~32 | Shifting toward compute |
| Training (large batches) | 100+ | Compute (TFLOPS) |
| Prefill phase (prompt processing) | High | Compute |

During training, you process large batches — the weights are loaded once and reused across many samples. The arithmetic intensity is high, and TFLOPS becomes the limiting factor.

During batched inference (serving many users simultaneously), you load weights once and compute for multiple requests. More TFLOPS = more throughput.

During single-user inference (generating tokens one at a time), memory bandwidth is king. This is why quantization (INT8, INT4) helps so much — not because INT8 compute is faster, but because smaller weights transfer faster.
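
The batching effect is worth sketching too: with batch size B, the weights are streamed once per decode step but reused for B tokens’ worth of math, so arithmetic intensity grows roughly linearly with B. A toy model with the T4 numbers (it ignores KV-cache and activation traffic, which matter in practice):

```python
MODEL_BYTES = 14e9       # 7B parameters in FP16
FLOPS_PER_TOKEN = 14e9   # ~2 FLOPs per parameter per token
BANDWIDTH = 300e9        # T4 memory bandwidth, bytes/s
PEAK_FLOPS = 65e12       # T4 FP16 peak

memory_s = MODEL_BYTES / BANDWIDTH  # one weight pass per decode step
for batch in (1, 8, 32, 128, 256):
    compute_s = batch * FLOPS_PER_TOKEN / PEAK_FLOPS
    bound = "compute" if compute_s > memory_s else "bandwidth"
    print(f"batch={batch:>3}: {bound}-bound")

# In this toy model the crossover sits at
# B* = memory_s / (FLOPS_PER_TOKEN / PEAK_FLOPS) ~ 217; real engines hit
# compute limits earlier because of attention and KV-cache traffic.
```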

Quantization: Attacking the Real Bottleneck

Quantizing Mistral-7B from FP16 to INT8 cuts the model size from 14 GB to 7 GB. On the same T4:

7 GB / 300 GB/s ≈ 23 ms per token → ~43 tokens/second

You just doubled your inference speed — not by buying a faster GPU, but by reducing the amount of data that needs to move through the memory bus. The compute was never the bottleneck.

Going further to INT4 (3.5 GB) gives ~86 tokens/second. Each halving of precision roughly doubles single-user throughput, until you hit a quality floor where the model’s outputs degrade.
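
These quantization numbers all fall out of a two-line estimate — bandwidth-bound throughput is just bandwidth divided by model size. A sketch with the T4’s 300 GB/s (an upper bound: it ignores dequantization overhead and KV-cache traffic):

```python
BANDWIDTH = 300e9  # T4 memory bandwidth, bytes/s
PARAMS = 7e9       # Mistral-7B-scale model

for precision, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    model_bytes = PARAMS * bytes_per_param
    ceiling = BANDWIDTH / model_bytes  # tokens/s if fully bandwidth-bound
    print(f"{precision}: {model_bytes / 1e9:.1f} GB -> ~{ceiling:.0f} tokens/s")
```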

Choosing the Right GPU

Given all this, here’s a practical framework:

| Use case | What to optimize | GPU recommendation |
| --- | --- | --- |
| Dev/prototyping | Cost | T4 (16 GB, cheap) |
| Single-user inference | Memory bandwidth + VRAM | L4 or A10G |
| High-throughput serving | TFLOPS + VRAM | A100 or H100 |
| Training | TFLOPS + interconnect | H100 cluster |
| Large models (70B+) | VRAM capacity | Multi-GPU (A100 80GB) |

Don’t just compare TFLOPS. Compare the memory bandwidth, VRAM capacity, and cost per token for your specific workload.

What I Learned

  • TFLOPS is necessary but not sufficient — for single-user LLM inference, memory bandwidth is the actual bottleneck, not raw compute. A GPU with higher bandwidth but lower TFLOPS can outperform one with more compute.
  • Precision is a performance lever, not just a memory optimization — running in FP16 vs FP32 isn’t about saving VRAM. It’s about unlocking 8x more TFLOPS on the same silicon, and it halves the memory bandwidth pressure too.
  • Quantization attacks the right bottleneck — INT8 and INT4 quantization work for inference not because integer math is faster, but because smaller weights move through the memory bus faster. The compute cores were never the limiting factor.

What’s Next

  • Benchmark Mistral-7B inference on T4 at FP16, INT8, and INT4 to validate the theoretical throughput numbers
  • Profile actual memory bandwidth utilization during inference using nvidia-smi and PyTorch profiler
  • Compare vLLM’s continuous batching against naive HuggingFace generate() to measure how batching shifts the bottleneck from bandwidth to compute
Alexandre Agius

AWS Solutions Architect

Passionate about AI & Security. Building scalable cloud solutions and helping organizations leverage AWS services to innovate faster. Specialized in Generative AI, serverless architectures, and security best practices.