TFLOPS: The GPU Metric Every AI Engineer Should Understand
What TFLOPS actually measures, why FP16 matters for LLMs, and why the most important GPU bottleneck for inference isn't compute at all.
Everyone talks about TFLOPS when comparing GPUs for AI workloads. But most people use the number wrong — they compare raw compute without understanding what actually limits their model’s speed. This post explains what TFLOPS measures, when it matters, and when it doesn’t.
The Problem
You’re choosing a GPU for LLM inference. You look at specs: the T4 does 65 TFLOPS FP16, the A100 does 312 TFLOPS FP16. Five times more compute, so the A100 should be five times faster at generating tokens, right?
Wrong. For single-user LLM inference, the A100 is faster — but not 5x faster. The reason is that TFLOPS measures the wrong bottleneck for most inference workloads. Without understanding what’s actually limiting performance, you’ll either overspend on compute you can’t use, or under-provision memory bandwidth that you desperately need.
The Solution
Think of TFLOPS as one half of the performance equation. The other half — memory bandwidth — is what actually determines token generation speed for most LLM inference scenarios.
The key insight: during inference, the GPU must load every model weight from VRAM for every single token it generates. If the weights can’t be moved fast enough, it doesn’t matter how many TFLOPS you have — the compute cores sit idle waiting for data.
How It Works
What TFLOPS Actually Measures
FLOPS stands for Floating-Point Operations Per Second. It counts how many math operations (additions, multiplications on decimal numbers) a processor can perform per second.
The scale runs from billions to quintillions:
| Unit | Value | Typical context |
|---|---|---|
| GFLOPS | 10^9 | CPU-level |
| TFLOPS | 10^12 | Single GPU |
| PFLOPS | 10^15 | GPU cluster |
| EFLOPS | 10^18 | Supercomputer |
When NVIDIA says the A100 does 312 TFLOPS, they mean 312 trillion floating-point operations per second — but only at FP16 precision.
Precision Changes Everything
GPUs have wildly different TFLOPS ratings depending on the numerical precision you use. Lower precision means fewer bits per number, so the hardware can push more operations through its datapaths simultaneously.
| GPU | FP32 (32-bit) | FP16 (16-bit) | INT8 (8-bit) |
|---|---|---|---|
| T4 | 8.1 TFLOPS | 65 TFLOPS | 130 TOPS |
| A10G | 31.2 TFLOPS | 125 TFLOPS | 250 TOPS |
| A100 | 19.5 TFLOPS | 312 TFLOPS | 624 TOPS |
| H100 | 67 TFLOPS | 990 TFLOPS | 1,979 TOPS |
The T4 jumps from 8.1 to 65 TFLOPS — an 8x increase — just by switching from FP32 to FP16. This is why running Mistral-7B in `torch.float16` isn’t just about saving memory. It’s about unlocking 8x more compute throughput on the same hardware.
For LLM inference, FP16 is the standard. FP32 is wasteful — the quality difference is negligible, but you’re leaving 80-90% of your GPU’s capability on the table.
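To make the tradeoff concrete, here’s a back-of-envelope sketch in plain Python using the T4 figures from the table above (parameter count and throughput numbers are the ones quoted in this post):

```python
# Back-of-envelope: what switching FP32 -> FP16 buys on a T4.
params = 7e9                       # Mistral-7B parameter count

fp32_bytes = params * 4            # 4 bytes per weight -> 28 GB
fp16_bytes = params * 2            # 2 bytes per weight -> 14 GB

t4_fp32_tflops = 8.1               # T4 peak FP32 throughput
t4_fp16_tflops = 65.0              # T4 peak FP16 (tensor core) throughput

print(f"FP32 weights: {fp32_bytes / 1e9:.0f} GB")                  # 28 GB
print(f"FP16 weights: {fp16_bytes / 1e9:.0f} GB")                  # 14 GB
print(f"Compute speedup: {t4_fp16_tflops / t4_fp32_tflops:.1f}x")  # ~8.0x
```

Half the bytes to move and 8x the peak compute, from a one-line dtype change.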
FLOPS for Training vs Inference
The same acronym means different things in different contexts, which causes confusion.
Training cares about total FLOPs (no “per second”) — the total compute budget required to train a model. The rough formula:
Total FLOPs ≈ 6 × N × D
Where N = number of parameters and D = number of training tokens. The factor of 6 breaks down as 2 FLOPs per parameter per token for the forward pass plus 4 for the backward pass — computing gradients costs roughly twice the forward pass.
So Mistral-7B trained on 8 trillion tokens required approximately:
6 × 7 × 10^9 × 8 × 10^12 ≈ 3.4 × 10^23 FLOPs
That’s 340 zettaFLOPs. On a single A100, that would take over 30 years. This is why training happens on clusters of thousands of GPUs.
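This arithmetic is worth scripting so you can swap in any model and GPU. A minimal sketch using the same 6 × N × D rule of thumb and the A100 peak from the table:

```python
# Training compute budget via the 6 * N * D rule of thumb.
N = 7e9                            # parameters (Mistral-7B)
D = 8e12                           # training tokens

total_flops = 6 * N * D            # ~3.36e23 FLOPs

a100_flops = 312e12                # A100 peak FP16 throughput
seconds = total_flops / a100_flops
years = seconds / (365 * 24 * 3600)

print(f"{total_flops:.2e} FLOPs -> ~{years:.0f} years on one A100 at peak")
```

Roughly 34 years at sustained peak throughput — and real utilization is well below peak, so clusters are the only option.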
Inference cares about FLOPS (per second) — throughput. How fast can the GPU crunch through the matrix multiplications needed to produce the next token? But here’s where the plot twist comes in.
The Memory Bandwidth Bottleneck
During inference, for each token generated, the GPU must:
- Load all model weights from VRAM into compute cores
- Compute the matrix multiplications (attention + FFN)
- Write the result back
For a 7B parameter model in FP16, that’s 14 GB of weights loaded per token. On a T4 with 300 GB/s memory bandwidth, just moving the weights takes:
14 GB / 300 GB/s ≈ 47 ms per token → ~21 tokens/second
But the T4 has 65 TFLOPS of compute. The actual math for one forward pass of a 7B model is roughly 14 billion FLOPs. At 65 TFLOPS, the compute takes:
(14 × 10^9) / (65 × 10^12) ≈ 0.2 ms
The compute takes 0.2 ms. The memory transfer takes 47 ms. The GPU spends 99.6% of its time waiting for data and 0.4% actually computing. This is what “memory-bandwidth bound” means.
This ratio is captured by the arithmetic intensity — FLOPs per byte transferred. For single-batch LLM inference, the arithmetic intensity is approximately 1 (one operation per byte loaded). Modern GPUs are designed for arithmetic intensities of 100+. The mismatch is severe.
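This roofline-style comparison is worth running for any GPU/model pair you’re evaluating. A sketch using the T4 numbers above:

```python
# Memory time vs compute time for one decoded token (batch=1, T4).
weights_gb = 14.0                  # 7B params in FP16
bandwidth_gbs = 300.0              # T4 memory bandwidth
flops_per_token = 14e9             # ~2 FLOPs per parameter per token
peak_flops = 65e12                 # T4 peak FP16 throughput

t_memory = weights_gb / bandwidth_gbs       # ~47 ms
t_compute = flops_per_token / peak_flops    # ~0.2 ms

print(f"memory: {t_memory * 1e3:.1f} ms, compute: {t_compute * 1e3:.2f} ms")
print(f"bandwidth-bound ceiling: ~{1 / t_memory:.0f} tokens/s")
```

Whichever time is larger is your bottleneck — here memory wins by two orders of magnitude.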
When TFLOPS Actually Matters
TFLOPS becomes the limiting factor when you increase the arithmetic intensity:
| Scenario | Arithmetic intensity | Bottleneck |
|---|---|---|
| Single-user inference (batch=1) | ~1 | Memory bandwidth |
| Batched inference (batch=32+) | ~32 | Shifting toward compute |
| Training (large batches) | 100+ | Compute (TFLOPS) |
| Prefill phase (prompt processing) | High | Compute |
During training, you process large batches — the weights are loaded once and reused across many samples. The arithmetic intensity is high, and TFLOPS becomes the limiting factor.
During batched inference (serving many users simultaneously), you load weights once and compute for multiple requests. More TFLOPS = more throughput.
During single-user inference (generating tokens one at a time), memory bandwidth is king. This is why quantization (INT8, INT4) helps so much — not because INT8 compute is faster, but because smaller weights transfer faster.
Quantization: Attacking the Real Bottleneck
Quantizing Mistral-7B from FP16 to INT8 cuts the model size from 14 GB to 7 GB. On the same T4:
7 GB / 300 GB/s ≈ 23 ms per token → ~43 tokens/second
You just doubled your inference speed — not by buying a faster GPU, but by reducing the amount of data that needs to move through the memory bus. The compute was never the bottleneck.
Going further to INT4 (3.5 GB): ~86 tokens/second. Each halving of precision roughly doubles single-user throughput, until you hit a quality floor where the model’s outputs degrade.
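The same bandwidth-bound estimate, swept across precisions (same T4, 300 GB/s — a theoretical ceiling, ignoring quantization overhead and the KV cache):

```python
# Tokens/s ceiling at each precision, assuming memory bandwidth is the
# only limit (it is, for batch-1 decoding on a T4).
bandwidth_gbs = 300.0
params = 7e9

ceilings = {}
for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    weights_gb = params * bytes_per_param / 1e9
    ceilings[name] = bandwidth_gbs / weights_gb
    print(f"{name}: {weights_gb:4.1f} GB -> ~{ceilings[name]:.0f} tokens/s")
```

Real-world numbers land below these ceilings (dequantization work, KV cache traffic), but the scaling trend holds.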
Choosing the Right GPU
Given all this, here’s a practical framework:
| Use case | What to optimize | GPU recommendation |
|---|---|---|
| Dev/prototyping | Cost | T4 (16 GB, cheap) |
| Single-user inference | Memory bandwidth + VRAM | L4 or A10G |
| High-throughput serving | TFLOPS + VRAM | A100 or H100 |
| Training | TFLOPS + interconnect | H100 cluster |
| Large models (70B+) | VRAM capacity | Multi-GPU (A100 80GB) |
Don’t just compare TFLOPS. Compare the memory bandwidth, VRAM capacity, and cost per token for your specific workload.
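One way to operationalize that comparison is cost per million generated tokens. The sketch below uses placeholder throughput and pricing — none of these figures are real quotes; measure your own tokens/s and plug in your provider’s actual hourly rates:

```python
# Cost per million generated tokens. All figures are illustrative
# placeholders -- substitute your measured tokens/s and real pricing.
gpus = {
    # name: (tokens/s on YOUR workload, $/hour from YOUR provider)
    "T4":   (21, 0.35),
    "A10G": (43, 1.00),
}

cost_per_mtok = {}
for name, (tok_s, usd_hr) in gpus.items():
    cost_per_mtok[name] = usd_hr / (tok_s * 3600) * 1e6
    print(f"{name}: ${cost_per_mtok[name]:.2f} per million tokens")
```

A cheaper, slower GPU can win on cost per token even when it loses on every raw spec.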
What I Learned
- TFLOPS is necessary but not sufficient — for single-user LLM inference, memory bandwidth is the actual bottleneck, not raw compute. A GPU with higher bandwidth but lower TFLOPS can outperform one with more compute.
- Precision is a performance lever, not just a memory optimization — running in FP16 vs FP32 isn’t about saving VRAM. It’s about unlocking 8x more TFLOPS on the same silicon, and it halves the memory bandwidth pressure too.
- Quantization attacks the right bottleneck — INT8 and INT4 quantization work for inference not because integer math is faster, but because smaller weights move through the memory bus faster. The compute cores were never the limiting factor.
What’s Next
- Benchmark Mistral-7B inference on T4 at FP16, INT8, and INT4 to validate the theoretical throughput numbers
- Profile actual memory bandwidth utilization during inference using `nvidia-smi` and the PyTorch profiler
- Compare vLLM’s continuous batching against naive HuggingFace `generate()` to measure how batching shifts the bottleneck from bandwidth to compute