An interactive guide to how LLMs serve your requests on a GPU.
Developed by Lennox Fu
Section 1
When you type a message and press enter, the GPU does two things in sequence — and understanding this sequence explains almost everything about LLM performance.
First: Prefill. The GPU reads your entire input prompt in parallel, processing all tokens at once. This is compute-heavy. It's why there's a brief pause before the first word appears.
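Then: Decode. The GPU generates the response one token at a time. Each new token depends on all previous tokens, so on every step the GPU rereads their stored Key and Value vectors (the KV cache) from memory while doing comparatively little arithmetic. That makes decode memory-heavy rather than compute-heavy. This is the text streaming you see in ChatGPT.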
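Two key metrics fall out of this. TTFT (Time to First Token) measures how long you wait for the first word, which is dominated by prefill. TBT (Time Between Tokens) measures the gap between consecutive tokens during decode, which determines how smooth the streaming feels; these gaps are the tick marks in the demo below.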
Live demo: request timeline (example prompt: "What is the capital of France?")
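As a rough illustration of the two metrics, the sketch below computes TTFT and TBT from the timestamps at which tokens arrive. The function name `latency_metrics` and the example timestamps are hypothetical and chosen only for illustration; they are not taken from the demo.

```python
from statistics import mean


def latency_metrics(request_sent: float, token_times: list[float]) -> dict[str, float]:
    """Compute TTFT and TBT statistics for one request.

    request_sent: time the prompt was submitted (seconds).
    token_times:  arrival time of each generated token (seconds, ascending).
    """
    if not token_times:
        raise ValueError("need at least one generated token")

    # TTFT: wait from submitting the prompt until the first token appears.
    ttft = token_times[0] - request_sent

    # TBT: gaps between consecutive tokens during decode.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]

    return {
        "ttft": ttft,
        "mean_tbt": mean(gaps) if gaps else 0.0,
        "max_tbt": max(gaps) if gaps else 0.0,
    }


# Hypothetical timestamps: first token after 0.5 s, then three more tokens.
metrics = latency_metrics(0.0, [0.5, 0.75, 1.0, 1.5])
print(metrics)
```

A large `max_tbt` relative to `mean_tbt` is one way to spot a visible stutter in the stream even when the average looks fine.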
Section 2
During decode, the GPU needs to remember every previous token to generate the next one. It does this by storing Key and Value vectors for each token — this is the KV cache. It grows linearly: every new token adds to it.
For Llama 3 8B, each token costs about 128 KB of GPU memory. An H100 has 80 GB total. The model weights take ~17 GB. That leaves ~63 GB for KV cache — sounds like a lot, until you have dozens of users at once.
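The 128 KB figure can be reproduced with a little arithmetic. The sketch below assumes the publicly documented Llama 3 8B shape (32 layers, 8 key/value heads, head dimension 128) and 16-bit values; none of these constants are stated in this guide, so treat them as assumptions. It also treats 1 GB as 10^9 bytes.

```python
# Assumed Llama 3 8B configuration (not stated in this guide).
NUM_LAYERS = 32
NUM_KV_HEADS = 8       # grouped-query attention: fewer KV heads than query heads
HEAD_DIM = 128
BYTES_PER_VALUE = 2    # FP16 / BF16


def kv_bytes_per_token(
    layers: int = NUM_LAYERS,
    kv_heads: int = NUM_KV_HEADS,
    head_dim: int = HEAD_DIM,
    bytes_per_value: int = BYTES_PER_VALUE,
) -> int:
    """Bytes of KV cache added by one token: a Key and a Value vector per layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_value


def max_cached_tokens(vram_gb: float = 80, weights_gb: float = 17) -> int:
    """Tokens of KV cache that fit in the VRAM left over after the weights."""
    free_bytes = int((vram_gb - weights_gb) * 1e9)
    return free_bytes // kv_bytes_per_token()


per_token = kv_bytes_per_token()
total_tokens = max_cached_tokens()

print(per_token // 1024, "KiB per token")           # 128 KiB per token
print(total_tokens, "tokens fit in ~63 GB")         # 480651 tokens
print(total_tokens // 4096, "requests at 4096 tokens each")  # 117 requests
```

Under these assumptions, roughly 480,000 tokens fit in the free 63 GB, which is only about 117 concurrent requests if each holds a 4,096-token context.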
When VRAM fills up, the system must evict a request's KV cache to CPU memory over the PCIe bus. That request stalls until the data is paged back in. This is the core tension in LLM serving: memory is the bottleneck.
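To get a feel for the cost of such a stall, the sketch below estimates how long it takes to move one request's KV cache across PCIe. It uses the guide's 128 KB per token; the bandwidth of 32 GB/s is a hypothetical effective value chosen for illustration, and the real number depends on the PCIe generation, lane count, and transfer overheads.

```python
KV_BYTES_PER_TOKEN = 128 * 1024  # per-token KV cache size used in this guide


def swap_seconds(num_tokens: int, pcie_gb_per_s: float = 32.0) -> float:
    """One-way transfer time for a request's KV cache over PCIe.

    pcie_gb_per_s is an assumed effective bandwidth (1 GB = 1e9 bytes).
    """
    total_bytes = num_tokens * KV_BYTES_PER_TOKEN
    return total_bytes / (pcie_gb_per_s * 1e9)


# An 8,192-token request holds 1 GiB of KV cache.
one_way = swap_seconds(8192)
round_trip = 2 * one_way  # evicted to CPU memory, then paged back in

print(f"one way:    {one_way * 1000:.1f} ms")
print(f"round trip: {round_trip * 1000:.1f} ms")
```

With these assumed numbers the round trip is tens of milliseconds for a single request, which is comparable to or larger than a typical gap between streamed tokens, so a swapped-out request shows up as a visible pause.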
Drag the sliders below to feel the pressure.
VRAM pressure calculator — Llama 3 8B on H100
Section 3
A real GPU doesn't serve one user at a time. Dozens of requests share the same chip. The simplest way to manage them: FCFS — First Come, First Served. Process requests in the order they arrive.
This works fine when all requests are similar in size. But when a 50K-token document summarization arrives before your 100-token chat message, you wait for that entire long request to finish prefill before yours can start. This is called Head-of-Line Blocking.
The Gantt chart below shows 5 requests arriving at a shared GPU. Watch what happens to the interactive ones.
FCFS scheduling — 5 requests (I = interactive, B = batch)
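Head-of-line blocking is easy to reproduce in a few lines. The sketch below is a deliberately simplified model: one request runs at a time, with no batching, and each request has a fixed service time. The request names, arrival times, and service times are hypothetical and are not the values used in the chart.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: float  # seconds since the start of the trace
    service: float  # seconds of GPU time the request needs (hypothetical)


def fcfs(requests: list[Request]) -> dict[str, tuple[float, float, float]]:
    """Serve requests strictly in arrival order on a single GPU.

    Returns {name: (start, finish, wait)} where wait = start - arrival.
    """
    clock = 0.0
    schedule = {}
    # sorted() is stable, so requests with equal arrival keep their input order.
    for req in sorted(requests, key=lambda r: r.arrival):
        start = max(clock, req.arrival)  # GPU may be idle or still busy
        finish = start + req.service
        schedule[req.name] = (start, finish, start - req.arrival)
        clock = finish
    return schedule


# One long batch job (B) arrives just before two short interactive ones (I).
trace = [
    Request("B1", arrival=0.0, service=10.0),
    Request("I1", arrival=1.0, service=0.5),
    Request("I2", arrival=2.0, service=0.5),
]

schedule = fcfs(trace)
for name, (start, finish, wait) in schedule.items():
    print(f"{name}: start={start:.1f}s finish={finish:.1f}s waited={wait:.1f}s")
```

In this trace the interactive request I1 needs only half a second of GPU time but waits nine seconds behind B1, which is the blocking effect FCFS cannot avoid.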
Adjust the parameters below. Watch how arrival rate, memory pressure, and scheduling policy interact in ways that are hard to reason about without seeing them.