Inference Playground

An interactive guide to how LLMs serve your requests on a GPU.

developed by Lennox Fu 42

Section 1

What happens when you send a prompt?

When you type a message and press enter, the GPU does two things in sequence — and understanding this sequence explains almost everything about LLM performance.

First: Prefill. The GPU reads your entire input prompt in parallel, processing all tokens at once. This is compute-heavy. It's why there's a brief pause before the first word appears.

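Then: Decode. The GPU generates the response one token at a time. Each new token depends on all previous tokens, so the steps cannot run in parallel, and every step re-reads the model weights and the KV cache just to produce a single token. That makes decode memory-heavy: it is limited by memory bandwidth rather than by compute. This is the text streaming you see in ChatGPT.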

Two key metrics fall out of this: TTFT (Time to First Token) measures how long you wait for the first word. TBT (Time Between Tokens) measures how smooth the streaming feels — the tick marks in the demo below.
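To make the two metrics concrete, here is a minimal sketch that derives them from token arrival timestamps. The function name `latency_metrics` and the timestamps are made up for illustration; a real client would record the times as tokens stream in.

```python
from statistics import mean


def latency_metrics(request_sent, token_times):
    """Return (ttft, tbt_list) in the same time unit as the inputs."""
    if not token_times:
        raise ValueError("need at least one generated token")
    # TTFT: gap between sending the request and seeing the first token.
    ttft = token_times[0] - request_sent
    # TBT: gap between each pair of consecutive tokens.
    tbt = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return ttft, tbt


# Hypothetical timestamps in seconds (not measured from any real system).
sent = 0.0
arrivals = [0.5, 0.75, 1.0, 1.5]

ttft, tbt = latency_metrics(sent, arrivals)
print(f"TTFT = {ttft:.2f} s")
print(f"mean TBT = {mean(tbt):.2f} s, worst TBT = {max(tbt):.2f} s")
```

Looking at the worst TBT as well as the mean is a deliberate choice: a single long gap is what makes streaming feel like it stutters, even when the average looks fine.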

Live demo: request timeline

Legend: Prefill, Decode (each tick mark = 1 token = 1 TBT)

Prompt: What is the capital of France?

Response: (streamed token by token in the interactive demo)

Section 2

The memory problem: KV Cache

During decode, the GPU needs to remember every previous token to generate the next one. It does this by storing Key and Value vectors for each token — this is the KV cache. It grows linearly: every new token adds to it.

For Llama 3 8B, each token costs about 128 KB of GPU memory. An H100 has 80 GB total. The model weights take ~17 GB. That leaves ~63 GB for KV cache — sounds like a lot, until you have dozens of users at once.
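The 128 KB figure can be reproduced with a little arithmetic. The sketch below assumes the commonly published Llama 3 8B shape (32 layers, 8 KV heads, head dimension 128) and 2-byte FP16 values; none of those numbers appear in the text above, so treat them as assumptions. The helper names are hypothetical, and GB is taken as 10^9 bytes.

```python
def kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """KV cache bytes added by one token (defaults: assumed Llama 3 8B shape, FP16)."""
    # The leading 2 is one Key vector plus one Value vector per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value


def max_concurrent_requests(free_bytes, tokens_per_request, per_token_bytes):
    """How many requests of a given length fit in the free VRAM."""
    return free_bytes // (tokens_per_request * per_token_bytes)


per_token = kv_bytes_per_token()
free = 63 * 10**9  # ~63 GB left after the weights, as quoted in the text

print(f"KV cache per token: {per_token // 1024} KiB")
for context in (500, 4_000, 32_000):
    n = max_concurrent_requests(free, context, per_token)
    print(f"{context:>6} tokens per request -> {n} concurrent requests")
```

Under these assumptions the free memory holds roughly 961 requests at 500 tokens each, 120 at 4,000 tokens, and only 15 at 32,000 tokens, which is where "dozens of users" starts to hurt.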

When VRAM fills up, the system must evict a request's KV cache to CPU memory over the PCIe bus. That request stalls until the data is paged back in. This is the core tension in LLM serving: memory is the bottleneck.
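A rough feel for the cost of that stall: the sketch below estimates how long it takes to move one request's KV cache across the host link, using the 128 KB per token figure from above. The link throughput is a placeholder assumption, not a measured value, and `swap_seconds` is a hypothetical helper.

```python
KV_BYTES_PER_TOKEN = 128 * 1024  # per-token figure quoted in the text for Llama 3 8B

# Assumption: effective one-way host link throughput in bytes per second.
# Real numbers depend on the PCIe generation, lane count, and transfer size.
ASSUMED_LINK_BYTES_PER_S = 50e9


def swap_seconds(cached_tokens, link_bytes_per_s=ASSUMED_LINK_BYTES_PER_S):
    """One-way transfer time for a request's KV cache."""
    return cached_tokens * KV_BYTES_PER_TOKEN / link_bytes_per_s


one_way = swap_seconds(8_000)
print(f"evict:          {one_way * 1e3:.1f} ms")
print(f"evict + reload: {2 * one_way * 1e3:.1f} ms")
```

Compare the result with a typical TBT: if one round trip costs many token intervals, the evicted user sees a visible freeze in the stream.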

Drag the sliders below to feel the pressure.

VRAM pressure calculator: Llama 3 8B on H100 SXM 80 GB

Capacity: 80 GB
Weights: 17 GB
Free: 62.9 GB
Slider ranges: 1 to 100, and 100 to 32 K

Memory breakdown
Model weights: 17 GB
KV cache (1 × 500 × 128 KB): 0.1 GB
Total: 17.1 / 80 GB

Section 3

When requests compete

A real GPU doesn't serve one user at a time. Dozens of requests share the same chip. The simplest way to manage them: FCFS — First Come, First Served. Process requests in the order they arrive.

This works fine when all requests are similar in size. But when a 50K-token document summarization arrives before your 100-token chat message, you wait for that entire long request to finish prefill before yours can start. This is called Head-of-Line Blocking.
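Head-of-line blocking is easy to reproduce in a few lines. This is a deliberately simplified sketch: it runs prefills one at a time in arrival order and ignores decode and batching entirely. The arrival times and the prefill rate of 10,000 tokens per second are invented for illustration; only the 50K and 100 token sizes come from the text.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival_s: float
    prompt_tokens: int


def fcfs_ttft(requests, prefill_tokens_per_s):
    """Serve prefills strictly in arrival order; return {name: (queue_wait, ttft)}."""
    gpu_free_at = 0.0
    result = {}
    for r in sorted(requests, key=lambda r: r.arrival_s):
        start = max(gpu_free_at, r.arrival_s)  # wait for the GPU if it is busy
        finish = start + r.prompt_tokens / prefill_tokens_per_s
        result[r.name] = (start - r.arrival_s, finish - r.arrival_s)
        gpu_free_at = finish
    return result


# Hypothetical workload: two batch jobs around three interactive chats.
requests = [
    Request("B0", 0.0, 50_000),
    Request("I1", 0.1, 100),
    Request("I2", 0.2, 100),
    Request("I3", 0.3, 100),
    Request("B4", 0.4, 50_000),
]

results = fcfs_ttft(requests, prefill_tokens_per_s=10_000)
for name, (wait, ttft) in results.items():
    print(f"{name}: queued {wait:.2f} s, TTFT {ttft:.2f} s")
```

With these made-up numbers, I1 needs only 0.01 s of prefill but sits in the queue for about 4.9 s behind B0. Almost all of its TTFT is waiting, not work.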

The Gantt chart below shows 5 requests arriving at a shared GPU. Watch what happens to the interactive ones.

FCFS scheduling: 5 requests (I = interactive, B = batch)

Legend: queued, prefill, decode
Rows: B0 (batch), I1 (interactive), I2 (interactive), I3 (interactive), B4 (batch)

Now you understand the physics. Time to experiment.

Adjust the parameters below. Watch how arrival rate, memory pressure, and scheduling policy interact in ways that are hard to reason about without seeing them.

Inference Playground: LLM Serving Simulator
Scenarios:
Clicking a scenario loads its parameters into the sidebar. Click "Update Simulation" to run.