Inference Playground

An interactive guide to how LLMs serve your requests on a GPU.

Section 1

What happens when you send a prompt?

When you type a message and press enter, the GPU does two things in sequence — and understanding this sequence explains almost everything about LLM performance.

First: Prefill. The GPU reads your entire input prompt in parallel, processing all tokens at once. This is compute-heavy. It's why there's a brief pause before the first word appears.

Then: Decode. The GPU generates the response one token at a time, because each new token depends on all previous tokens. Each step does little arithmetic but has to read the model weights and the KV cache from memory, so this phase is memory-heavy. This is the text streaming you see in ChatGPT.

Two key metrics fall out of this: TTFT (Time to First Token) measures how long you wait for the first word. TBT (Time Between Tokens) measures how smooth the streaming feels — the tick marks in the demo below.
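As a rough sketch of how these two metrics could be computed from a stream of token arrival times: the timestamps below are made up for illustration, and the function names `ttft` and `tbt` are hypothetical helpers, not part of any serving library.

```python
from statistics import mean


def ttft(request_sent: float, token_times: list[float]) -> float:
    """Time to First Token: delay from sending the request to the first token."""
    return token_times[0] - request_sent


def tbt(token_times: list[float]) -> list[float]:
    """Time Between Tokens: gaps between consecutive token arrival times."""
    return [later - earlier for earlier, later in zip(token_times, token_times[1:])]


# Hypothetical timestamps in seconds, invented for this example.
sent_at = 0.0
arrivals = [0.5, 0.75, 1.0, 1.5]

first_token_delay = ttft(sent_at, arrivals)
gaps = tbt(arrivals)

print(f"TTFT: {first_token_delay:.3f} s")
print(f"mean TBT: {mean(gaps):.3f} s, worst TBT: {max(gaps):.3f} s")
```

Reporting the worst gap alongside the mean is a deliberate choice here: a single long pause is what makes streaming feel choppy, even when the average looks fine.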

Live demo: request timeline

Legend: Prefill, then Decode (each tick mark = 1 token = 1 TBT)

Prompt: What is the capital of France?

Response: (streams in the interactive demo)

Section 2

The memory problem: KV Cache

During decode, the GPU needs to remember every previous token to generate the next one. It does this by storing Key and Value vectors for each token — this is the KV cache. It grows linearly: every new token adds to it.
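The linear growth can be illustrated with a toy model. This is only a sketch: the class `ToyKVCache` and the word-level tokenization are hypothetical, and placeholder strings stand in for the real Key and Value vectors that an actual cache would store per layer and per attention head.

```python
class ToyKVCache:
    """Holds one (key, value) pair per token seen so far."""

    def __init__(self) -> None:
        self.entries: list[tuple[str, str]] = []

    def append(self, token: str) -> None:
        # Placeholder strings stand in for the real Key and Value vectors.
        self.entries.append((f"K({token})", f"V({token})"))

    def __len__(self) -> int:
        return len(self.entries)


def generate(prompt: list[str], new_tokens: list[str]) -> list[int]:
    """Return the cache size after prefill and after each decode step."""
    cache = ToyKVCache()
    for token in prompt:        # prefill: every prompt token gets an entry
        cache.append(token)
    sizes = [len(cache)]
    for token in new_tokens:    # decode: one new entry per generated token
        cache.append(token)
        sizes.append(len(cache))
    return sizes


# Hypothetical word-level tokenization of the demo prompt and a short reply.
sizes = generate(["What", "is", "the", "capital", "of", "France", "?"], ["Paris", "."])
print(sizes)  # [7, 8, 9]
```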

For Llama 3 8B, each token in the KV cache costs about 128 KB of GPU memory at 16-bit precision. An H100 has 80 GB total. The model weights take ~17 GB. That leaves ~63 GB for the KV cache, which sounds like a lot until you have dozens of users at once.
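The 128 KB figure can be reproduced with a short calculation. The sketch below assumes the commonly published Llama 3 8B shape (32 layers, 8 key/value heads, head dimension 128) and 2 bytes per value for 16-bit precision; it also treats 1 GB as 1024³ bytes, which is an assumption about the units used above.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    # Factor of 2: one Key vector and one Value vector per layer and KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


# Assumed Llama 3 8B shape: 32 layers, 8 KV heads, head dimension 128, FP16.
per_token = kv_bytes_per_token(32, 8, 128, 2)
print(per_token // 1024, "KB per token")  # 128 KB per token

# Assumption: 1 GB = 1024**3 bytes, with ~63 GB left after the weights.
free_bytes = 63 * 1024 ** 3
max_tokens = free_bytes // per_token
print(max_tokens, "tokens fit in the free VRAM")

# How many users fit if each one holds a long 32K-token context?
users_at_32k = max_tokens // (32 * 1024)
print(users_at_32k, "concurrent users at 32K tokens each")
```

Under these assumptions the free memory holds roughly half a million tokens in total, so the number of concurrent users drops quickly as contexts get longer.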

When VRAM fills up, the system must evict a request's KV cache to CPU memory over the PCIe bus. That request stalls until the data is paged back in. This is the core tension in LLM serving: memory is the bottleneck.

Drag the sliders below to feel the pressure.

VRAM pressure calculator: Llama 3 8B on H100 (H100 SXM 80 GB)

Capacity: 80 GB (17 GB weights, 62.9 GB free)

Concurrent users: 1 (slider range: 1 to 100)

Tokens per user: 500 (slider range: 100 to 32 K)

Memory breakdown
Model weights: 17 GB
KV cache (1 × 500 × 128 KB): 0.1 GB
Total: 17.1 / 80 GB
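The same breakdown can be approximated in a few lines of code. This is a sketch of the arithmetic only, not the calculator's actual implementation: the function name `vram_breakdown` is hypothetical, the defaults come from the figures above, and 1 GB is assumed to mean 1024³ bytes.

```python
KB = 1024
GB = 1024 ** 3


def vram_breakdown(users: int, tokens_per_user: int, capacity_gb: float = 80,
                   weights_gb: float = 17, kv_kb_per_token: int = 128) -> dict:
    """Estimate VRAM use: model weights plus KV cache for all active users."""
    kv_gb = users * tokens_per_user * kv_kb_per_token * KB / GB
    total_gb = weights_gb + kv_gb
    return {
        "kv_gb": kv_gb,
        "total_gb": total_gb,
        "free_gb": capacity_gb - total_gb,
        "fits": total_gb <= capacity_gb,
    }


# The default slider positions: 1 user with 500 tokens.
light = vram_breakdown(users=1, tokens_per_user=500)
print(f"KV cache: {light['kv_gb']:.1f} GB, total: {light['total_gb']:.1f} / 80 GB")

# A heavy case: 100 users with 32,000 tokens each no longer fits.
heavy = vram_breakdown(users=100, tokens_per_user=32_000)
print("fits in VRAM:", heavy["fits"])
```

Because the KV cache term is a product of users and tokens per user, doubling either one doubles the cache, which is why pushing both up at once overflows the card so quickly.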

Section 3

When requests compete

A real GPU doesn't serve one user at a time. Dozens of requests share the same chip. The simplest way to manage them: FCFS — First Come, First Served. Process requests in the order they arrive.

This works fine when all requests are similar in size. But when a 50K-token document summarization arrives before your 100-token chat message, you wait for that entire long request to finish prefill before yours can start. This is called Head-of-Line Blocking.
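Head-of-line blocking is easy to reproduce in a toy model. The sketch below is a deliberate simplification: it only models prefill, runs one request at a time, and ignores decode and batching. The request names echo the B/I labels used in this guide, but the arrival times and prefill durations are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    name: str
    arrival: float  # seconds since the start of the simulation
    prefill: float  # seconds of GPU time needed for prefill


def fcfs_waits(requests: list[Request]) -> dict[str, float]:
    """Serve prefills one at a time in arrival order; return each queueing delay."""
    gpu_free_at = 0.0
    waits = {}
    for req in sorted(requests, key=lambda r: r.arrival):
        start = max(gpu_free_at, req.arrival)  # wait until the GPU is free
        waits[req.name] = start - req.arrival
        gpu_free_at = start + req.prefill
    return waits


# Hypothetical workload: one long batch request arrives just before two chats.
workload = [
    Request("B0", arrival=0.0, prefill=4.0),
    Request("I1", arrival=0.5, prefill=0.25),
    Request("I2", arrival=1.0, prefill=0.25),
]
waits = fcfs_waits(workload)
print(waits)  # {'B0': 0.0, 'I1': 3.5, 'I2': 3.25}
```

In this toy run the two interactive requests each need only a quarter of a second of GPU time, yet both spend more than three seconds queued behind the batch request.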

The Gantt chart below shows 5 requests arriving at a shared GPU. Watch what happens to the interactive ones.

FCFS scheduling: 5 requests (I = interactive, B = batch)

Legend: queued, prefill, decode

Requests: B0 (batch), I1 (interactive), I2 (interactive), I3 (interactive), B4 (batch)

Now you understand the physics. Time to experiment.

Adjust the parameters below. Watch how arrival rate, memory pressure, and scheduling policy interact in ways that are hard to reason about without seeing them.

Inference Playground: LLM Serving Simulator

Scenarios:

Clicking a scenario loads its parameters into the sidebar. Click "Update Simulation" to run.

Select a scenario or configure the sidebar, then click "Update Simulation" to compute.
