Illustrates the KV Cache inference path in large language models, detailing how it optimizes token generation by reusing historical K/V states to achieve s
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
P["Full prompt_ids"] --> PF["Prefill<br>entire prompt through the model in one pass"]
PF --> K["Per-layer K / V cache"]
K --> DS["Decode Step t<br>input only 1 new token"]
DS --> RP["start_pos + compute new Q/K/V"]
RP --> CAT["Append new K/V to end of cache"]
CAT --> AT["attention(new query, cached KV)"]
AT --> NX["next_token logits"]
NX -. "continue to next step" .-> DS
classDef seq fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef cache fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
classDef core fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
class P,PF,DS,NX seq;
class K,CAT cache;
class RP,AT core;
This diagram traces the KV Cache inference path. Prefill runs the full prompt through the model once and caches the Key/Value states of every layer. Each decode step then takes a single new token, computes its Q/K/V at position start_pos, appends the new K/V to the end of the cache, attends from the new query over the cached K/V, and produces the next-token logits. The token chosen from those logits becomes the input to the following decode step.
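The prefill and decode steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not any particular model's implementation: `ToyAttentionLayer`, `matvec`, `attend`, and `full_recompute` are hypothetical names, the layer has a single head, positional encoding is omitted (so start_pos is simply the current cache length), and prefill feeds the prompt token by token, which under causal masking gives the same cache as a single batched pass.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def attend(q, keys, values):
    """Scaled dot-product attention of one query over all given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]

class ToyAttentionLayer:
    """Hypothetical single-head attention layer with an append-only KV cache."""

    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.k_cache = []  # one cached key per processed token
        self.v_cache = []  # one cached value per processed token

    def step(self, x):
        """Decode step: compute Q/K/V for one token, append K/V, attend."""
        q = matvec(self.Wq, x)
        self.k_cache.append(matvec(self.Wk, x))
        self.v_cache.append(matvec(self.Wv, x))
        return attend(q, self.k_cache, self.v_cache)

    def prefill(self, xs):
        """Fill the cache from the whole prompt; return the last output."""
        out = None
        for x in xs:
            out = self.step(x)
        return out

def full_recompute(Wq, Wk, Wv, xs):
    """Reference without a cache: recompute K/V for the entire sequence."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)

# Usage: the cached decode step matches a full recomputation.
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.1], [0.2, 0.3]]
Wv = [[1.0, 2.0], [0.0, 1.0]]
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_token = [0.5, -0.5]

layer = ToyAttentionLayer(Wq, Wk, Wv)
layer.prefill(prompt)
cached_out = layer.step(new_token)
reference_out = full_recompute(Wq, Wk, Wv, prompt + [new_token])
assert cached_out == reference_out
```

The decode step only ever computes Q/K/V for the one new token; everything about earlier tokens comes from `k_cache` and `v_cache`, which is the reuse the diagram highlights.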
Use this diagram when explaining how large language model inference is optimized, in particular how token generation is sped up for long sequences, or in interactive (REPL) scenarios where earlier context must be reused rather than recomputed.
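To make the saving for long sequences concrete, the rough count below compares how many token positions pass through the model with and without a cache. It is a back-of-the-envelope sketch: `tokens_processed` is a hypothetical helper, and it assumes that without a cache every generation step re-feeds the entire sequence so far, while with a cache the prompt is prefilled once and each later step feeds a single token.

```python
def tokens_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model to generate new_tokens tokens."""
    if new_tokens <= 0:
        return 0
    if use_cache:
        # Prefill yields the first new token; each later step feeds 1 token.
        return prompt_len + (new_tokens - 1)
    # No cache: step i re-feeds the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))

print(tokens_processed(100, 50, use_cache=True))   # 149
print(tokens_processed(100, 50, use_cache=False))  # 6225
```

The count ignores the memory held by the cache itself, which grows with sequence length and is the cost paid for this reuse.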
This pattern can be adapted for various Transformer-based models by adjusting the caching mechanism for different attention layers or model architectures. It can also be extended to distributed inference setups where KV caches might be sharded or managed across multiple devices.