Illustrates the KV cache inference path in LLMs, showing how prefill and iterative decoding with cached Key/Value states optimize token generation by avoiding recomputation of K/V states for earlier tokens.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
P["Full prompt_ids"] --> PF["Prefill<br>Entire prompt passes through the model in one pass"]
PF --> K["Per-layer cached K / V"]
K --> DS["Decode Step t<br>Input only 1 new token"]
DS --> RP["start_pos + compute new Q/K/V"]
RP --> CAT["Append new K/V to the end of the cache"]
CAT --> AT["attention(new query, cached KV)"]
AT --> NX["next_token logits"]
NX -. "continue with next step" .-> DS
classDef seq fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef cache fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
classDef core fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
class P,PF,DS,RP,AT,NX seq;
class K,CAT cache;
class core core;
This diagram illustrates inference in a Large Language Model (LLM) that uses a Key-Value (KV) cache. In the initial 'Prefill' phase the entire prompt passes through the model once, and every layer caches its Key and Value states. Each subsequent 'Decode Step' takes a single new token as input: the model computes only that token's Query, Key, and Value vectors, appends the new Key and Value to the end of the cache, and attends from the new query over all cached K/V to produce the next-token logits. Reusing the cached K/V states avoids recomputing them for earlier tokens at every step.
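The prefill and decode flow described here can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a real model: `ToyAttentionLayer`, `matvec`, `attend`, and `full_recompute` are hypothetical names, there is one attention head and one layer, positional encoding is omitted (`start_pos` is only exposed), and prefill loops token by token where a real implementation would process the prompt in parallel under a causal mask.

```python
import math


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class ToyAttentionLayer:
    """Hypothetical single-head attention layer that owns its K/V cache."""

    def __init__(self, Wq, Wk, Wv):
        self.Wq, self.Wk, self.Wv = Wq, Wk, Wv
        self.k_cache, self.v_cache = [], []

    @property
    def start_pos(self):
        """Position of the next token; a real model would feed this to its
        positional encoding, which this sketch leaves out."""
        return len(self.k_cache)

    def step(self, x):
        """Decode step: compute Q/K/V for one token, append K/V, attend."""
        q = matvec(self.Wq, x)
        self.k_cache.append(matvec(self.Wk, x))
        self.v_cache.append(matvec(self.Wv, x))
        return attend(q, self.k_cache, self.v_cache)

    def prefill(self, xs):
        """Fill the cache from the whole prompt; return the last output."""
        out = None
        for x in xs:
            out = self.step(x)
        return out


def full_recompute(Wq, Wk, Wv, xs):
    """Reference without a cache: re-project K/V for every token in xs."""
    keys = [matvec(Wk, x) for x in xs]
    values = [matvec(Wv, x) for x in xs]
    return attend(matvec(Wq, xs[-1]), keys, values)


# Usage: prefill a 3-token prompt, then decode one more token.
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[0.5, 0.1], [0.2, 0.3]]
Wv = [[1.0, 2.0], [0.0, 1.0]]
prompt = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
layer = ToyAttentionLayer(Wq, Wk, Wv)
layer.prefill(prompt)
cached_out = layer.step([0.5, -0.5])
```

In this sketch `cached_out` should agree with `full_recompute(Wq, Wk, Wv, prompt + [[0.5, -0.5]])`, since the cache stores exactly the K/V vectors the reference recomputes.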
Use this diagram to explain how KV caching speeds up LLM text generation. It is useful for contrasting full recomputation with incremental decoding, especially where fast token generation or long contexts matter.
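To put rough numbers on the contrast between full recomputation and incremental decoding, the sketch below counts how many per-layer token K/V projections each approach performs. `kv_projections` is a hypothetical helper. It assumes the first new token comes from the pass over the prompt, and it counts only K/V projections: the attention computation itself still grows with cache length in both cases.

```python
def kv_projections(prompt_len, new_tokens, use_cache):
    """Count per-layer token K/V projections needed to generate new_tokens."""
    if new_tokens < 1:
        raise ValueError("new_tokens must be at least 1")
    if use_cache:
        # Prefill projects each prompt token once; every later decode step
        # projects only the single token that is fed back in.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step i re-projects the prompt plus the i tokens
    # generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


print(kv_projections(100, 50, use_cache=True))   # 149
print(kv_projections(100, 50, use_cache=False))  # 6225
```

With a cache the count grows linearly in the number of generated tokens. Without one it grows quadratically, which is why the gap widens for long generations.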
To adapt this diagram, you could add specific components like a 'Tokenizer' or 'Output Layer' for more detail. You could also expand on the 'Prefill' step to show batching, or on the 'Decode Step' to include sampling strategies (e.g., greedy, top-k, nucleus). For different architectures, modify the 'Attention' block to reflect specific attention mechanisms or add steps for speculative decoding.
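If the diagram is extended with a sampling step after the `next_token logits` node, the three strategies named above could look roughly like this. It is a sketch using only the Python standard library. The function names are illustrative, and temperature scaling and batching are left out.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities with a numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Pick the token id with the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_sample(logits, k, rng):
    """Sample among the k highest-scoring token ids."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])
    return rng.choices(top, weights=probs, k=1)[0]


def nucleus_sample(logits, p, rng):
    """Sample from the smallest top-ranked set whose probability mass reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(logits)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(0)  # seeded so repeated runs draw the same samples
logits = [0.1, 2.0, -1.0, 1.5]
print(greedy(logits))  # 1
print(top_k_sample(logits, k=2, rng=rng))
print(nucleus_sample(logits, p=0.9, rng=rng))
```

Whichever strategy is used, the chosen token id becomes the single input of the next decode step, so the cache logic in the diagram is unchanged.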