KV Cache Inference Path for Large Language Models

ML & AI · flowchart diagram · MIT

Illustrates the KV Cache inference path in LLMs, showing how prefill and iterative decoding with cached Key/Value states optimize token generation by avoiding recomputation of K/V for earlier tokens.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
LLM Large Language Model Inference KV Cache Transformer Deep Learning AI Optimization

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    P["Full prompt_ids"] --> PF["Prefill<br>entire prompt passes through the model in one pass"]
    PF --> K["Cache K / V in every layer"]
    K --> DS["Decode Step t<br>input only 1 new token"]
    DS --> RP["start_pos + compute new Q/K/V"]
    RP --> CAT["Append new K/V to the end of the cache"]
    CAT --> AT["attention(new query, cached KV)"]
    AT --> NX["next_token logits"]
    NX -. "continue with next step" .-> DS

    classDef seq fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef cache fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    classDef core fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
    class P,PF,DS,RP,AT,NX seq;
    class K,CAT cache;
    class core core;

What this diagram shows

This diagram illustrates LLM inference with a Key-Value (KV) cache. In the initial 'Prefill' phase, the entire prompt passes through the model once and every layer caches its K and V states. Each subsequent 'Decode Step' takes a single new token as input: the model computes Q, K, and V for that token at position start_pos, appends the new K/V to the end of the cache, and attends from the new query over all cached K/V to produce the next-token logits. Because historical K/V states are reused rather than recomputed, each decode step processes one token instead of the whole sequence.
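The prefill and decode path described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the MicroLM implementation: it assumes a single attention head in a single layer, random weights, and no positional encoding. The names `ToyAttentionLayer`, `forward`, and `last_output_without_cache` are hypothetical, and `start_pos` is used here only to check that the cache is aligned (a real model would typically also use it for positional information).

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over the given keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


class ToyAttentionLayer:
    """Hypothetical single-head layer that keeps its own K/V cache."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def make_matrix():
            return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

        self.Wq, self.Wk, self.Wv = make_matrix(), make_matrix(), make_matrix()
        self.k_cache, self.v_cache = [], []

    def forward(self, xs, start_pos):
        """Process token vectors xs that begin at position start_pos."""
        assert start_pos == len(self.k_cache), "cache and start_pos out of sync"
        outputs = []
        for x in xs:
            # Append the new K/V to the end of the cache ...
            self.k_cache.append(matvec(self.Wk, x))
            self.v_cache.append(matvec(self.Wv, x))
            # ... then attend from the new query over everything cached so far.
            # Doing this token by token gives the same result as a causal mask.
            outputs.append(attend(matvec(self.Wq, x), self.k_cache, self.v_cache))
        return outputs


def last_output_without_cache(dim, seed, xs):
    """Reference path: rebuild the layer and re-encode the whole sequence."""
    return ToyAttentionLayer(dim, seed).forward(xs, start_pos=0)[-1]


dim = 4
rng = random.Random(1)
tokens = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(6)]
prompt, new_tokens = tokens[:4], tokens[4:]

layer = ToyAttentionLayer(dim, seed=0)
layer.forward(prompt, start_pos=0)  # prefill: the whole prompt at once

cached_outputs = []
for t, x in enumerate(new_tokens):  # decode: one new token per step
    cached_outputs.append(layer.forward([x], start_pos=len(prompt) + t)[0])
```

Comparing each entry of `cached_outputs` with `last_output_without_cache` on the same prefix is a quick way to confirm that the cached path and full recomputation agree.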

When to use it

Use this diagram to explain or understand the optimization techniques for LLM inference, particularly how KV caching improves efficiency and speed during text generation. It's useful for demonstrating the difference between full recomputation and incremental decoding, especially in scenarios requiring fast token generation or long sequence contexts.
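To make the gap between full recomputation and incremental decoding concrete, the sketch below counts how many token positions must be pushed through the K/V projections of one layer. It is a rough counting model under stated assumptions, not a benchmark: it ignores the cost of the attention scores themselves and the memory the cache occupies, and the function names are hypothetical.

```python
def kv_positions_without_cache(prompt_len, num_new_tokens):
    """Each step re-encodes the prompt plus every token generated so far."""
    return sum(prompt_len + t for t in range(num_new_tokens))


def kv_positions_with_cache(prompt_len, num_new_tokens):
    """Prefill encodes the prompt once; each later step encodes one new token."""
    if num_new_tokens == 0:
        return 0
    return prompt_len + (num_new_tokens - 1)


# Example: a 100-token prompt followed by 50 generated tokens.
without_cache = kv_positions_without_cache(100, 50)
with_cache = kv_positions_with_cache(100, 50)
print(without_cache, with_cache)  # 6225 149
```

Without a cache the count grows quadratically with the number of generated tokens, while with a cache it grows linearly, which is why the difference matters most for long contexts.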

How to adapt it for your project

To adapt this diagram, you could add specific components like a 'Tokenizer' or 'Output Layer' for more detail. You could also expand on the 'Prefill' step to show batching, or on the 'Decode Step' to include sampling strategies (e.g., greedy, top-k, nucleus). For different architectures, modify the 'Attention' block to reflect specific attention mechanisms or add steps for speculative decoding.
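If you extend the 'Decode Step' with a sampling stage, the node after "next_token logits" could behave roughly like the following sketch. It assumes the logits arrive as a plain list of floats indexed by token id. All function names are hypothetical, and details such as temperature or repetition penalties are left out.

```python
import math
import random


def softmax(logits):
    """Convert logits to probabilities with a numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def greedy(logits):
    """Greedy decoding: always pick the highest-scoring token id."""
    return max(range(len(logits)), key=lambda i: logits[i])


def top_k_candidates(logits, k):
    """Top-k: keep only the k highest-scoring token ids."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:k]


def nucleus_candidates(logits, p):
    """Nucleus (top-p): keep the smallest prefix of the ranked token ids
    whose cumulative probability reaches p."""
    probs = softmax(logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return kept


def sample_from(logits, candidates, rng):
    """Renormalize over the surviving candidates and draw one token id."""
    probs = softmax([logits[i] for i in candidates])
    return rng.choices(candidates, weights=probs, k=1)[0]


logits = [2.0, 1.0, 0.1, -1.0]
rng = random.Random(0)
next_greedy = greedy(logits)
next_top_k = sample_from(logits, top_k_candidates(logits, 2), rng)
next_nucleus = sample_from(logits, nucleus_candidates(logits, 0.9), rng)
```

The sampling choice does not change the cache logic: whichever token id is selected becomes the single input to the next decode step.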

Key concepts