KV Cache Inference Path for LLMs

ML & AI · flowchart diagram · MIT

Illustrates the KV Cache inference path in large language models, detailing how it optimizes token generation by reusing historical K/V states to achieve s

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
KV Cache LLM Inference Transformer Attention Optimization Generative AI

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    P["Full prompt_ids"] --> PF["Prefill<br>Entire prompt through the model in one pass"]
    PF --> K["Cache K / V in every layer"]
    K --> DS["Decode Step t<br>Input only 1 new token"]
    DS --> RP["start_pos + compute new Q/K/V"]
    RP --> CAT["Append new K/V to the end of the cache"]
    CAT --> AT["attention(new query, cached KV)"]
    AT --> NX["next_token logits"]
    NX -. "Continue with next step" .-> DS

    classDef seq fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef cache fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    classDef core fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
    class P,PF,DS,RP,AT,NX seq;
    class K,CAT cache;
    class core core;

What this diagram shows

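This diagram traces the KV Cache inference path. During prefill, the full prompt (prompt_ids) passes through the model in a single forward pass, and every layer caches its Key/Value states. Each decode step then feeds in only one new token: the model takes start_pos (where the new token sits in the sequence), computes that token's Q/K/V, appends the new K/V to the end of the cache, runs attention between the new query and the cached K/V, and produces the next-token logits. The loop then repeats for the following step.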
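
The prefill and decode steps can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the MicroLM implementation: `ToyLayer`, `embed`, the random weights, and the fixed `new_ids` are all hypothetical, there is a single attention head with no positional encoding, and `start_pos` is used only to check that the cache holds exactly the earlier positions. The sketch checks that decoding one token at a time against the cache gives the same attention output as recomputing the whole sequence from scratch.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over all cached keys/values."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total for d in range(dim)]


class ToyLayer:
    """One attention layer that owns its K/V cache (hypothetical, for illustration)."""

    def __init__(self, dim, seed=0):
        rng = random.Random(seed)

        def make():
            return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(dim)]

        self.Wq, self.Wk, self.Wv = make(), make(), make()
        self.k_cache, self.v_cache = [], []

    def forward(self, xs, start_pos):
        """Process tokens xs that occupy positions start_pos, start_pos + 1, ..."""
        assert start_pos == len(self.k_cache), "cache must hold exactly the earlier positions"
        outs = []
        for x in xs:
            q = matvec(self.Wq, x)
            # Append the new K/V to the end of the cache, then attend over all of it.
            self.k_cache.append(matvec(self.Wk, x))
            self.v_cache.append(matvec(self.Wv, x))
            outs.append(attend(q, self.k_cache, self.v_cache))
        return outs


def embed(token_id, dim):
    """Deterministic stand-in for an embedding table."""
    rng = random.Random(token_id)
    return [rng.uniform(-1, 1) for _ in range(dim)]


DIM = 4
prompt_ids = [3, 1, 4, 1, 5]
new_ids = [9, 2, 6]  # fixed here; a real decoder would pick these from the logits

# Cached path: prefill the prompt once, then feed one token per decode step.
cached = ToyLayer(DIM)
cached.forward([embed(t, DIM) for t in prompt_ids], start_pos=0)
cached_outs = []
start_pos = len(prompt_ids)
for t in new_ids:
    cached_outs.append(cached.forward([embed(t, DIM)], start_pos=start_pos)[0])
    start_pos += 1

# Reference path: rebuild the whole sequence from scratch at every step.
reference_outs = []
for i in range(len(new_ids)):
    fresh = ToyLayer(DIM)  # same seed, so same weights
    seq = prompt_ids + new_ids[: i + 1]
    reference_outs.append(fresh.forward([embed(t, DIM) for t in seq], start_pos=0)[-1])

max_diff = max(
    abs(a - b)
    for c_out, r_out in zip(cached_outs, reference_outs)
    for a, b in zip(c_out, r_out)
)
print("largest difference between cached and recomputed outputs:", max_diff)
```

In a real model the prefill handles all prompt positions in parallel under a causal mask; the sequential loop here produces the same result and keeps the sketch short.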

When to use it

Use this diagram to explain how KV caching speeds up large language model inference, especially token generation over long sequences or in interactive (REPL) sessions where earlier context should be reused rather than recomputed.
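
To make the saving concrete, the following sketch counts how many token positions are fed through the model to generate a given number of new tokens, with and without a cache. The function name `positions_processed` and the example sizes (a 512-token prompt, 128 generated tokens) are assumptions chosen for illustration. The count is a rough proxy only: a cached decode step still attends over every cached K/V entry, so per-step attention cost continues to grow with sequence length.

```python
def positions_processed(prompt_len, new_tokens, use_cache):
    """Count token positions fed through the model to generate new_tokens tokens."""
    if use_cache:
        # Prefill covers the prompt once and yields the first new token;
        # each later step feeds only the most recent token.
        return prompt_len + (new_tokens - 1)
    # Without a cache, step i re-feeds the prompt plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(new_tokens))


with_cache = positions_processed(512, 128, use_cache=True)
without_cache = positions_processed(512, 128, use_cache=False)
print(with_cache, without_cache)  # 639 73664
```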

How to adapt it for your project

The same pattern applies to other Transformer-based models; what changes is how each attention layer stores and reads its cache for a given architecture. It also extends to distributed inference setups, where the KV cache may be sharded or managed across multiple devices.
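
When adapting the pattern, it helps to estimate how much memory the cache needs, since each layer stores one K and one V vector per cached position and per KV head. The sketch below is a back-of-the-envelope calculator; the function name `kv_cache_bytes`, the model dimensions, and the head-wise sharding scheme (splitting KV heads evenly across devices) are assumptions for illustration, not details taken from the source project.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2, n_devices=1):
    """Estimate per-device KV cache size in bytes, sharding KV heads across devices."""
    if n_kv_heads % n_devices != 0:
        raise ValueError("n_kv_heads must be divisible by n_devices for head-wise sharding")
    heads_per_device = n_kv_heads // n_devices
    # The leading factor 2 accounts for storing both K and V.
    return 2 * n_layers * heads_per_device * head_dim * seq_len * batch_size * bytes_per_elem


GIB = 2 ** 30
# Hypothetical model: 32 layers, 32 KV heads of size 128, 4096 positions, 16-bit values.
full = kv_cache_bytes(32, 32, 128, 4096)
# Same model with only 8 KV heads (as in grouped-query attention).
fewer_heads = kv_cache_bytes(32, 8, 128, 4096)
# Original model with its KV heads split across 4 devices.
sharded = kv_cache_bytes(32, 32, 128, 4096, n_devices=4)
print(full / GIB, fewer_heads / GIB, sharded / GIB)  # 2.0 0.5 0.5
```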

Key concepts

KV Cache
LLM Inference
Prefill
Decode Step
Attention Mechanism