Illustrates the complete inference path for single-turn text generation in an LLM, from initial prompt to final output, emphasizing prompt resolution and t
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
P["prompt: str"] --> R["resolve_generation_prompt()<br>plain text or chat prompt"]
R --> T["tokenizer.encode()<br>BPE or Hugging Face tokenizer"]
T --> I["prompt_ids<br>list[int] → tensor(1, seq_len)"]
I --> G["model.generate()<br>prefill + decode loop"]
G --> S["sampling<br>temperature · top-p · EOS"]
S --> O["generated_ids"]
O --> D["tokenizer.decode()"]
D --> Y["output_text"]
classDef step fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef sample fill:#fff7ed,stroke:#fb923c,color:#0f172a;
classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
class P,R,T,I,G,D step;
class S sample;
class O,Y out;
This flowchart traces single-turn text generation with a large language model (LLM) step by step. The prompt is first resolved as either plain text or a chat-formatted prompt, then encoded into token IDs and shaped into a tensor of size (1, seq_len). The model's generation routine runs a prefill pass followed by a decode loop, sampling each new token under temperature and top-p settings until an end-of-sequence (EOS) token or a length limit stops it. The generated IDs are then decoded back into the final output text.
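The encode, generate, sample, and decode steps can be sketched in plain Python as follows. This is a minimal illustration, not the implementation behind the diagram: the character-level vocabulary, `toy_logits`, and every function signature here are hypothetical stand-ins, and the prompt resolution step is omitted because the diagram does not specify how it works. A real model would also cache attention state after the prefill pass instead of re-reading the full sequence on every step.

```python
import math
import random

# Hypothetical character-level vocabulary standing in for a BPE tokenizer.
VOCAB = ["<eos>"] + list("abcdefghijklmnopqrstuvwxyz ")
EOS_ID = 0

def encode(text):
    """Map each character to its vocabulary index (raises ValueError if unknown)."""
    return [VOCAB.index(ch) for ch in text]

def decode(ids):
    """Map token IDs back to text, dropping any EOS token."""
    return "".join(VOCAB[i] for i in ids if i != EOS_ID)

def toy_logits(ids):
    """Stand-in for a model forward pass: favors the next vocabulary entry, or EOS after 'e'."""
    last = ids[-1]
    logits = [0.0] * len(VOCAB)
    if VOCAB[last] == "e" or last == len(VOCAB) - 1:
        logits[EOS_ID] = 10.0
    else:
        logits[last + 1] = 10.0
    return logits

def softmax(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose total mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}  # renormalized nucleus

def sample_next(logits, temperature, top_p, rng):
    """Pick the next token ID; a temperature of 0 or below means greedy decoding."""
    if temperature <= 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    nucleus = top_p_filter(softmax(logits, temperature), top_p)
    ids = list(nucleus)
    return rng.choices(ids, weights=[nucleus[i] for i in ids], k=1)[0]

def generate(prompt, max_new_tokens=16, temperature=0.0, top_p=1.0, seed=0):
    """Encode the prompt, run the decode loop until EOS or the length limit, then decode."""
    rng = random.Random(seed)
    prompt_ids = encode(prompt)
    generated_ids = []
    for _ in range(max_new_tokens):
        logits = toy_logits(prompt_ids + generated_ids)
        next_id = sample_next(logits, temperature, top_p, rng)
        if next_id == EOS_ID:
            break
        generated_ids.append(next_id)
    return decode(generated_ids)

print(generate("ab"))
```

With the toy model above, the loop continues the alphabet from the end of the prompt and stops when the stand-in model favors EOS. Lowering `top_p` shrinks the candidate set before sampling, while raising `temperature` flattens the distribution the candidates are drawn from.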
Use this diagram to understand or explain the inference pipeline of a text generation model, especially when considering how prompts are processed and how model outputs are generated. It's useful for debugging generation issues, optimizing inference performance, or designing prompt engineering strategies for LLMs.
Adapt this diagram by adding specific pre-processing or post-processing steps relevant to your application, such as input validation, response filtering, or integration with external services. You can also expand on the 'sampling' step to include more advanced decoding strategies like beam search, or detail the internal workings of 'model.generate()' for specific architectures.
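One way to picture the added pre-processing and post-processing steps is a thin wrapper around the generation call. The sketch below is illustrative only: `generate_with_guards`, its parameters, and the blocked-term list are hypothetical names, and `generate_fn` stands for whatever function maps a prompt string to output text in your application.

```python
from typing import Callable, Sequence

def generate_with_guards(
    prompt: str,
    generate_fn: Callable[[str], str],
    max_prompt_chars: int = 2000,
    blocked_terms: Sequence[str] = ("secret",),
) -> str:
    """Validate the prompt, call the generator, then filter the response."""
    # Pre-processing: input validation before the prompt reaches the tokenizer.
    cleaned = prompt.strip()
    if not cleaned:
        raise ValueError("prompt must not be empty")
    if len(cleaned) > max_prompt_chars:
        raise ValueError(f"prompt exceeds {max_prompt_chars} characters")

    # Core pipeline: resolve, encode, generate, decode (delegated to generate_fn).
    text = generate_fn(cleaned)

    # Post-processing: case-sensitive response filtering on the decoded text.
    for term in blocked_terms:
        text = text.replace(term, "[filtered]")
    return text

# Usage with a placeholder generator that echoes the prompt.
print(generate_with_guards("  hi ", lambda p: p + " secret reply"))
```

Keeping validation and filtering outside the generation function means the new steps appear in the diagram as extra nodes before `resolve_generation_prompt()` and after `tokenizer.decode()`, leaving the core path unchanged.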