This flowchart illustrates the end-to-end inference path for single-turn text generation in a Large Language Model, from prompt input to final output text.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
P["prompt: str"] --> R["resolve_generation_prompt()<br>plain text or chat prompt"]
R --> T["tokenizer.encode()<br>BPE or HF tokenizer"]
T --> I["prompt_ids<br>list[int] → tensor(1, seq_len)"]
I --> G["model.generate()<br>prefill + decode loop"]
G --> S["sampling<br>temperature · top-p · EOS"]
S --> O["generated_ids"]
O --> D["tokenizer.decode()"]
D --> Y["output_text"]
classDef step fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef sample fill:#fff7ed,stroke:#fb923c,color:#0f172a;
classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
class P,R,T,I,G,D step;
class S sample;
class O,Y out;
The diagram traces the sequential steps of generating text with a Large Language Model. The raw prompt is first resolved into the correct format (plain text or chat), tokenized into integer IDs, and shaped into a tensor of shape (1, seq_len) before being passed to the model's generation routine. The model runs a prefill pass over the prompt and then a decode loop. The diagram draws sampling as a separate box, but in autoregressive generation it is applied at each decode step: temperature and top-p shape the choice of the next token, and an end-of-sequence (EOS) token stops the loop. Finally, the generated IDs are decoded back into human-readable output text.
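To make the sampling and decode-loop boxes concrete, here is a minimal pure-Python sketch of temperature scaling, top-p (nucleus) filtering, and EOS stopping. The names `sample_top_p`, `decode_loop`, and `next_logits` are hypothetical and stand in for whatever `model.generate()` does internally; prefill, KV caching, and tensors are deliberately left out.

```python
import math
import random


def sample_top_p(logits, temperature=1.0, top_p=0.9, rng=random):
    """Pick one token id from raw logits using temperature and top-p filtering."""
    if temperature <= 0:
        # Assumption of this sketch: non-positive temperature means greedy decoding.
        return max(range(len(logits)), key=lambda i: logits[i])

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of most likely tokens whose total mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # random.choices renormalizes the kept weights implicitly.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def decode_loop(next_logits, prompt_ids, eos_id, max_new_tokens=16, **sampling):
    """Generate token ids one at a time until EOS or the token budget is hit.

    next_logits is a hypothetical callable: list of ids -> logits for the next token.
    """
    ids = list(prompt_ids)
    generated = []
    for _ in range(max_new_tokens):
        token = sample_top_p(next_logits(ids), **sampling)
        if token == eos_id:
            break  # EOS ends generation and is not part of the output
        generated.append(token)
        ids.append(token)
    return generated


def toy_logits(ids):
    """Toy stand-in for a model: prefers token 1, then EOS (id 3) once 5 ids exist."""
    return [0.0, 20.0, 0.0, 0.0] if len(ids) < 5 else [0.0, 0.0, 0.0, 20.0]
```

Calling `decode_loop(toy_logits, [0, 2], eos_id=3, top_p=0.5)` walks the loop with a two-token prompt. A real implementation would typically run prefill once and reuse cached attention state on each step instead of re-reading the whole sequence.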
Use this diagram to understand or explain the inference pipeline of a text generation model, debug generation issues, or design custom text generation applications. It's particularly useful when discussing the interplay between tokenization, model generation, and sampling strategies.
This flow can be adapted by swapping in a different tokenizer (e.g., SentencePiece), changing the decoding strategy (e.g., adding top-k filtering or switching from sampling to beam search), adding post-processing steps for the output text, or extending it to multi-turn conversations by placing a history management component before prompt resolution.
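As one illustration of the multi-turn extension, a history component could sit in front of prompt resolution and render past turns into a single prompt string. The sketch below is hypothetical: the `ChatHistory` class, the `role: text` line format, and the `max_turns` truncation rule are assumptions for illustration, not the format that `resolve_generation_prompt()` actually expects.

```python
from dataclasses import dataclass, field


@dataclass
class ChatHistory:
    """Hypothetical history manager that keeps only the most recent turns."""

    max_turns: int = 4
    turns: list = field(default_factory=list)

    def add(self, role, text):
        """Record one turn and drop the oldest ones beyond max_turns."""
        self.turns.append((role, text))
        self.turns = self.turns[-self.max_turns:]

    def render(self, user_message):
        """Build a prompt string from stored turns plus the new user message."""
        lines = [f"{role}: {text}" for role, text in self.turns]
        lines.append(f"user: {user_message}")
        lines.append("assistant:")  # cue for the model to continue as the assistant
        return "\n".join(lines)


history = ChatHistory(max_turns=2)
history.add("user", "hi")
history.add("assistant", "hello")
prompt = history.render("how are you?")
```

The resulting `prompt` would then enter the flow at the first node in place of a single-turn string. Truncating by turn count is the simplest choice; a token-budget rule based on `tokenizer.encode()` lengths would track the model's context window more closely.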