Large Language Model Single-Turn Text Generation Inference Path

ML & AI · flowchart diagram · MIT

Illustrates the complete inference path for single-turn text generation in an LLM, from initial prompt to final output, emphasizing prompt resolution and t

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
LLM Text Generation Inference Prompting Tokenizer AI Machine Learning

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    P["prompt: str"] --> R["resolve_generation_prompt()<br>plain-text or chat prompt"]
    R --> T["tokenizer.encode()<br>BPE or HF tokenizer"]
    T --> I["prompt_ids<br>list[int] → tensor(1, seq_len)"]
    I --> G["model.generate()<br>prefill + decode loop"]
    G --> S["sampling<br>temperature · top-p · EOS"]
    S --> O["generated_ids"]
    O --> D["tokenizer.decode()"]
    D --> Y["output_text"]

    classDef step fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef sample fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    class P,R,T,I,G,D step;
    class S sample;
    class O,Y out;

What this diagram shows

This flowchart traces single-turn text generation in a Large Language Model (LLM) step by step. The prompt string is first resolved as either a plain-text or a chat-formatted prompt, then encoded into token IDs and shaped into a tensor of shape (1, seq_len). The model's generation routine runs a prefill pass over the prompt followed by a decode loop; each decode step samples the next token under the temperature and top-p settings and stops at the end-of-sequence (EOS) token. The generated IDs are finally decoded back into the output text.
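The path described above can be sketched end to end in a few lines of plain Python. This is only an illustration under stated assumptions: `CharTokenizer`, `toy_logits`, `run_inference`, and the `chat_template` parameter are hypothetical stand-ins, not the MicroLM implementation, and only the name `resolve_generation_prompt` comes from the diagram. The toy model recomputes logits from the full sequence, so there is no KV cache, and decoding is greedy to keep the result deterministic.

```python
class CharTokenizer:
    """Toy character-level tokenizer; a stand-in for a BPE or HF tokenizer."""

    def __init__(self, alphabet):
        self.eos_id = 0  # id 0 is reserved for end-of-sequence
        self.id_to_char = {i + 1: ch for i, ch in enumerate(alphabet)}
        self.char_to_id = {ch: i for i, ch in self.id_to_char.items()}
        self.vocab_size = len(alphabet) + 1

    def encode(self, text):
        return [self.char_to_id[ch] for ch in text]

    def decode(self, ids):
        return "".join(self.id_to_char[i] for i in ids if i != self.eos_id)


def resolve_generation_prompt(prompt, chat_template=None):
    # Plain-text prompts pass through unchanged; chat prompts are wrapped
    # in a template. The template format here is an assumption.
    if chat_template is None:
        return prompt
    return chat_template.format(prompt=prompt)


def toy_logits(ids, vocab_size):
    # Stand-in for a forward pass: prefers (last_id + 1) mod vocab_size.
    preferred = (ids[-1] + 1) % vocab_size
    return [5.0 if t == preferred else 0.0 for t in range(vocab_size)]


def generate(batch, vocab_size, eos_id, max_new_tokens=16):
    ids = list(batch[0])                  # batch has shape (1, seq_len)
    logits = toy_logits(ids, vocab_size)  # "prefill": one pass over the prompt
    generated = []
    for _ in range(max_new_tokens):       # decode loop: one token per step
        next_id = max(range(vocab_size), key=lambda t: logits[t])  # greedy pick
        if next_id == eos_id:             # stop at end-of-sequence
            break
        generated.append(next_id)
        ids.append(next_id)
        logits = toy_logits(ids, vocab_size)
    return generated


def run_inference(prompt, tokenizer, chat_template=None):
    text = resolve_generation_prompt(prompt, chat_template)
    prompt_ids = tokenizer.encode(text)   # list[int]
    batch = [prompt_ids]                  # nested list playing the role of tensor(1, seq_len)
    generated_ids = generate(batch, tokenizer.vocab_size, tokenizer.eos_id)
    return tokenizer.decode(generated_ids)


tokenizer = CharTokenizer("abc")
print(run_inference("a", tokenizer))  # prints "bc"
```

With the alphabet "abc", the prompt "a" encodes to `[1]`; the toy model then emits ids 2 and 3 before reaching id 0 (EOS), so only "bc" is decoded. A real model would replace `toy_logits` with a forward pass and would typically reuse cached attention state between decode steps.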

When to use it

Use this diagram to understand or explain the inference pipeline of a text generation model, especially when considering how prompts are processed and how model outputs are generated. It's useful for debugging generation issues, optimizing inference performance, or designing prompt engineering strategies for LLMs.
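When debugging generation issues, the sampling box is one place worth inspecting in isolation. The sketch below shows one common way temperature scaling and top-p (nucleus) filtering can be combined; the function names are hypothetical, and treating `temperature <= 0` as greedy decoding is an assumed convention, not something the diagram specifies.

```python
import math
import random


def softmax(logits, temperature=1.0):
    # Lower temperature sharpens the distribution; higher flattens it.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, top_p):
    # Keep the smallest set of most-likely tokens whose cumulative
    # probability reaches top_p, then renormalize over that set.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


def sample_next_token(logits, temperature=1.0, top_p=1.0, rng=random):
    if temperature <= 0:  # assumed convention: fall back to greedy decoding
        return max(range(len(logits)), key=lambda i: logits[i])
    nucleus = top_p_filter(softmax(logits, temperature), top_p)
    token_ids = list(nucleus)
    weights = [nucleus[i] for i in token_ids]
    return rng.choices(token_ids, weights=weights, k=1)[0]
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which helps when comparing outputs across temperature or top-p settings.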

How to adapt it for your project

Adapt this diagram by adding specific pre-processing or post-processing steps relevant to your application, such as input validation, response filtering, or integration with external services. You can also expand on the 'sampling' step to include more advanced decoding strategies like beam search, or detail the internal workings of 'model.generate()' for specific architectures.
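As a rough illustration of the pre-processing and post-processing stages mentioned above, the sketch below wraps an arbitrary generation callable with an input check and a response filter. All names here (`validate_prompt`, `filter_response`, `generate_with_hooks`, `blocked_terms`) are hypothetical, and the filtering rule is deliberately simplistic.

```python
def validate_prompt(prompt, max_chars=2000):
    # Pre-processing: reject empty or oversized input before tokenization.
    if not isinstance(prompt, str) or not prompt.strip():
        raise ValueError("prompt must be a non-empty string")
    if len(prompt) > max_chars:
        raise ValueError(f"prompt exceeds {max_chars} characters")
    return prompt.strip()


def filter_response(text, blocked_terms=()):
    # Post-processing: mask blocked terms in the decoded output text.
    for term in blocked_terms:
        text = text.replace(term, "[filtered]")
    return text.strip()


def generate_with_hooks(prompt, generate_fn, blocked_terms=()):
    # generate_fn stands in for the whole prompt -> output_text path.
    clean_prompt = validate_prompt(prompt)
    raw_output = generate_fn(clean_prompt)
    return filter_response(raw_output, blocked_terms)
```

In the diagram, these stages would appear as one extra node before `resolve_generation_prompt()` and one after `tokenizer.decode()`. Keeping them outside the model call means the core inference path stays unchanged.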

Key concepts