This diagram illustrates the data processing pipeline for Supervised Fine-Tuning (SFT) of a language model, emphasizing the critical 'assistant-only masked loss' step.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
S1["conversations JSONL"] --> S2["normalize_conversations()"]
S2 --> S3["maybe_add_system_prompt()"]
S3 --> S4["render_chat_prompt()<br>role markers → linear text"]
S4 --> S5["BPETokenizer.encode()"]
S5 --> S6["build_loss_labels()<br>keep only assistant spans"]
S6 --> S7["(input_ids, labels)"]
S7 --> S8["train_sft.py<br>assistant-only masked loss"]
classDef data fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef proto fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
classDef loss fill:#fff7ed,stroke:#fb923c,color:#0f172a;
classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
class S1 data;
class S2,S3,S4,S5,S7 proto;
class S6,S8 loss;
class out out;
The diagram details the steps that prepare conversational data for Supervised Fine-Tuning (SFT) of a language model. Raw JSONL conversations are normalized by normalize_conversations(), optionally given a system prompt by maybe_add_system_prompt(), rendered into linear text with role markers by render_chat_prompt(), and tokenized by BPETokenizer.encode(). The crucial step is build_loss_labels(), which keeps labels only for tokens inside assistant spans and masks out everything else. The resulting (input_ids, labels) pairs are passed to train_sft.py, where only the assistant's responses contribute to the loss.
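The masking step can be sketched in a few lines of Python. This is a minimal illustration, not the pipeline's actual build_loss_labels(): the function name build_loss_labels_sketch, the (role, token_ids) segment format, and the sentinel value -100 (borrowed from the default ignore_index of PyTorch's cross-entropy loss) are all assumptions made for the example.

```python
IGNORE_INDEX = -100  # assumption: same sentinel as PyTorch's default ignore_index


def build_loss_labels_sketch(segments):
    """Build (input_ids, labels) from role-tagged token segments.

    `segments` is a hypothetical format: a list of (role, token_ids) pairs
    in conversation order. Tokens outside assistant segments get
    IGNORE_INDEX so they can be skipped when the loss is computed.
    """
    input_ids, labels = [], []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)  # supervised: the label is the token itself
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))  # masked out
    return input_ids, labels


# Toy token ids; a real pipeline would obtain these from the tokenizer.
segments = [
    ("system", [1, 2]),
    ("user", [3, 4, 5]),
    ("assistant", [6, 7]),
]
input_ids, labels = build_loss_labels_sketch(segments)
```

In this toy case the system and user positions carry the sentinel while the two assistant tokens keep their ids as labels, and input_ids and labels stay the same length.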
This pipeline suits fine-tuning language models for conversational tasks, where the goal is to improve the quality of assistant responses. The model still conditions on user prompts and system instructions as context, but those tokens are excluded from the loss, so the model is not trained to reproduce them. The training signal is therefore concentrated on generating relevant and coherent replies.
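To make the effect of masking on the loss concrete, here is a dependency-free sketch of a masked next-token cross-entropy. It assumes the usual causal-LM convention that the logits at position t score the token at position t + 1, and that masked labels carry the sentinel -100. How train_sft.py actually computes its loss is not shown in the source, so the function name and conventions here are illustrative only.

```python
import math

IGNORE_INDEX = -100  # assumption: sentinel marking positions excluded from the loss


def masked_next_token_loss(logits, labels):
    """Mean cross-entropy over unmasked next-token targets.

    logits: list of per-position score lists, one list per input token.
    labels: list aligned with the input tokens; IGNORE_INDEX marks masked ones.
    """
    total, count = 0.0, 0
    for t in range(len(labels) - 1):
        target = labels[t + 1]  # position t predicts the token at t + 1
        if target == IGNORE_INDEX:
            continue  # prompt tokens contribute nothing to the loss
        row = logits[t]
        m = max(row)
        # Numerically stable log-sum-exp for the softmax normalizer.
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        count += 1
    return total / count if count else 0.0


# Uniform scores over a 4-token vocabulary; only the last two labels are supervised.
example_logits = [[0.0, 0.0, 0.0, 0.0] for _ in range(4)]
example_labels = [IGNORE_INDEX, IGNORE_INDEX, 2, 3]
example_loss = masked_next_token_loss(example_logits, example_labels)
```

Because the average is taken only over unmasked targets, long prompts do not dilute the loss of short assistant replies.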
The pipeline can be adapted by swapping the BPE tokenizer for another one (e.g., SentencePiece, WordPiece), customizing prompt rendering for different chat formats, or changing the loss masking strategy to include other parts of the conversation if needed. The maybe_add_system_prompt() step can be extended for more complex prompt engineering, and the build_loss_labels() logic can be adjusted for multi-turn or role-specific loss weighting.
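One possible shape for role-specific loss weighting is to replace the binary mask with a per-token weight derived from the role. The sketch below is an assumption-laden illustration: build_loss_weights, weighted_mean, the (role, token_ids) segment format, and the example weights are hypothetical and not part of the pipeline shown in the diagram.

```python
def build_loss_weights(segments, role_weights):
    """Return one weight per token; roles missing from role_weights get 0.0."""
    weights = []
    for role, token_ids in segments:
        weights.extend([role_weights.get(role, 0.0)] * len(token_ids))
    return weights


def weighted_mean(per_token_losses, weights):
    """Weighted average of per-token losses; 0.0 if every weight is zero."""
    denom = sum(weights)
    if denom == 0:
        return 0.0
    return sum(l * w for l, w in zip(per_token_losses, weights)) / denom


# Toy conversation: user tokens count a quarter as much as assistant tokens.
weight_segments = [("user", [3, 4]), ("assistant", [6, 7])]
token_weights = build_loss_weights(
    weight_segments, {"assistant": 1.0, "user": 0.25}
)
# Per-token losses would come from the model; these values are made up.
weighted_loss = weighted_mean([4.0, 4.0, 1.0, 1.0], token_weights)
```

Setting the assistant weight to 1.0 and every other role to 0.0 recovers the assistant-only masking described above, so the binary mask is a special case of this scheme.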