This diagram illustrates the Supervised Fine-Tuning (SFT) data pipeline, focusing on processing conversational data and applying an assistant-only masked l
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
S1["conversations JSONL"] --> S2["normalize_conversations()"]
S2 --> S3["maybe_add_system_prompt()"]
S3 --> S4["render_chat_prompt()<br>role markers → 线性文本"]
S4 --> S5["BPETokenizer.encode()"]
S5 --> S6["build_loss_labels()<br>仅保留 assistant 区间"]
S6 --> S7["(input_ids, labels)"]
S7 --> S8["train_sft.py<br>assistant-only masked loss"]
classDef data fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef proto fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
classDef loss fill:#fff7ed,stroke:#fb923c,color:#0f172a;
classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
class S1 data;
class S2,S3,S4,S5,S7 proto;
class S6,S8 loss;
class out out;
The diagram details the data preparation and training pipeline for Supervised Fine-Tuning (SFT) of a language model. It starts with conversational JSONL data, normalizes it, optionally adds system prompts, renders chat prompts into linear text, and tokenizes it using BPE. The core innovation shown is the 'build_loss_labels()' step, which ensures that only the assistant's reply segments contribute to the training loss, preventing the model from learning to predict user prompts or waste capacity on non-assistant text. Finally, the processed '(input_ids, labels)' are used in 'train_sft.py' for training with this specific masked loss.
This pipeline is used when fine-tuning large language models (LLMs) on conversational data, especially when the goal is to optimize the model's ability to generate high-quality assistant responses without being influenced by user prompts or system instructions during loss calculation. It's ideal for supervised fine-tuning tasks where precise control over the loss signal is desired.
This pipeline can be adapted by changing the 'normalize_conversations()' logic for different data formats, modifying 'maybe_add_system_prompt()' for various system instructions, or altering 'render_chat_prompt()' for different role markers or prompt templates. The 'BPETokenizer' can be swapped for other tokenizers (e.g., SentencePiece). The 'build_loss_labels()' function can be adjusted to include or exclude other parts of the conversation from the loss, or to implement different weighting strategies. The 'train_sft.py' component can be extended with different optimizers, learning rate schedules, or regularization techniques.