SFT Data Protocol and Assistant-Only Loss Training Path

ML & AI · flowchart diagram · MIT

This diagram illustrates the Supervised Fine-Tuning (SFT) data pipeline, focusing on processing conversational data and applying an assistant-only masked l

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
LLM Fine-tuning NLP Data pipeline Loss function AI training Conversational AI

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    S1["conversations JSONL"] --> S2["normalize_conversations()"]
    S2 --> S3["maybe_add_system_prompt()"]
    S3 --> S4["render_chat_prompt()<br>role markers → 线性文本"]
    S4 --> S5["BPETokenizer.encode()"]
    S5 --> S6["build_loss_labels()<br>仅保留 assistant 区间"]
    S6 --> S7["(input_ids, labels)"]
    S7 --> S8["train_sft.py<br>assistant-only masked loss"]

    classDef data fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef proto fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
    classDef loss fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    class S1 data;
    class S2,S3,S4,S5,S7 proto;
    class S6,S8 loss;
    class out out;

What this diagram shows

The diagram details the data preparation and training pipeline for Supervised Fine-Tuning (SFT) of a language model. It starts with conversational JSONL data, normalizes it, optionally adds system prompts, renders chat prompts into linear text, and tokenizes it using BPE. The core innovation shown is the 'build_loss_labels()' step, which ensures that only the assistant's reply segments contribute to the training loss, preventing the model from learning to predict user prompts or waste capacity on non-assistant text. Finally, the processed '(input_ids, labels)' are used in 'train_sft.py' for training with this specific masked loss.

When to use it

This pipeline is used when fine-tuning large language models (LLMs) on conversational data, especially when the goal is to optimize the model's ability to generate high-quality assistant responses without being influenced by user prompts or system instructions during loss calculation. It's ideal for supervised fine-tuning tasks where precise control over the loss signal is desired.

How to adapt it for your project

This pipeline can be adapted by changing the 'normalize_conversations()' logic for different data formats, modifying 'maybe_add_system_prompt()' for various system instructions, or altering 'render_chat_prompt()' for different role markers or prompt templates. The 'BPETokenizer' can be swapped for other tokenizers (e.g., SentencePiece). The 'build_loss_labels()' function can be adjusted to include or exclude other parts of the conversation from the loss, or to implement different weighting strategies. The 'train_sft.py' component can be extended with different optimizers, learning rate schedules, or regularization techniques.

Key concepts

Supervised Fine-Tuning (SFT)
Assistant-Only Loss Masking
Conversational Data Processing
BPE Tokenization
Prompt Engineering