SFT Data Protocol and Training Path

ML & AI · flowchart diagram · MIT

This diagram illustrates the data processing pipeline for Supervised Fine-Tuning (SFT) of a language model, emphasizing the critical 'assistant-only masked

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
LLM Fine-tuning NLP Data Preprocessing Machine Learning Chatbot AI

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    S1["conversations JSONL"] --> S2["normalize_conversations()"]
    S2 --> S3["maybe_add_system_prompt()"]
    S3 --> S4["render_chat_prompt()<br>role markers → 线性文本"]
    S4 --> S5["BPETokenizer.encode()"]
    S5 --> S6["build_loss_labels()<br>仅保留 assistant 区间"]
    S6 --> S7["(input_ids, labels)"]
    S7 --> S8["train_sft.py<br>assistant-only masked loss"]

    classDef data fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef proto fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
    classDef loss fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    class S1 data;
    class S2,S3,S4,S5,S7 proto;
    class S6,S8 loss;
    class out out;

What this diagram shows

The diagram details the steps involved in preparing conversational data for Supervised Fine-Tuning (SFT) of a language model. It starts with raw JSONL conversations, normalizes them, adds system prompts, renders them into linear text, and then tokenizes them. The crucial step is build_loss_labels(), which masks out all tokens except those from the assistant's reply, ensuring that only the assistant's responses contribute to the loss during training.

When to use it

This pipeline is ideal when fine-tuning large language models for conversational AI tasks, where the goal is to improve the model's ability to generate high-quality assistant responses without being influenced by user prompts or system instructions during loss calculation. It's particularly useful for focusing model capacity on generating relevant and coherent replies.

How to adapt it for your project

This pipeline can be adapted by integrating different tokenizers (e.g., SentencePiece, WordPiece), customizing prompt rendering for various chat formats, or modifying the loss masking strategy to include other parts of the conversation if needed. The maybe_add_system_prompt() step can be extended for more complex prompt engineering, and the build_loss_labels() logic can be adjusted for multi-turn or role-specific loss weighting.

Key concepts