BPE Training Data Preparation Flow

Data Pipelines · flowchart diagram · MIT

This flowchart illustrates the sequential steps for preparing a raw Chinese corpus for Byte Pair Encoding (BPE) training, covering cleaning, splitting, and EOS document-delimiter insertion.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
BPE · Tokenizer training · Data pipeline · NLP · Large Language Models · Corpus · Data preprocessing

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
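    D0["Raw Chinese corpus<br>~1.41 million entries"] --> D1["Cleaning and normalization<br>Deduplication · HTML cleanup · Whitespace compression · Length filtering"]
    D1 --> D2["Deterministic train-valid split<br>SHA1 hash"]
    D2 --> D3["Insert EOS document delimiter"]
    D3 --> D4["Outputs<br>train.txt · valid.txt · tokenizer_corpus.txt"]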

    classDef data fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef step fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
    classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    class D0 data;
    class D1,D2,D3 step;
    class D4 out;

What this diagram shows

This diagram outlines how a raw Chinese corpus of roughly 1.41 million entries is prepared for BPE training. The corpus first passes through cleaning and normalization (deduplication, HTML cleanup, whitespace compression, length filtering). It is then split into training and validation sets deterministically using a SHA1 hash, so the same document lands in the same split on every run. Finally, an EOS token is inserted as a delimiter between documents. The outputs are train.txt, valid.txt, and tokenizer_corpus.txt.
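The cleaning and splitting stages can be sketched in Python as follows. This is a minimal illustration, not the repository's implementation: the function names (`clean`, `prepare`, `assign_split`), the length thresholds, and the validation share are all assumptions chosen for the example, and the regex-based tag stripping is a simplification of real HTML cleanup.

```python
import hashlib
import html
import re

TAG_RE = re.compile(r"<[^>]+>")   # naive HTML tag matcher (simplification)
WS_RE = re.compile(r"\s+")        # any run of whitespace


def clean(text):
    """Strip HTML tags, decode entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return WS_RE.sub(" ", text).strip()


def prepare(docs, min_len=10, max_len=10000):
    """Yield cleaned, length-filtered, exact-deduplicated documents.

    min_len and max_len are hypothetical thresholds in characters.
    """
    seen = set()
    for raw in docs:
        text = clean(raw)
        if not (min_len <= len(text) <= max_len):
            continue
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield text


def assign_split(text, valid_per_mille=10):
    """Map a document to 'train' or 'valid' from its SHA1 hash.

    The split depends only on the text, so it is stable across runs
    and independent of document order. valid_per_mille is an assumed
    validation share in parts per thousand.
    """
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 1000
    return "valid" if bucket < valid_per_mille else "train"
```

Because the split is derived from the content hash rather than a random number generator, adding or reordering documents does not move existing documents between train and valid.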

When to use it

Use this diagram when designing or documenting a data preparation pipeline for training a BPE tokenizer for large language models, especially when reproducibility, explicit document boundaries, and robust cleaning are critical. It's suitable for initial data ingestion and preprocessing stages.

How to adapt it for your project

This flow can be adapted to other languages by changing the cleaning rules and the way the tokenizer corpus is generated. Additional cleaning steps can be inserted before the split, and the SHA1-based split can be swapped for another method as long as it stays reproducible. The EOS delimiter can be replaced with a different document-boundary marker, and the output files can be adjusted to match the input format your training code expects.
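One way to keep the delimiter and outputs easy to adapt is to pass them as parameters. The sketch below assumes a delimiter string of `<|endoftext|>` (the source does not name the actual token) and assumes that tokenizer_corpus.txt is built from the training documents only; both choices, along with the function names `join_with_eos` and `write_outputs`, are illustrative.

```python
from pathlib import Path

# Assumed delimiter string; the source does not specify the EOS token text.
EOS = "<|endoftext|>"


def join_with_eos(docs, eos=EOS):
    """Concatenate documents, writing the delimiter on its own line after each."""
    return "".join(f"{doc}\n{eos}\n" for doc in docs)


def write_outputs(train_docs, valid_docs, out_dir, eos=EOS):
    """Write train.txt, valid.txt, and tokenizer_corpus.txt into out_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    train_text = join_with_eos(train_docs, eos)
    (out / "train.txt").write_text(train_text, encoding="utf-8")
    (out / "valid.txt").write_text(join_with_eos(valid_docs, eos), encoding="utf-8")
    # Assumption: the tokenizer is trained on the training split only,
    # so validation text never influences the learned merges.
    (out / "tokenizer_corpus.txt").write_text(train_text, encoding="utf-8")
```

Swapping the boundary marker then only requires a different `eos` argument, for example `write_outputs(train, valid, "data", eos="</s>")`.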

Key concepts

BPE training
Data cleaning
Deterministic splitting
EOS token
Corpus preparation