Illustrates the complete data preparation pipeline for pre-training a MicroLM, from a raw Chinese corpus to cleaned, split, and EOS-delimited datasets.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
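D0["Raw Chinese corpus<br>~1.41M entries"] --> D1["Cleaning and normalization<br>Deduplication · HTML cleanup · Whitespace compression · Length filtering"]
D1 --> D2["Deterministic train-valid split<br>SHA1 hash"]
D2 --> D3["Insert EOS document delimiters"]
D3 --> D4["Outputs<br>train.txt · valid.txt · tokenizer_corpus.txt"]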
classDef data fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef step fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
class D0 data;
class D1,D2,D3 step;
class D4 out;
This diagram shows the sequential steps for preparing pre-training data for a language model. Starting from a raw Chinese corpus of roughly 1.41 million entries, the pipeline cleans and normalizes the text (deduplication, HTML cleanup, whitespace compression, length filtering), assigns each document to the train or validation set deterministically using SHA1 hashing, and then inserts EOS (end-of-sequence) tokens as document delimiters. The outputs are train.txt, valid.txt, and tokenizer_corpus.txt.
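The cleaning, splitting, and EOS steps can be sketched in Python as follows. This is a minimal illustration, not the pipeline's actual code: the EOS string, the validation percentage, the length thresholds, the regex-based tag stripping, exact-match deduplication, and all function names are assumptions made for the example. Writing the three output files is omitted because the source does not say how tokenizer_corpus.txt is derived.

```python
import hashlib
import html
import re

# All constants below are assumed values for illustration only.
EOS = "<|endoftext|>"   # hypothetical document delimiter
VALID_PERCENT = 1       # hypothetical share of documents sent to validation
MIN_CHARS = 20          # hypothetical length-filter bounds
MAX_CHARS = 20000

_TAG_RE = re.compile(r"<[^>]+>")
_WS_RE = re.compile(r"\s+")


def clean(text):
    """Strip HTML tags, decode entities, and collapse runs of whitespace."""
    text = _TAG_RE.sub(" ", text)
    text = html.unescape(text)
    return _WS_RE.sub(" ", text).strip()


def split_of(text, valid_percent=VALID_PERCENT):
    """Pick a split from the SHA1 of the text, so reruns give the same answer."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return "valid" if int(digest, 16) % 100 < valid_percent else "train"


def prepare(docs, min_chars=MIN_CHARS, max_chars=MAX_CHARS):
    """Clean, length-filter, deduplicate, split, and append EOS to each document."""
    seen = set()
    out = {"train": [], "valid": []}
    for raw in docs:
        text = clean(raw)
        if not (min_chars <= len(text) <= max_chars):
            continue
        if text in seen:  # exact-match deduplication on the cleaned text
            continue
        seen.add(text)
        out[split_of(text)].append(text + EOS)
    return out
```

Because the split depends only on the document's content, adding or reordering documents in the corpus does not move existing documents between train and validation, which a seeded random shuffle would not guarantee.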
Use this flow when preparing large text corpora for language model pre-training, especially when deterministic processing, explicit document boundaries, and robust cleaning are critical for model performance and reproducibility.
This flow can be adapted by modifying the cleaning rules for different data sources or languages, swapping in another deterministic splitting method, or generating the tokenizer corpus with a different strategy. The EOS delimiter can be customized to match the requirements of the specific model architecture.
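One way to keep these adaptation points open is to pass the cleaning rules, the hash-based bucketing function, and the EOS string in as parameters. The sketch below is hypothetical: `build_pipeline`, `strip_urls`, `collapse_whitespace`, and `blake2_bucket` are names invented for this example, and BLAKE2 is shown only as one possible deterministic alternative to SHA1.

```python
import hashlib
import re
from typing import Callable, Iterable, Tuple

CleanRule = Callable[[str], str]


def collapse_whitespace(text: str) -> str:
    """Example rule: collapse whitespace runs into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


def strip_urls(text: str) -> str:
    """Example source-specific rule: drop http(s) URLs."""
    return re.sub(r"https?://\S+", "", text)


def blake2_bucket(text: str, buckets: int = 100) -> int:
    """Alternative deterministic bucketing using BLAKE2b instead of SHA1."""
    digest = hashlib.blake2b(text.encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big") % buckets


def build_pipeline(
    rules: Iterable[CleanRule],
    eos: str,
    bucket_fn: Callable[[str], int] = blake2_bucket,
    valid_buckets: int = 1,  # assumed: buckets below this go to validation
) -> Callable[[str], Tuple[str, str]]:
    """Return a function mapping a raw document to (split, text + eos)."""
    rules = list(rules)

    def process(doc: str) -> Tuple[str, str]:
        for rule in rules:
            doc = rule(doc)
        split = "valid" if bucket_fn(doc) < valid_buckets else "train"
        return split, doc + eos

    return process
```

For instance, `build_pipeline([strip_urls, collapse_whitespace], eos="</s>")` would configure the same flow for a web-scraped corpus and a model that uses `</s>` as its delimiter; both choices are assumptions for illustration.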