This flowchart illustrates the sequential steps for preparing a raw Chinese corpus for Byte Pair Encoding (BPE) training, covering cleaning, splitting, and EOS delimiter insertion.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
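D0["Raw Chinese corpus<br>~1.41 million entries"] --> D1["Cleaning and normalization<br>Deduplication · HTML cleanup · Whitespace compression · Length filtering"]
D1 --> D2["Deterministic train-valid split<br>SHA1 hash"]
D2 --> D3["Insert EOS document delimiters"]
D3 --> D4["Outputs<br>train.txt · valid.txt · tokenizer_corpus.txt"]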
classDef data fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef step fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
class D0 data;
class D1,D2,D3 step;
class D4 out;
This diagram outlines the process of preparing raw Chinese corpus data for BPE training. It starts with a raw corpus of about 1.41 million entries and proceeds through cleaning and normalization (deduplication, HTML cleanup, whitespace compression, length filtering). It then performs a deterministic train-valid split based on SHA1 hashing, so the same document always lands in the same split, and finally inserts EOS delimiters to mark document boundaries. The outputs are train.txt, valid.txt, and tokenizer_corpus.txt.
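The cleaning stage named in the diagram can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the regex-based tag stripping, the exact-match deduplication, and the `min_chars`/`max_chars` thresholds are all assumptions chosen for the example.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")  # naive HTML tag matcher (assumed rule)
WS_RE = re.compile(r"\s+")       # any run of whitespace, including U+3000


def clean_text(raw: str) -> str:
    """Strip tags, decode HTML entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text)
    return WS_RE.sub(" ", text).strip()


def clean_corpus(docs, min_chars=10, max_chars=10_000):
    """Yield cleaned, length-filtered, exactly-deduplicated documents.

    The thresholds are hypothetical; the diagram does not state them.
    """
    seen = set()
    for raw in docs:
        text = clean_text(raw)
        if not (min_chars <= len(text) <= max_chars):
            continue  # length filter
        if text in seen:
            continue  # exact duplicate after normalization
        seen.add(text)
        yield text
```

Deduplicating after normalization means two entries that differ only in markup or spacing collapse into one. Keeping every cleaned string in a set is simple, but for a corpus of this size storing a hash of each string instead would reduce memory use.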
Use this diagram when designing or documenting a data preparation pipeline for training a BPE tokenizer for large language models, especially when reproducibility, explicit document boundaries, and robust cleaning are critical. It's suitable for initial data ingestion and preprocessing stages.
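Reproducibility here comes from deriving the split from each document's content rather than from a random number generator. A minimal sketch, assuming the document text itself is the hash key and assuming a 1% validation share (neither detail is given in the diagram; `assign_split` and `valid_ratio` are hypothetical names):

```python
import hashlib

NUM_BUCKETS = 10_000  # bucket resolution; an assumption for this sketch


def assign_split(text: str, valid_ratio: float = 0.01) -> str:
    """Return "train" or "valid" based on the SHA1 hash of the text."""
    digest = hashlib.sha1(text.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to an integer bucket.
    bucket = int.from_bytes(digest[:8], "big") % NUM_BUCKETS
    threshold = round(valid_ratio * NUM_BUCKETS)
    return "valid" if bucket < threshold else "train"
```

Because the assignment depends only on the text, rerunning the pipeline, reordering the input, or adding new documents never moves an existing document between splits.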
This flow can be adapted to other languages by modifying the cleaning rules and how the tokenizer corpus is generated. Additional cleaning steps can be added, and the SHA1-based split can be replaced with another splitting method. The EOS delimiter can be swapped for a different document-boundary marker, and the output formats can be adjusted to fit specific model training requirements.
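One way to keep the delimiter easy to swap is to pass it as a parameter. The sketch below is illustrative only: the `<|endoftext|>` string, the one-line-per-document layout, and the function name are assumptions, since the diagram does not specify the actual EOS token or file format.

```python
from typing import Iterable, Iterator

EOS_TOKEN = "<|endoftext|>"  # hypothetical marker; replace with the real one


def iter_lines_with_eos(docs: Iterable[str], eos: str = EOS_TOKEN) -> Iterator[str]:
    """Yield each document followed by a delimiter line."""
    for doc in docs:
        yield doc
        yield eos


def write_split(path: str, docs: Iterable[str], eos: str = EOS_TOKEN) -> None:
    """Write documents to a UTF-8 text file with a delimiter after each one."""
    with open(path, "w", encoding="utf-8") as out:
        for line in iter_lines_with_eos(docs, eos):
            out.write(line + "\n")
```

For example, `write_split("train.txt", train_docs)` would produce a file in which every document is followed by the delimiter on its own line, so document boundaries remain explicit after concatenation.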