This flowchart illustrates the Byte Pair Encoding (BPE) training process for MicroLM, from initial corpus preparation to generating a custom 6400-token vocabulary.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
T0["tokenizer_corpus.txt"] --> T1["GPT-2 style pre-tokenization<br>each Chinese character is its own word"]
T1 --> T2["Byte-level initial vocabulary<br>256 bytes"]
T2 --> T3["Count adjacent pair frequencies"]
T3 --> T4["Iterative merge<br>until vocab_size=6400"]
T4 --> T5["Outputs<br>vocab.json + merge.txt"]
classDef tok fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef merge fill:#fff7ed,stroke:#fb923c,color:#0f172a;
classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
class T0,T1,T2,T3 tok;
class T4 merge;
class T5 out;
This diagram shows the Byte Pair Encoding (BPE) training pipeline for the MicroLM project. Training starts from tokenizer_corpus.txt, which is split with GPT-2 style pre-tokenization; each Chinese character is treated as an independent word, so merges stay within a single character. The initial vocabulary is byte-level: the 256 possible byte values. Training then counts the frequency of every adjacent token pair and merges the most frequent pair into a new token, repeating until the vocabulary reaches vocab_size=6400. The final outputs are vocab.json and merge.txt, which hold the vocabulary and the learned merge rules used to tokenize the model's input text.
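The loop described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not MicroLM's actual trainer: the `PRETOKENIZE` regex is a simplified stand-in for GPT-2 style pre-tokenization, the names `train_bpe` and `merge_word` are hypothetical, ties between equally frequent pairs are broken arbitrarily but deterministically, and writing vocab.json and merge.txt is omitted.

```python
import re
from collections import Counter

# Assumption: a simplified stand-in for GPT-2 style pre-tokenization.
# Each CJK character is its own chunk; other text is split into runs of
# non-space characters with an optional leading space.
PRETOKENIZE = re.compile(r"[\u4e00-\u9fff]| ?[^\s\u4e00-\u9fff]+|\s+")


def merge_word(word, pair, new_id):
    """Replace every occurrence of `pair` in `word` with `new_id`."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def train_bpe(text, vocab_size):
    """Return (vocab, merges) learned from `text` by byte-level BPE."""
    # Initial vocabulary: the 256 possible byte values, ids 0..255.
    vocab = {i: bytes([i]) for i in range(256)}
    merges = []
    # Each distinct pre-token becomes a tuple of byte ids with a count.
    words = Counter(
        tuple(chunk.encode("utf-8")) for chunk in PRETOKENIZE.findall(text)
    )
    while len(vocab) < vocab_size:
        # Count adjacent pairs, weighted by how often each word occurs.
        pairs = Counter()
        for word, count in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += count
        if not pairs:
            break  # nothing left to merge; the corpus is too small
        # Most frequent pair; the pair itself breaks ties deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        new_id = len(vocab)
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
        merges.append(best)
        # Apply the merge to every word before recounting.
        merged = Counter()
        for word, count in words.items():
            merged[merge_word(word, best, new_id)] += count
        words = merged
    return vocab, merges


if __name__ == "__main__":
    vocab, merges = train_bpe("low low low lower 你好你好", vocab_size=260)
    for left, right in merges:
        print(vocab[left], "+", vocab[right])
```

Because pairs are counted inside each pre-token only, a merge can never join bytes from two different Chinese characters or cross a word boundary. Recounting all pairs on every iteration keeps the sketch short; a production trainer would update counts incrementally.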
Use this diagram when designing or explaining the tokenization process for a new language model, particularly when a custom Byte Pair Encoding (BPE) tokenizer with a specific vocabulary size is required. It shows how raw text is turned into the vocabulary and merge rules that define the model's input tokens.
This process can be adapted in several ways: change the pre-tokenization rules (e.g., for other languages or specific tokenization needs), modify the initial vocabulary (e.g., start from a larger base), or adjust the target vocab_size to suit a different model size or language complexity. The training corpus can also be swapped.
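To make the first adaptation point concrete, the sketch below compares two pre-tokenization rules on the same input. Both regexes and the helper `pretokenize` are illustrative assumptions, not MicroLM's actual rules: one isolates each Chinese character, the other keeps runs of Chinese characters together so that BPE could later merge across them.

```python
import re

CJK = r"\u4e00-\u9fff"  # basic CJK Unified Ideographs range (assumption)

# Rule A: every CJK character is its own pre-token.
RULE_CJK_ISOLATED = re.compile(rf"[{CJK}]| ?[^\s{CJK}]+|\s+")

# Rule B: split on whitespace only, so CJK runs stay in one pre-token.
RULE_WHITESPACE = re.compile(r" ?\S+|\s+")


def pretokenize(text, rule):
    """Split `text` into pre-tokens; BPE merges never cross these chunks."""
    return rule.findall(text)


if __name__ == "__main__":
    sample = "BPE 训练 ok"
    print(pretokenize(sample, RULE_CJK_ISOLATED))
    print(pretokenize(sample, RULE_WHITESPACE))
```

Under Rule A a multi-character Chinese word can never become a single token, which keeps the vocabulary budget for byte and character pieces; Rule B allows such tokens at the cost of spending merges on them.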