BPE Training Path for MicroLM

ML & AI · flowchart diagram · MIT

This flowchart illustrates the Byte Pair Encoding (BPE) training process for MicroLM, from initial corpus preparation to generating a custom 6400-token vocabulary.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
BPE, Tokenizer, Vocabulary, NLP, Language Model, MicroLM, AI

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    T0["tokenizer_corpus.txt"] --> T1["GPT-2-style pre-tokenization<br>each Chinese character is its own word"]
    T1 --> T2["Byte-level initial vocabulary<br>256 bytes"]
    T2 --> T3["Count adjacent pair frequencies"]
    T3 --> T4["Iterative merge<br>until vocab_size=6400"]
    T4 --> T5["Outputs<br>vocab.json + merge.txt"]

    classDef tok fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef merge fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    class T0,T1,T2,T3 tok;
    class T4 merge;
    class T5 out;

What this diagram shows

This diagram shows the Byte Pair Encoding (BPE) training pipeline for the MicroLM project. Training starts from tokenizer_corpus.txt, which is split by GPT-2-style pre-tokenization, with each Chinese character treated as its own word. The vocabulary is initialized with the 256 possible byte values. Training then counts adjacent token pairs and repeatedly merges the most frequent pair into a new token until the vocabulary reaches 6400 entries. The outputs are vocab.json and merge.txt, which by GPT-2 convention hold the token vocabulary and the ordered merge rules that the tokenizer uses to turn raw text into model input.
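
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not MicroLM's actual implementation: the regex is a simplified stand-in for the GPT-2 pre-tokenization pattern (it only isolates characters in the CJK Unified Ideographs block, U+4E00 to U+9FFF), and the names pretokenize, merge_pair, and train_bpe, as well as the tie-breaking rule, are assumptions made for the example.

```python
import re
from collections import Counter

# Assumption: simplified stand-in for the GPT-2 pre-tokenization regex.
# Each CJK character becomes its own pre-token; other text is split into
# runs of non-space characters with an optional leading space.
PRETOKENIZE = re.compile(r"[\u4e00-\u9fff]| ?[^\s\u4e00-\u9fff]+|\s+")


def pretokenize(text):
    """Split raw text into pre-tokens; merges never cross these boundaries."""
    return PRETOKENIZE.findall(text)


def merge_pair(word, pair, new_id):
    """Replace every occurrence of `pair` in `word` with `new_id`."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(word[i])
            i += 1
    return tuple(out)


def train_bpe(text, vocab_size):
    """Return (vocab, merges) for a byte-level BPE trained on `text`."""
    # Each distinct pre-token is stored once, as a tuple of byte values.
    words = Counter(tuple(tok.encode("utf-8")) for tok in pretokenize(text))
    # Initial vocabulary: the 256 possible byte values.
    vocab = {i: bytes([i]) for i in range(256)}
    merges = []
    while len(vocab) < vocab_size:
        # Count adjacent pairs, weighted by pre-token frequency.
        # Recounting every round is slow but keeps the sketch short.
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # corpus too small to reach the requested size
        # Assumption: ties are broken by the larger pair, for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        new_id = len(vocab)
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
        merges.append(best)
        merged = Counter()
        for word, freq in words.items():
            merged[merge_pair(word, best, new_id)] += freq
        words = merged
    return vocab, merges


vocab, merges = train_bpe("low low low lower lowest 中文中文", vocab_size=259)
print([vocab[i] for i in range(256, 259)])  # [b'ow', b'low', b' low']
```

Because pairs are counted inside each pre-token only, no merge in this sketch can span two Chinese characters; the three UTF-8 bytes of a single character can still merge into one token.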

When to use it

Use this diagram when designing or explaining the tokenization process for a new language model, particularly when a custom Byte Pair Encoding (BPE) tokenizer is required with a specific vocabulary size. It's relevant for understanding how a model's input language is defined from raw text.

How to adapt it for your project

This process can be adapted by changing the initial pre-tokenization rules (e.g., for different languages or specific tokenization needs), modifying the initial vocabulary (e.g., starting with a larger base), or adjusting the target vocab_size to suit different model sizes or language complexities. The corpus used for training can also be swapped.
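
One way to keep these knobs in one place is a small configuration object. The sketch below is hypothetical: BPEConfig, its field names, and the num_special_tokens setting are illustrative assumptions rather than part of MicroLM, and the default regex is the same simplified stand-in for GPT-2-style pre-tokenization, not the original pattern.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class BPEConfig:
    """Hypothetical container for the adjustable parts of the pipeline."""

    corpus_path: str = "tokenizer_corpus.txt"
    vocab_size: int = 6400
    base_vocab_size: int = 256      # byte-level starting vocabulary
    num_special_tokens: int = 0     # assumption: none reserved by default
    # Assumption: simplified rule that isolates each CJK character.
    pretokenize_pattern: str = r"[\u4e00-\u9fff]| ?[^\s\u4e00-\u9fff]+|\s+"

    def num_merges(self):
        """Number of merge steps needed to reach the target vocabulary size."""
        n = self.vocab_size - self.base_vocab_size - self.num_special_tokens
        if n < 0:
            raise ValueError("vocab_size is smaller than the base vocabulary")
        return n

    def pretokenize(self, text):
        """Split text with the configured pre-tokenization rule."""
        return re.findall(self.pretokenize_pattern, text)


default = BPEConfig()
print(default.num_merges())               # 6144
print(default.pretokenize("你好 world"))   # ['你', '好', ' world']

# Swapping the rule: plain whitespace splitting lets merges span CJK characters.
loose = BPEConfig(pretokenize_pattern=r"\S+|\s+")
print(loose.pretokenize("你好 world"))     # ['你好', ' ', 'world']
```

With the diagram's settings and no reserved tokens, reaching 6400 entries from 256 bytes takes 6144 merges; reserving special tokens or raising vocab_size changes that budget accordingly.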

Key concepts

Byte Pair Encoding
Tokenization
Vocabulary Training
Language Model
Pre-training