GPT-2 Style Byte-Pair Encoding (BPE) Tokenizer Training

ML & AI · flowchart diagram · MIT

Illustrates the step-by-step process of training a GPT-2 style Byte-Pair Encoding (BPE) tokenizer, from initial corpus to final vocabulary files.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
BPE · Tokenizer · NLP · Machine Learning · Large Language Models · Vocabulary · Data Preprocessing

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    T0["tokenizer_corpus.txt"] --> T1["GPT-2 style pre-tokenization<br>each Chinese character is its own word"]
    T1 --> T2["byte-level initial vocabulary<br>256 bytes"]
    T2 --> T3["Count adjacent pair frequencies"]
    T3 --> T4["Iterative merge<br>until vocab_size=6400"]
    T4 --> T5["Outputs<br>vocab.json + merge.txt"]

    classDef tok fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef merge fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    class T0,T1,T2,T3 tok;
    class T4 merge;
    class T5 out;

What this diagram shows

This flowchart shows the training pipeline for a GPT-2 style Byte-Pair Encoding (BPE) tokenizer. Training starts from a tokenizer_corpus.txt file, which is split by GPT-2 style pre-tokenization, with each Chinese character treated as its own word. The vocabulary is initialized with the 256 possible byte values. Training then counts the frequency of adjacent token pairs and repeatedly merges the most frequent pair into a new token until the vocabulary reaches the target vocab_size of 6400. The outputs are two files: vocab.json and merge.txt.
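As a rough illustration of the steps in the flowchart, the following Python sketch trains a tiny byte-level BPE vocabulary. It is not the MicroLM implementation: the function name train_bpe, the simplified PRETOKENIZE pattern (a stand-in for GPT-2's actual regex), and the tie-breaking rule are assumptions made for this example, and writing vocab.json and merge.txt is left out.

```python
import re
from collections import Counter

# Assumption: a simplified stand-in for GPT-2's pre-tokenization regex.
# Each CJK character becomes its own piece; other text is split into letter,
# digit, and punctuation runs (each with an optional leading space).
PRETOKENIZE = re.compile(
    r"[\u4e00-\u9fff]| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9\u4e00-\u9fff]+|\s+"
)


def train_bpe(text, vocab_size):
    """Return (vocab, merges); vocab maps token id -> bytes."""
    # Byte-level initial vocabulary: one token per possible byte value.
    vocab = {i: bytes([i]) for i in range(256)}
    merges = []

    # Each distinct pre-token is a tuple of token ids, weighted by its count.
    words = Counter(
        tuple(piece.encode("utf-8")) for piece in PRETOKENIZE.findall(text)
    )

    while len(vocab) < vocab_size:
        # Count adjacent pairs inside each pre-token (never across pieces).
        pairs = Counter()
        for word, freq in words.items():
            for pair in zip(word, word[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # nothing left to merge

        # Most frequent pair; ties go to the larger id pair (an assumption).
        best = max(pairs, key=lambda p: (pairs[p], p))
        new_id = len(vocab)
        vocab[new_id] = vocab[best[0]] + vocab[best[1]]
        merges.append(best)

        # Replace every occurrence of the best pair with the new token id.
        merged_words = Counter()
        for word, freq in words.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(new_id)
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged_words[tuple(out)] += freq
        words = merged_words

    return vocab, merges
```

For the diagram's setting, one would call train_bpe(corpus_text, 6400); the loop stops early if the corpus runs out of pairs to merge. Because pairs are counted inside each pre-token, no learned token can span two Chinese characters under this pattern.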

When to use it

Use this diagram when designing or implementing a custom tokenizer for large language models (LLMs) or other natural language processing (NLP) tasks, especially when you want a subword tokenization strategy such as BPE. It is particularly relevant when the model needs a fixed vocabulary size or the corpus contains text such as Chinese, whose characters span multiple bytes in UTF-8.

How to adapt it for your project

Adapt this process by changing the pre-tokenization rules for other languages or data types, adjusting the target vocab_size to suit model size and performance requirements, or changing the criterion used to pick the pair to merge (for example, a metric other than raw frequency). The initial byte-level vocabulary can also be customized.
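To make two of these knobs concrete, the sketch below compares two pre-tokenization patterns and computes how many merges a given vocab_size implies. Both patterns and the helper names (pretokenize, num_merges, num_special) are hypothetical and chosen for illustration; they are not taken from the project.

```python
import re

# Hypothetical patterns for illustration; neither is the project's actual regex.
CJK = "\u4e00-\u9fff"
REST = rf" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9{CJK}]+|\s+"
PER_CHAR = re.compile(rf"[{CJK}]|{REST}")   # each CJK character is a piece
CJK_RUNS = re.compile(rf"[{CJK}]+|{REST}")  # consecutive CJK characters group


def pretokenize(text, pattern):
    """Split text into the pieces within which BPE merges may happen."""
    return pattern.findall(text)


def num_merges(vocab_size, num_special=0):
    """Merges needed to grow from 256 byte tokens to vocab_size.

    num_special is an assumed count of reserved special tokens, if any.
    """
    return vocab_size - 256 - num_special


sample = "BPE 分词 v2"
per_char = pretokenize(sample, PER_CHAR)
runs = pretokenize(sample, CJK_RUNS)
print(per_char)          # ['BPE', ' ', '分', '词', ' v', '2']
print(runs)              # ['BPE', ' ', '分词', ' v', '2']
print(num_merges(6400))  # 6144
```

With the per-character pattern, merges can only rebuild single Chinese characters from their bytes. Grouping runs instead lets frequent multi-character words become single tokens, at the cost of spending vocabulary slots on them.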

Key concepts