Language Model Tokenization and Memmap Data Preparation

ML & AI · flowchart diagram · MIT

Illustrates the process of converting raw text data into token ID sequences and storing them in memory-mapped files for efficient language model training.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
Tokenization, Memmap, Data Preparation, Language Model, NLP, Machine Learning, BPE

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    X1["train.txt / valid.txt"] --> X2["BPETokenizer.encode()"]
    X2 --> X3["Long token ID sequence"]
    X3 --> X4["Write to .npy memmap"]
    X4 --> X5["train_ids.npy / valid_ids.npy"]

    classDef data fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    class X1,X2,X3,X4 data;
    class X5 out;

What this diagram shows

This flowchart depicts the data preparation pipeline for a language model. Raw text files (train.txt, valid.txt) are passed through BPETokenizer.encode(), which turns each file into one long sequence of token IDs. That sequence is then written to memory-mapped .npy files (train_ids.npy, valid_ids.npy), so training can read slices directly from disk instead of loading the whole corpus into RAM.
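The four steps in the diagram can be sketched in Python roughly as follows. This is a minimal illustration, not the MicroLM implementation: `byte_encode` is a hypothetical stand-in for `BPETokenizer.encode()` (it emits one ID per UTF-8 byte), `write_token_memmap` is an invented helper name, and the `uint16` dtype assumes a vocabulary smaller than 65,536 entries. The file is encoded twice so the output array can be sized before anything is written.

```python
import numpy as np


def byte_encode(text: str) -> list[int]:
    """Hypothetical stand-in for BPETokenizer.encode(): one ID per UTF-8 byte."""
    return list(text.encode("utf-8"))


def write_token_memmap(text_path, out_path, encode=byte_encode, dtype=np.uint16):
    """Encode a text file line by line and store the IDs in a .npy memmap."""
    # Pass 1: count tokens so the output array can be allocated up front.
    with open(text_path, encoding="utf-8") as f:
        total = sum(len(encode(line)) for line in f)

    # open_memmap writes a valid .npy header and returns a writable memmap.
    out = np.lib.format.open_memmap(out_path, mode="w+", dtype=dtype, shape=(total,))

    # Pass 2: encode again and copy each chunk into its slot on disk.
    pos = 0
    with open(text_path, encoding="utf-8") as f:
        for line in f:
            ids = encode(line)
            out[pos:pos + len(ids)] = ids
            pos += len(ids)

    out.flush()  # make sure everything reaches the file
    del out      # release the mapping
    return total
```

Calling `write_token_memmap("train.txt", "train_ids.npy")` and then the same for valid.txt would produce the two output files named in the diagram. Encoding line by line keeps peak memory low, at the cost of tokenizing the corpus twice.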

When to use it

Use this diagram to understand or design data ingestion and tokenization workflows for large language models, especially when dealing with large corpora that require memory-efficient storage and access during training.
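To make the "memory-efficient access during training" point concrete, here is one way a training loop might draw batches from such a file. It is a sketch under assumptions: `sample_batch` is a hypothetical helper, and the next-token layout (targets are the inputs shifted by one position) is a common convention rather than something the diagram specifies.

```python
import numpy as np


def sample_batch(ids, batch_size, context_len, rng):
    """Draw random (input, target) windows from a 1-D array of token IDs."""
    # Each window needs context_len + 1 tokens; the upper bound is exclusive.
    starts = rng.integers(0, len(ids) - context_len, size=batch_size)
    # Only the sliced windows are read; the rest of the array stays on disk
    # when ids is a memmap.
    x = np.stack([ids[s:s + context_len] for s in starts]).astype(np.int64)
    y = np.stack([ids[s + 1:s + 1 + context_len] for s in starts]).astype(np.int64)
    return x, y
```

With a real file, `ids = np.load("train_ids.npy", mmap_mode="r")` opens the array without reading it into RAM, and `sample_batch(ids, 32, 256, np.random.default_rng(0))` then touches only the sampled windows. If the file was written as a raw `np.memmap` without a .npy header, it would need to be opened with `np.memmap(path, dtype=..., mode="r")` instead.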

How to adapt it for your project

Adapt this workflow by swapping BPETokenizer for another tokenizer (e.g., WordPiece or a SentencePiece model), changing the output from .npy to another memory-mapped or streaming format, or adding pre-processing steps such as text cleaning or normalization before tokenization.
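One way to keep the tokenizer swappable is to treat it as a plain callable from text to a list of IDs and wrap it with a pre-processing step. The sketch below assumes that interface; `clean_text` and `make_pipeline` are hypothetical names, and the specific cleaning rules (NFKC normalization, collapsing whitespace within a line) are only examples of what a project might choose.

```python
import unicodedata
from typing import Callable, List

Encoder = Callable[[str], List[int]]


def clean_text(text: str) -> str:
    """Example pre-processing: Unicode normalization plus whitespace cleanup."""
    # NFKC folds compatibility forms (e.g. full-width letters) into plain ones.
    text = unicodedata.normalize("NFKC", text)
    # Collapse whitespace runs inside each line but keep the line breaks.
    return "\n".join(" ".join(line.split()) for line in text.split("\n"))


def make_pipeline(encode: Encoder,
                  preprocess: Callable[[str], str] = clean_text) -> Encoder:
    """Return an encoder that cleans text before handing it to `encode`."""
    def run(text: str) -> List[int]:
        return encode(preprocess(text))
    return run
```

Any tokenizer that exposes an `encode(text)` method returning integer IDs could be passed as `encode`, so switching tokenizers would not change the rest of the pipeline.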

Key concepts