Illustrates the process of converting raw text data into token ID sequences and storing them in memory-mapped files for efficient language model training.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
X1["train.txt / valid.txt"] --> X2["BPETokenizer.encode()"]
X2 --> X3["Long token ID sequence"]
X3 --> X4["Write to .npy memmap"]
X4 --> X5["train_ids.npy / valid_ids.npy"]
classDef data fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
class X1,X2,X3,X4 data;
class X5 out;
This flowchart depicts the data preparation pipeline for a language model: raw text files (train.txt, valid.txt) are encoded with BPETokenizer.encode() into one long token ID sequence, which is then written to memory-mapped .npy files (train_ids.npy, valid_ids.npy) so that training can read token IDs without loading the whole corpus into memory.
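The pipeline can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not the diagram's actual implementation: `ByteTokenizer` is a hypothetical stand-in for `BPETokenizer` (whose interface beyond `encode()` is not shown here), `encode_to_npy` is an illustrative helper name, and the sketch assumes that encoding line by line gives the same IDs as encoding the whole file, which may not hold for every tokenizer. It also assumes the vocabulary fits in `uint16` (at most 65,536 IDs).

```python
import numpy as np


class ByteTokenizer:
    """Hypothetical stand-in for BPETokenizer: one token ID per UTF-8 byte."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))


def encode_to_npy(txt_path, npy_path, tokenizer, dtype=np.uint16):
    """Tokenize a text file and write the IDs into a memory-mapped .npy file."""
    # Pass 1: count tokens so the output array can be allocated up front.
    n_tokens = 0
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            n_tokens += len(tokenizer.encode(line))

    # Create a .npy file on disk and map it into memory for writing.
    ids = np.lib.format.open_memmap(
        npy_path, mode="w+", dtype=dtype, shape=(n_tokens,)
    )

    # Pass 2: encode again and copy each chunk of IDs into place.
    pos = 0
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            chunk = tokenizer.encode(line)
            ids[pos : pos + len(chunk)] = chunk
            pos += len(chunk)

    ids.flush()  # make sure everything reaches the file
    del ids      # release the mapping
    return n_tokens
```

Calling `encode_to_npy("train.txt", "train_ids.npy", ByteTokenizer())` and then the same for `valid.txt` would produce the two output files in the diagram. The two-pass design trades a second tokenization pass for never holding the full ID sequence in RAM; for small corpora, encoding once into a list and calling `np.save` is simpler.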
Use this diagram to understand or design data ingestion and tokenization workflows for large language models, especially when dealing with large corpora that require memory-efficient storage and access during training.
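To make the "memory-efficient access during training" point concrete, here is a hedged sketch of how a training loop might draw batches from one of the .npy files. The function name `sample_batch` and its parameters are illustrative assumptions, not part of the diagram; the sketch assumes next-token prediction, so targets are the inputs shifted by one position, and that the file holds more than `context_len` tokens.

```python
import numpy as np


def sample_batch(npy_path, batch_size, context_len, rng):
    """Draw random (input, target) windows from a memory-mapped token file."""
    # mmap_mode="r" maps the file read-only; only the slices touched below
    # are actually read from disk.
    ids = np.load(npy_path, mmap_mode="r")

    # Each window needs context_len inputs plus one extra token for the
    # shifted targets, so the last valid start is len(ids) - context_len - 1.
    starts = rng.integers(0, len(ids) - context_len, size=batch_size)

    x = np.stack([ids[s : s + context_len] for s in starts]).astype(np.int64)
    y = np.stack([ids[s + 1 : s + 1 + context_len] for s in starts]).astype(np.int64)
    return x, y
```

A call such as `sample_batch("train_ids.npy", 32, 256, np.random.default_rng(0))` returns two arrays of shape `(32, 256)` while leaving the rest of the corpus on disk.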
Adapt this workflow by substituting BPETokenizer with other tokenizers (e.g., WordPiece, SentencePiece), changing the output format from .npy to other memory-mapped or streaming formats, or integrating pre-processing steps like text cleaning or normalization before tokenization.
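One way to keep the tokenizer swappable and to slot in a normalization step is to depend only on an `encode()` method. The sketch below is an assumption-laden illustration: the `Tokenizer` protocol, `WhitespaceTokenizer`, `normalize`, and `prepare` are hypothetical names introduced here, and the toy tokenizer merely stands in for a real WordPiece or SentencePiece model.

```python
import unicodedata
from typing import Protocol


class Tokenizer(Protocol):
    """Anything with encode(text) -> list of token IDs can be plugged in."""

    def encode(self, text: str) -> list[int]: ...


class WhitespaceTokenizer:
    """Toy tokenizer: assigns the next free ID to each new whitespace-separated word."""

    def __init__(self) -> None:
        self.vocab: dict[str, int] = {}

    def encode(self, text: str) -> list[int]:
        return [self.vocab.setdefault(w, len(self.vocab)) for w in text.split()]


def normalize(text: str) -> str:
    """Example cleaning step: Unicode NFC normalization plus whitespace collapsing."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def prepare(text: str, tokenizer: Tokenizer) -> list[int]:
    """Normalize first, then hand the cleaned text to whichever tokenizer was supplied."""
    return tokenizer.encode(normalize(text))
```

With this shape, replacing the tokenizer or the cleaning step does not affect the code that writes the IDs to disk.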