This flowchart illustrates how text data is prepared for machine learning models: BPE tokenization followed by efficient storage using memmap.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
X1["train.txt / valid.txt"] --> X2["BPETokenizer.encode()"]
X2 --> X3["Long token ID sequence"]
X3 --> X4["Write to .npy memmap"]
X4 --> X5["train_ids.npy / valid_ids.npy"]
classDef data fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
class X1,X2,X3,X4 data;
class X5 out;
The diagram shows how the raw text files (train.txt, valid.txt) become tokenized sequences. BPETokenizer.encode() converts each file into one long sequence of token IDs, which is then written to a memory-mapped .npy file. The outputs, train_ids.npy and valid_ids.npy, are ready for model training.
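The steps in the diagram can be sketched in Python roughly as follows. This is a minimal illustration, not the original implementation: `ToyTokenizer` is a hypothetical stand-in for `BPETokenizer` (only an `encode()` method that returns a list of integer IDs is assumed), `encode_to_npy` is a made-up helper name, and the `uint16` dtype assumes a vocabulary smaller than 65,536 entries. Encoding line by line also assumes the tokenizer never merges across newlines.

```python
import numpy as np


class ToyTokenizer:
    """Hypothetical stand-in for BPETokenizer: one token ID per UTF-8 byte."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))


def encode_to_npy(txt_path, npy_path, tokenizer, dtype=np.uint16):
    """Encode a text file and store its token IDs in a memory-mappable .npy file."""
    # Pass 1: count tokens so the output array can be allocated up front.
    with open(txt_path, encoding="utf-8") as f:
        n_tokens = sum(len(tokenizer.encode(line)) for line in f)

    # open_memmap writes a valid .npy header and returns a disk-backed array.
    ids = np.lib.format.open_memmap(
        npy_path, mode="w+", dtype=dtype, shape=(n_tokens,)
    )

    # Pass 2: encode line by line and copy each chunk into its slot,
    # so the full ID sequence never has to sit in RAM at once.
    pos = 0
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            chunk = tokenizer.encode(line)
            ids[pos:pos + len(chunk)] = chunk
            pos += len(chunk)

    ids.flush()  # make sure everything reaches the file
    return n_tokens
```

Calling `encode_to_npy("train.txt", "train_ids.npy", tokenizer)` and then the same for `valid.txt` would produce the two output files from the diagram. The two-pass layout trades a second tokenization pass for a fixed, small memory footprint; a single pass that appends to a growing buffer would be faster for data that fits in RAM.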
Use this process when preparing large text datasets for training language models or other NLP tasks, where efficient memory management and fast data access are crucial. It's particularly useful for datasets that cannot fit entirely into RAM.
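To show why the memmap format helps with data that exceeds RAM, here is a hedged sketch of reading training windows from such a file. The function name `sample_batch` and its parameters are illustrative assumptions, and the input/target shift assumes a next-token-prediction setup, which the diagram itself does not specify.

```python
import numpy as np


def sample_batch(ids, batch_size, context_len, rng):
    """Draw random (input, target) windows from a 1-D token-ID array."""
    # Each window needs context_len + 1 tokens: the inputs plus shifted targets.
    starts = rng.integers(0, len(ids) - context_len, size=batch_size)
    # Slicing a memmap reads only the pages that back each window.
    x = np.stack([ids[s:s + context_len] for s in starts])
    y = np.stack([ids[s + 1:s + 1 + context_len] for s in starts])
    return x.astype(np.int64), y.astype(np.int64)
```

A file such as train_ids.npy can be opened with `np.load("train_ids.npy", mmap_mode="r")` and passed in as `ids`, together with `np.random.default_rng()` as `rng`. Only the sampled windows are copied into memory, while the rest of the array stays on disk.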
This workflow can be adapted by using a different tokenizer (e.g., WordPiece, SentencePiece), modifying the tokenization parameters, or choosing an alternative storage format. The memmap step can be replaced with in-memory processing for smaller datasets, or combined with a distributed file system for larger ones.