Text Tokenization and Memmap Data Preparation

Data Pipelines · flowchart diagram · MIT

This flowchart shows how raw text is prepared for language-model training: the text is tokenized with BPE, and the resulting token IDs are stored in memory-mapped .npy files.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
Tokenization · BPE · Data Pipeline · NLP · Machine Learning · Memmap · Data Preparation

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    X1["train.txt / valid.txt"] --> X2["BPETokenizer.encode()"]
    X2 --> X3["Long token ID sequence"]
    X3 --> X4["Write to .npy memmap"]
    X4 --> X5["train_ids.npy / valid_ids.npy"]

    classDef data fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    class X1,X2,X3,X4 data;
    class X5 out;

What this diagram shows

The diagram traces how the raw text files (train.txt, valid.txt) become token ID arrays on disk. BPETokenizer.encode() converts each file's text into one long sequence of token IDs. That sequence is then written to a memory-mapped .npy file, producing train_ids.npy and valid_ids.npy, which are ready for model training.
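
The steps in the diagram can be sketched in Python roughly as follows. This is a minimal illustration, not the MicroLM implementation: ByteTokenizer is a hypothetical stand-in for BPETokenizer (it emits one ID per UTF-8 byte), text_to_npy is an invented helper name, and the uint16 dtype and line-by-line encoding are assumptions. A real BPE tokenizer may need whole-document encoding if its merges can span line breaks.

```python
import numpy as np


class ByteTokenizer:
    """Hypothetical stand-in for BPETokenizer: one token ID per UTF-8 byte."""

    def encode(self, text: str) -> list[int]:
        return list(text.encode("utf-8"))


def text_to_npy(txt_path, npy_path, tokenizer, dtype=np.uint16):
    """Encode a text file and write the token IDs to a memmap-backed .npy file."""
    # Pass 1: count tokens so the output array can be allocated up front.
    with open(txt_path, encoding="utf-8") as f:
        n_tokens = sum(len(tokenizer.encode(line)) for line in f)

    # open_memmap creates a regular .npy file (header included) and returns
    # a memmap over its data, so the full sequence never has to sit in RAM.
    out = np.lib.format.open_memmap(
        npy_path, mode="w+", dtype=dtype, shape=(n_tokens,)
    )

    # Pass 2: encode again and copy each chunk into place.
    pos = 0
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            ids = tokenizer.encode(line)
            out[pos:pos + len(ids)] = ids
            pos += len(ids)

    out.flush()  # make sure everything is written to disk
    del out
    return n_tokens
```

Calling text_to_npy("train.txt", "train_ids.npy", tokenizer) once per split would produce the two output files named in the diagram. The two-pass design trades a second encoding pass for a fixed, known array size.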

When to use it

Use this process when preparing large text datasets for training language models or other NLP tasks. It is particularly useful for datasets that cannot fit entirely into RAM: a memory-mapped array is paged in from disk on demand, so training code can index into the token sequence without loading the whole file.
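
To illustrate why the memmap format helps at training time, here is a hedged sketch of reading the file lazily and drawing random training windows from it. The helper names load_ids and sample_batch, and the next-token (input, target) layout, are assumptions for illustration and are not taken from the source project.

```python
import numpy as np


def load_ids(npy_path):
    # mmap_mode="r" maps the file read-only; data is read from disk on access.
    return np.load(npy_path, mmap_mode="r")


def sample_batch(ids, batch_size, context_len, rng):
    """Draw random windows; targets are the inputs shifted by one token."""
    if len(ids) <= context_len:
        raise ValueError("token sequence is shorter than context_len + 1")
    # Each window needs context_len + 1 tokens, so the last valid start
    # is len(ids) - context_len - 1 (the upper bound below is exclusive).
    starts = rng.integers(0, len(ids) - context_len, size=batch_size)
    x = np.stack([ids[s:s + context_len] for s in starts]).astype(np.int64)
    y = np.stack([ids[s + 1:s + 1 + context_len] for s in starts]).astype(np.int64)
    return x, y
```

Only the slices touched by a batch are actually read, which is what keeps memory use roughly independent of the dataset size.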

How to adapt it for your project

This workflow can be adapted by using different tokenizers (e.g., WordPiece, SentencePiece), modifying the tokenization parameters, or choosing alternative data storage formats. The memmap step can be replaced with in-memory processing for smaller datasets or integrated with distributed file systems for even larger scale.
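
One detail to watch when swapping tokenizers is that the storage dtype must be wide enough for the new vocabulary. The sketch below shows one way to pick it, together with the in-memory path for small datasets. All names here (Tokenizer, CharTokenizer, smallest_uint_dtype, encode_in_memory) are hypothetical and exist only for this example.

```python
from typing import Protocol

import numpy as np


class Tokenizer(Protocol):
    """Any tokenizer exposing encode(text) -> list of ints fits the pipeline."""

    def encode(self, text: str) -> list[int]: ...


class CharTokenizer:
    """Toy tokenizer for illustration: one ID per Unicode code point."""

    vocab_size = 0x110000  # number of Unicode code points

    def encode(self, text: str) -> list[int]:
        return [ord(ch) for ch in text]


def smallest_uint_dtype(vocab_size: int) -> np.dtype:
    """Return the narrowest unsigned dtype that can hold IDs 0..vocab_size-1."""
    for dt in (np.uint8, np.uint16, np.uint32, np.uint64):
        if vocab_size - 1 <= np.iinfo(dt).max:
            return np.dtype(dt)
    raise ValueError("vocab_size is too large for uint64")


def encode_in_memory(text: str, tokenizer: Tokenizer, vocab_size: int) -> np.ndarray:
    """In-memory alternative to the memmap step, for datasets that fit in RAM."""
    return np.asarray(tokenizer.encode(text), dtype=smallest_uint_dtype(vocab_size))
```

For example, a vocabulary of up to 65,536 tokens fits in uint16, while a larger one needs uint32, which doubles the size of the stored ID files.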

Key concepts

BPE Tokenization
Data Preprocessing
Memory Mapping
Token ID Sequence
Language Model Training