LLM Pretraining Workflow

ML & AI · flowchart diagram · MIT

This diagram illustrates a typical pretraining loop for a Language Model, covering data loading, forward pass, loss calculation, backpropagation, optimizat

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
LLM Pretraining · Deep Learning · Machine Learning · Optimization · PyTorch · Training Loop

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    P1["train_ids.npy / valid_ids.npy"] --> P2["get_batch()<br>randomly slice windows x / y"]
    P2 --> P3["model(x)"]
    P3 --> P4["cross_entropy(logits, y)"]
    P4 --> P5["backward()"]
    P5 --> P6["AdamW + cosine scheduler + gradient clipping"]
    P6 --> P7["checkpoint / train_log.jsonl / wandb"]

    classDef train fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    classDef loss fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    classDef out fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    class P1,P2,P3,P5,P6 train;
    class P4 loss;
    class P7 out;

What this diagram shows

The diagram traces one iteration of a language-model pretraining loop. Tokenized data is loaded from train_ids.npy and valid_ids.npy, and get_batch() cuts random windows out of the token stream to form the input x and the target y. The model runs a forward pass (model(x)) to produce logits, and cross_entropy(logits, y) scores them against the targets. backward() computes the gradients, which are then applied by AdamW together with a cosine learning-rate scheduler and gradient clipping. The loop also writes checkpoints, a train_log.jsonl log, and wandb metrics so that progress can be monitored and saved.
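The random-window step can be sketched in plain Python. This is a minimal illustration, not the repository's get_batch(): the signature, the batch_size and block_size parameters, and the choice of y as x shifted by one token (the usual next-token-prediction setup) are all assumptions, and a real loop would read the IDs from train_ids.npy and return tensors rather than lists.

```python
import random

def get_batch(ids, batch_size, block_size, rng=random):
    """Sample `batch_size` random windows from a flat sequence of token IDs.

    Hypothetical signature. Returns (x, y), where each row of y is the
    matching row of x shifted one token to the right (assumed setup).
    """
    if len(ids) < block_size + 1:
        raise ValueError("need at least block_size + 1 tokens")
    # Largest start index for which the shifted target window still fits.
    max_start = len(ids) - block_size - 1
    starts = [rng.randint(0, max_start) for _ in range(batch_size)]
    x = [ids[s : s + block_size] for s in starts]
    y = [ids[s + 1 : s + block_size + 1] for s in starts]
    return x, y

# Toy "corpus" of 100 token IDs; a seeded generator keeps the sample repeatable.
ids = list(range(100))
x, y = get_batch(ids, batch_size=4, block_size=8, rng=random.Random(0))
```

Because every window is drawn independently, the same tokens can appear in several batches and some may never be visited in a short run; that is the usual trade-off of sampling windows instead of iterating over the corpus in order.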

When to use it

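Use this workflow when pretraining a language model, or another deep learning model trained on sequential data, from a corpus that has already been tokenized into arrays of token IDs. It suits long training runs that need a stable optimization setup (AdamW, a decaying learning rate, gradient clipping) along with persistent checkpoints and logs.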

How to adapt it for your project

To adapt this workflow, swap the optimizer (e.g., SGD or Adagrad), change the learning-rate schedule (e.g., linear decay or step decay), add data augmentation at the batching step, or run the loop under distributed training for larger datasets and models.
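As one example of such an adaptation, the sketch below puts a cosine schedule and a linear-decay schedule behind the same call signature so that either can be selected by name. The function names, the warmup phase, and the numeric settings are illustrative assumptions, not values taken from the source project.

```python
import math

def cosine_lr(step, max_steps, max_lr, min_lr=0.0, warmup_steps=0):
    """Linear warmup, then cosine decay from max_lr down to min_lr."""
    if warmup_steps > 0 and step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def linear_lr(step, max_steps, max_lr, min_lr=0.0, warmup_steps=0):
    """Same warmup, but a straight-line decay instead of a cosine curve."""
    if warmup_steps > 0 and step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return max_lr - (max_lr - min_lr) * progress

# Pick a schedule by name; the training loop only ever calls `schedule(...)`.
SCHEDULES = {"cosine": cosine_lr, "linear": linear_lr}
schedule = SCHEDULES["cosine"]
lrs = [
    schedule(s, max_steps=100, max_lr=3e-4, min_lr=3e-5, warmup_steps=10)
    for s in range(100)
]
```

In a PyTorch loop, the value returned for the current step would typically be written into each optimizer parameter group's "lr" entry before the optimizer step.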

Key concepts

Random Window Sampling
Cross-Entropy Loss
AdamW Optimizer
Cosine Learning Rate Scheduler
Gradient Clipping
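Two of the concepts above, cross-entropy loss and gradient clipping, are small enough to sketch without a framework. The code below is a simplified illustration on plain Python lists, assuming a single token position for the loss and clipping by global L2 norm; the function names and the eps default are assumptions, and in a PyTorch project these roles are normally filled by torch.nn.functional.cross_entropy and torch.nn.utils.clip_grad_norm_.

```python
import math

def cross_entropy(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), one position."""
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_sum_exp = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_sum_exp - logits[target]

def clip_grad_norm(grads, max_norm, eps=1e-6):
    """Rescale a flat list of gradients so its global L2 norm is at most max_norm.

    Returns (clipped_grads, original_norm). Gradients already inside the
    limit are returned unchanged because the scale factor is capped at 1.
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm

# Uniform logits over 4 tokens: the loss equals log(4).
loss = cross_entropy([0.0, 0.0, 0.0, 0.0], target=2)

# A gradient of norm 5 is scaled down to norm ~1; its direction is preserved.
clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
```

Clipping by the global norm rescales every gradient by the same factor, so the update direction is unchanged and only its length is bounded. In a typical loop the clipping call sits between backward() and the optimizer step.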