This diagram illustrates a typical pretraining loop for a Language Model, covering data loading, forward pass, loss calculation, backpropagation, optimizat
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
P1["train_ids.npy / valid_ids.npy"] --> P2["get_batch()<br>randomly slice x / y windows"]
P2 --> P3["model(x)"]
P3 --> P4["cross_entropy(logits, y)"]
P4 --> P5["backward()"]
P5 --> P6["AdamW + cosine scheduler + gradient clipping"]
P6 --> P7["checkpoint / train_log.jsonl / wandb"]
classDef train fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
classDef loss fill:#fff7ed,stroke:#fb923c,color:#0f172a;
classDef out fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
class P1,P2,P3,P5,P6 train;
class P4 loss;
class P7 out;
The diagram traces one iteration of language model pretraining. Tokenized data is loaded from train_ids.npy and valid_ids.npy, and get_batch() samples random windows to form the input x and the target y. The model runs a forward pass, model(x), and cross_entropy(logits, y) computes the loss. backward() computes the gradients, which are clipped before AdamW updates the parameters under a cosine learning-rate schedule. Checkpoints, train_log.jsonl, and wandb then save and monitor progress.
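The get_batch() step can be sketched as follows. This is a minimal illustration rather than the original implementation: the signature, the `batch_size` and `block_size` parameters, and the `rng` argument are assumptions, and it assumes the usual next-token setup in which y is x shifted one position to the right. A toy array stands in for the contents of train_ids.npy.

```python
import numpy as np

def get_batch(ids, batch_size, block_size, rng):
    """Sample random windows from a 1-D token array (hypothetical signature)."""
    # The exclusive upper bound leaves room for block_size inputs plus one
    # extra token, so the shifted target window never runs past the array.
    starts = rng.integers(0, len(ids) - block_size, size=batch_size)
    x = np.stack([ids[s : s + block_size] for s in starts])
    # Assumption: next-token prediction, so y is x shifted right by one token.
    y = np.stack([ids[s + 1 : s + 1 + block_size] for s in starts])
    return x.astype(np.int64), y.astype(np.int64)

# Toy stand-in for np.load("train_ids.npy"): token i has id i.
ids = np.arange(1000, dtype=np.uint16)
rng = np.random.default_rng(0)
x, y = get_batch(ids, batch_size=4, block_size=8, rng=rng)
```

Each row of x is a contiguous window of the token stream, and the matching row of y holds the token that follows each position, which is what cross_entropy(logits, y) compares the model output against.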
Use this workflow to pretrain a language model, or any deep learning model that processes sequential data, particularly when sequences are long and training needs stabilizing measures such as learning-rate scheduling and gradient clipping.
This workflow can be adapted by swapping the optimizer (e.g., SGD, Adagrad), changing the learning-rate schedule (e.g., linear decay, step decay), adding data augmentation, or using distributed training for larger datasets and models.
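To make the schedule swap concrete, here is a small sketch of a cosine schedule next to a linear-decay alternative. Both functions are hypothetical helpers written for illustration; the names, the optional linear warmup, and the `min_lr` floor are assumptions, not details taken from the diagram.

```python
import math

def cosine_lr(step, max_steps, max_lr, min_lr=0.0, warmup_steps=0):
    """Cosine decay from max_lr to min_lr, with optional linear warmup."""
    if step < warmup_steps:
        # Linear warmup: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def linear_lr(step, max_steps, max_lr, min_lr=0.0, warmup_steps=0):
    """Linear decay with the same interface, so the two can be swapped."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    if step >= max_steps:
        return min_lr
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return max_lr - (max_lr - min_lr) * progress

# Compare the two schedules at a few points of a 100-step run.
for step in (0, 25, 50, 75, 100):
    print(step, round(cosine_lr(step, 100, 1e-3), 6), round(linear_lr(step, 100, 1e-3), 6))
```

Because both functions share one signature, the training loop can take the schedule as a parameter and set the optimizer's learning rate from it at each step.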