MicroLM Transformer Language Model Forward Pass

ML & AI · flowchart diagram · MIT

Illustrates the forward pass of the MicroLM Transformer language model, from token embedding through eight pre-norm blocks (RMSNorm, multi-head self-attention, SwiGLU FFN) to the output logits.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
LLM · Transformer · Deep Learning · Neural Network · Forward Pass · MicroLM · AI Model

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f8fafc", "primaryBorderColor": "#94a3b8", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    I["input_ids<br>(B, T)"] --> E["Embedding<br>(B, T, d_model)"]
    E --> B["TransformerBlock × 8"]
    B --> N1["RMSNorm"]
    N1 --> A["MultiHeadSelfAttention<br>Q/K/V · RoPE · causal mask"]
    A --> R1["Residual Add"]
    R1 --> N2["RMSNorm"]
    N2 --> F["SwiGLU FFN"]
    F --> R2["Residual Add"]
    R2 --> FN["Final RMSNorm"]
    FN --> H["lm_head"]
    H --> L["logits<br>(B, T, vocab_size)"]

    classDef input fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef core fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
    classDef attn fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    classDef ffn fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    classDef out fill:#fdf4ff,stroke:#d946ef,color:#0f172a;
    class I,E input;
    class B,N1,N2,R1,R2,FN core;
    class A attn;
    class F ffn;
    class H,L out;

What this diagram shows

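This flowchart traces the forward pass of the MicroLM Transformer language model. Token ids of shape (B, T) are looked up in the embedding table to give activations of shape (B, T, d_model). These pass through 8 Transformer blocks; the diagram draws the contents of one block inline, from the first RMSNorm to the second Residual Add. Each block is pre-norm: RMSNorm precedes MultiHeadSelfAttention (Q/K/V projections, RoPE, causal mask), whose output is added back to the block input, and a second RMSNorm precedes the SwiGLU feed-forward network, whose output is added back in the same way. The skip edges that feed the two Residual Add nodes are not drawn. After the last block, a final RMSNorm and lm_head produce logits of shape (B, T, vocab_size).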
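The per-block flow described above can be sketched in plain Python. This is an illustrative sketch, not MicroLM's code: the function names, the `eps` value, and the list-of-vectors layout (one sequence, no batch dimension) are all assumptions, and `attn` and `ffn` are placeholder callables standing in for the real sublayers.

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm for one token vector: rescale to unit root-mean-square,
    then apply a per-feature gain. eps is an assumed default."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]

def add(xs, ys):
    """Residual add over two sequences of token vectors."""
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def pre_norm_block(x, attn, ffn, norm1_w, norm2_w):
    """One block as drawn: RMSNorm -> attention -> add, RMSNorm -> FFN -> add.
    attn and ffn are hypothetical callables mapping a list of token
    vectors to a list of token vectors of the same shape."""
    h = add(x, attn([rms_norm(tok, norm1_w) for tok in x]))
    h = add(h, ffn([rms_norm(tok, norm2_w) for tok in h]))
    return h

def forward_hidden(x, blocks, final_norm_w):
    """Run every block in order, then the final RMSNorm (lm_head omitted)."""
    for attn, ffn, norm1_w, norm2_w in blocks:
        x = pre_norm_block(x, attn, ffn, norm1_w, norm2_w)
    return [rms_norm(tok, final_norm_w) for tok in x]
```

Because each sublayer only adds to the running activations, a sublayer that returns zeros leaves its input unchanged, which makes the skip path easy to check in isolation.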

When to use it

Use this diagram to understand the data flow of modern Transformer-based language models, especially those that combine RoPE, SwiGLU, and pre-norm RMSNorm. It is a useful reference when tracing an LLM inference path, implementing the layers yourself, or preparing for LoRA integration.

How to adapt it for your project

Adapt this diagram to other Transformer variants by changing the number of blocks, modifying the attention mechanism (e.g., adding cross-attention), or swapping the FFN type. You can also expand the MultiHeadSelfAttention or SwiGLU FFN node into its internal steps, or add KV cache steps to show optimized inference.
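If you expand the SwiGLU FFN node, the steps to draw are a gate projection passed through SiLU, an up projection, their elementwise product, and a down projection. A minimal sketch for one token vector follows; the weight names `w_gate`, `w_up`, and `w_down` are hypothetical, and the sketch assumes the common bias-free formulation rather than MicroLM's exact layer.

```python
import math

def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward for one token vector.
    w_gate, w_up: d_ff rows of length d_model; w_down: d_model rows of length d_ff."""
    gate = [silu(z) for z in matvec(w_gate, x)]   # gating branch
    up = matvec(w_up, x)                          # linear branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gate
    return matvec(w_down, hidden)                 # project back to d_model
```

In a diagram this becomes two parallel edges out of the second RMSNorm that meet at a multiply node before the down projection.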

Key concepts

Transformer Architecture
Multi-Head Self-Attention
Rotary Position Embeddings (RoPE)
SwiGLU FFN
RMSNorm
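Of the concepts listed, RoPE and the causal mask appear in the diagram only as labels on the attention node. The sketch below shows both for a single attention head. It assumes the conventional base of 10000 and an interleaved pairing of features, neither of which is confirmed by the source, and all function names are illustrative.

```python
import math

def rope(x, pos, base=10000.0):
    """Rotate each feature pair (x[i], x[i+1]) by an angle proportional to pos.
    Assumes an even-length vector and interleaved pairs."""
    d = len(x)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)   # lower pairs rotate faster
        c, s = math.cos(theta), math.sin(theta)
        out += [x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c]
    return out

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def causal_attention_weights(qs, ks):
    """Softmax attention weights for one head; position t sees positions 0..t only."""
    d = len(qs[0])
    weights = []
    for t, q in enumerate(qs):
        q_rot = rope(q, t)
        scores = [dot(q_rot, rope(k, s)) / math.sqrt(d)
                  for s, k in enumerate(ks[: t + 1])]
        m = max(scores)                              # subtract max for stability
        exps = [math.exp(v - m) for v in scores]
        z = sum(exps)
        # Future positions receive exactly zero weight (the causal mask).
        weights.append([e / z for e in exps] + [0.0] * (len(ks) - t - 1))
    return weights
```

Because both queries and keys are rotated, the score between positions t and s depends only on the offset t - s, which is the property that lets RoPE encode relative position without a separate position table.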