MicroLM Transformer Forward Path

ML & AI · flowchart diagram · MIT

Details the forward pass of a MicroLM Transformer model, showing the flow from input_ids through Embedding, Transformer Blocks, MultiHeadSelfAttention, Swi

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
Transformer, LLM, Deep Learning, Neural Network, Forward Pass, MicroLM, Attention Mechanism

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f8fafc", "primaryBorderColor": "#94a3b8", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    I["input_ids(B, T)"] --> E["Embedding<br>(B, T, d_model)"]
    E --> B["TransformerBlock × 8"]
    B --> N1["RMSNorm"]
    N1 --> A["MultiHeadSelfAttention<br>Q/K/V · RoPE · causal mask"]
    A --> R1["Residual Add"]
    R1 --> N2["RMSNorm"]
    N2 --> F["SwiGLU FFN"]
    F --> R2["Residual Add"]
    R2 --> FN["Final RMSNorm"]
    FN --> H["lm_head"]
    H --> L["logits<br>(B, T, vocab_size)"]

    classDef input fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef core fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
    classDef attn fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    classDef ffn fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    classDef out fill:#fdf4ff,stroke:#d946ef,color:#0f172a;
    class I,E input;
    class B,N1,N2,R1,R2,FN core;
    class A attn;
    class F ffn;
    class H,L out;

What this diagram shows

This flowchart traces the forward pass of the MicroLM Transformer model. Token IDs (input_ids, shape (B, T)) are mapped by an Embedding layer to activations of shape (B, T, d_model), which then pass through 8 Transformer Blocks. The nodes drawn after "TransformerBlock × 8" expand the inside of a single block: RMSNorm, MultiHeadSelfAttention (Q/K/V projections, RoPE, causal mask), a residual add, a second RMSNorm, a SwiGLU feed-forward network, and a second residual add. After the last block, a final RMSNorm and lm_head produce logits of shape (B, T, vocab_size).
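The order of operations above can be sketched in a few lines of Python. This is a minimal wiring sketch, not MicroLM's code: every layer is a hypothetical stand-in callable acting on a single activation vector (real attention mixes information across the T positions), and it assumes each "Residual Add" adds the sublayer output back onto that sublayer's input, since the skip edges are not drawn in the diagram.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]  # stand-in for any layer in the diagram


def residual_add(skip: Vector, branch: Vector) -> Vector:
    """Element-wise sum of the skip path and the sublayer output."""
    return [s + b for s, b in zip(skip, branch)]


def transformer_block(x: Vector, norm1: Layer, attn: Layer,
                      norm2: Layer, ffn: Layer) -> Vector:
    """One block, in the diagram's order (assumed skip connections)."""
    x = residual_add(x, attn(norm1(x)))  # RMSNorm -> attention -> Residual Add
    x = residual_add(x, ffn(norm2(x)))   # RMSNorm -> SwiGLU FFN -> Residual Add
    return x


def forward(x: Vector, blocks: Sequence[Layer],
            final_norm: Layer, lm_head: Layer) -> Vector:
    """Run the stacked blocks, then Final RMSNorm and lm_head."""
    for block in blocks:
        x = block(x)
    return lm_head(final_norm(x))


# Toy stand-ins, chosen only to make the wiring visible.
identity: Layer = lambda v: list(v)
double: Layer = lambda v: [2.0 * a for a in v]


def toy_block(v: Vector) -> Vector:
    return transformer_block(v, identity, double, identity, identity)


# Eight stacked blocks, as in the diagram.
out = forward([1.0, -1.0], [toy_block] * 8, identity, identity)
```

With these stand-ins each block multiplies its input by 6 (x becomes 3x after the first add and 6x after the second), which makes it easy to check that both residual paths are wired as intended.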

When to use it

Use this diagram to follow the order of operations in a compact Transformer language model such as MicroLM. It suits readers learning the forward path shared by training and inference, specifically how input tokens become output logits through normalization, attention, and feed-forward layers.

How to adapt it for your project

To adapt this diagram, change the number of Transformer Blocks, swap in a different attention mechanism, or expand the internals of RoPE or SwiGLU into their own nodes. You can also annotate every stage with its tensor shape, or add components such as a KV cache for inference optimization.
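If the diagram is regenerated often, one option is to emit the Mermaid source from a small script. The sketch below is an illustration only: the function name forward_path_mermaid, its parameters, and the dotted "KV cache" node are hypothetical and not part of the MicroLM repository. It reproduces the node chain from the source above (without the classDef styling) and parameterizes the block count.

```python
def forward_path_mermaid(n_blocks: int = 8, with_kv_cache: bool = False) -> str:
    """Return Mermaid flowchart source for the forward path (hypothetical helper)."""
    if n_blocks < 1:
        raise ValueError("n_blocks must be at least 1")
    lines = [
        "flowchart TB",
        '    I["input_ids(B, T)"] --> E["Embedding<br>(B, T, d_model)"]',
        f'    E --> B["TransformerBlock × {n_blocks}"]',
        '    B --> N1["RMSNorm"]',
        '    N1 --> A["MultiHeadSelfAttention<br>Q/K/V · RoPE · causal mask"]',
        '    A --> R1["Residual Add"]',
        '    R1 --> N2["RMSNorm"]',
        '    N2 --> F["SwiGLU FFN"]',
        '    F --> R2["Residual Add"]',
        '    R2 --> FN["Final RMSNorm"]',
        '    FN --> H["lm_head"]',
        '    H --> L["logits<br>(B, T, vocab_size)"]',
    ]
    if with_kv_cache:
        # Assumed extension: a cache node linked to attention by a dotted edge.
        lines.append('    KV[("KV cache")] -.-> A')
    return "\n".join(lines)


# Example: a 12-block variant with the optional cache node.
source = forward_path_mermaid(n_blocks=12, with_kv_cache=True)
```

Paste the returned string into any Mermaid renderer, and append the classDef and class lines from the original source if you want the same colors.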

Key concepts

Transformer Architecture
Multi-Head Self-Attention
RMSNorm
SwiGLU FFN
Rotary Positional Embeddings (RoPE)
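
For readers new to these terms, the three less familiar components can be sketched in plain Python on single vectors. These follow the commonly published formulations and are assumptions about the general technique, not MicroLM's implementation: the eps value, the bias-free projections, the RoPE base of 10000, and the interleaved pairing of dimensions are all illustrative choices (some implementations pair the first and second halves of the vector instead). In a real model RoPE is applied to the query and key vectors of each attention head.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # row-major: one row per output feature


def rms_norm(x: Vector, gain: Vector, eps: float = 1e-6) -> Vector:
    """Divide x by its root mean square, then scale by a learned gain."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gain)]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Matrix-vector product without bias."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def silu(v: float) -> float:
    """SiLU (swish): v * sigmoid(v), written to avoid overflow in exp."""
    if v >= 0:
        return v / (1.0 + math.exp(-v))
    e = math.exp(v)
    return v * e / (1.0 + e)


def swiglu_ffn(x: Vector, w_gate: Matrix, w_up: Matrix, w_down: Matrix) -> Vector:
    """SwiGLU feed-forward: down(silu(gate(x)) * up(x))."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


def rope(x: Vector, position: int, base: float = 10000.0) -> Vector:
    """Rotate each pair (x[i], x[i+1]), i even, by position * base**(-i/d)."""
    d = len(x)
    if d % 2:
        raise ValueError("RoPE needs an even number of dimensions")
    out: Vector = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        out.extend([x[i] * c - x[i + 1] * s, x[i] * s + x[i + 1] * c])
    return out
```

Two properties are worth checking by hand: rms_norm output has a mean square close to 1 when the gain is all ones, and rope only rotates pairs, so it leaves the vector's length unchanged.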