Illustrates the internal forward pass of a MicroLM Transformer model, detailing components like embedding, multi-head attention, SwiGLU FFN, and RMSNorm.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f8fafc", "primaryBorderColor": "#94a3b8", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
I["input_ids(B, T)"] --> E["Embedding<br>(B, T, d_model)"]
E --> B["TransformerBlock × 8"]
B --> N1["RMSNorm"]
N1 --> A["MultiHeadSelfAttention<br>Q/K/V · RoPE · causal mask"]
A --> R1["Residual Add"]
B -. skip .-> R1
R1 --> N2["RMSNorm"]
N2 --> F["SwiGLU FFN"]
F --> R2["Residual Add"]
R1 -. skip .-> R2
R2 --> FN["Final RMSNorm"]
FN --> H["lm_head"]
H --> L["logits<br>(B, T, vocab_size)"]
classDef input fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef core fill:#f8fafc,stroke:#94a3b8,color:#0f172a;
classDef attn fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
classDef ffn fill:#fff7ed,stroke:#fb923c,color:#0f172a;
classDef out fill:#fdf4ff,stroke:#d946ef,color:#0f172a;
class I,E input;
class B,N1,N2,R1,R2,FN core;
class A attn;
class F ffn;
class H,L out;
This flowchart traces the forward pass of the MicroLM Transformer language model. The input_ids tensor of shape (B, T), batch size by sequence length, is embedded into (B, T, d_model) and passed through 8 Transformer blocks. Each block is pre-norm: an RMSNorm feeds MultiHeadSelfAttention (Q/K/V projections, RoPE, and a causal mask), and the attention output is added back to the block input. A second RMSNorm then feeds a SwiGLU feed-forward network, followed by another residual add. After the last block, a final RMSNorm and the lm_head produce logits of shape (B, T, vocab_size), one vector of unnormalized next-token scores per position.
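The path described above can be sketched in a few dozen lines of NumPy. This is a minimal illustration, not MicroLM's actual code: the parameter names (wq, w_gate, and so on), the tiny dimensions, the random initialization, the bias-free projections, and the "rotate-half" RoPE convention are all assumptions chosen to keep the example self-contained. Only the layer order and the block count of 8 come from the diagram.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Divide by the root-mean-square of the features, then apply a learned gain.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * weight

def rope(x, base=10000.0):
    # x: (B, H, T, Dh). Rotate feature pairs (i, i + Dh/2) by a position-dependent angle.
    T, Dh = x.shape[-2], x.shape[-1]
    half = Dh // 2
    inv_freq = base ** (-np.arange(half) / half)
    angles = np.arange(T)[:, None] * inv_freq[None, :]  # (T, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(x, p, n_heads):
    B, T, D = x.shape
    Dh = D // n_heads
    split = lambda t: t.reshape(B, T, n_heads, Dh).transpose(0, 2, 1, 3)  # -> (B, H, T, Dh)
    q, k, v = split(x @ p["wq"]), split(x @ p["wk"]), split(x @ p["wv"])
    q, k = rope(q), rope(k)  # positions are injected into Q and K only
    scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(Dh)  # (B, H, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(future, -np.inf, scores)  # causal mask: no attention to later tokens
    out = (softmax(scores) @ v).transpose(0, 2, 1, 3).reshape(B, T, D)
    return out @ p["wo"]

def swiglu_ffn(x, p):
    gate = x @ p["w_gate"]
    silu = gate / (1.0 + np.exp(-gate))  # SiLU(g) = g * sigmoid(g)
    return (silu * (x @ p["w_up"])) @ p["w_down"]

def transformer_block(x, p, n_heads):
    x = x + attention(rms_norm(x, p["norm1"]), p, n_heads)  # RMSNorm -> attention -> residual add
    x = x + swiglu_ffn(rms_norm(x, p["norm2"]), p)          # RMSNorm -> SwiGLU FFN -> residual add
    return x

def forward(input_ids, params, n_heads):
    x = params["embed"][input_ids]  # (B, T) -> (B, T, d_model)
    for p in params["blocks"]:
        x = transformer_block(x, p, n_heads)
    x = rms_norm(x, params["final_norm"])
    return x @ params["lm_head"]  # (B, T, vocab_size)

def init_params(vocab_size, d_model, d_ff, n_layers, seed=0):
    # Hypothetical random initialization, only to make the sketch runnable.
    rng = np.random.default_rng(seed)
    w = lambda *shape: rng.normal(0.0, 0.02, size=shape)
    blocks = [{
        "norm1": np.ones(d_model), "norm2": np.ones(d_model),
        "wq": w(d_model, d_model), "wk": w(d_model, d_model),
        "wv": w(d_model, d_model), "wo": w(d_model, d_model),
        "w_gate": w(d_model, d_ff), "w_up": w(d_model, d_ff), "w_down": w(d_ff, d_model),
    } for _ in range(n_layers)]
    return {"embed": w(vocab_size, d_model), "blocks": blocks,
            "final_norm": np.ones(d_model), "lm_head": w(d_model, vocab_size)}

params = init_params(vocab_size=50, d_model=16, d_ff=32, n_layers=8)
input_ids = np.array([[1, 2, 3, 4], [5, 6, 7, 0]])  # (B=2, T=4)
logits = forward(input_ids, params, n_heads=4)
print(logits.shape)  # (2, 4, 50)
```

Because of the causal mask, changing a later token leaves the logits at all earlier positions unchanged, which is a quick sanity check for any implementation of this diagram.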
Use this diagram to understand the data flow of modern decoder-only Transformer language models that combine RoPE, SwiGLU, and pre-norm RMSNorm. It is useful when tracing an LLM's inference path, implementing custom layers, or deciding where to attach LoRA adapters (typically the attention and FFN projection layers).
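For readers using the diagram to plan LoRA integration, the idea is to leave a projection's weight frozen and add a trainable low-rank update beside it. The sketch below is a generic illustration of that technique, not MicroLM code: the class name LoRALinear, the rank, the alpha value, and the initialization scale are all hypothetical.

```python
import numpy as np

class LoRALinear:
    """Frozen projection plus a low-rank update: y = x W + (alpha / r) * (x A) B."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = weight.shape
        self.weight = weight                       # frozen base weight, never updated
        self.a = rng.normal(0.0, 0.01, size=(d_in, rank))  # trainable down-projection
        self.b = np.zeros((rank, d_out))           # trainable up-projection, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.weight + self.scale * (x @ self.a) @ self.b

rng = np.random.default_rng(1)
w_q = rng.normal(size=(16, 16))   # stands in for a Q projection inside the attention node
x = rng.normal(size=(2, 4, 16))   # (B, T, d_model)
layer = LoRALinear(w_q)

# With B initialized to zero, the adapted layer starts out identical to the base layer.
print(np.allclose(layer(x), x @ w_q))  # True
```

In terms of the diagram, such a wrapper would replace individual projections inside the MultiHeadSelfAttention or SwiGLU FFN nodes; the norms, residual adds, and overall flow stay as drawn.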
This diagram can be adapted to other Transformer variants by changing the number of blocks, modifying the attention mechanism (e.g., adding cross-attention), or swapping the FFN type. You could also expand the MultiHeadSelfAttention or SwiGLU FFN nodes to show their internal structure, or add KV-cache steps to show how inference is optimized.
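As an illustration of the KV-cache extension, the sketch below decodes one token at a time, storing each step's key and value so that earlier positions are never recomputed. It is a simplified single-head version under stated assumptions: RoPE and the output projection are omitted, and the KVCache class and weight names are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(x, wq, wk, wv):
    # Reference path: process all T positions at once with a causal mask. x: (T, d)
    q, k, v = x @ wq, x @ wk, x @ wv
    T = x.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    return softmax(np.where(future, -np.inf, scores)) @ v

class KVCache:
    """Stores keys and values of past tokens for one attention head."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x_t, wq, wk, wv):
        # x_t: (d,) is the newest token only. Project it once and append to the cache.
        self.keys.append(x_t @ wk)
        self.values.append(x_t @ wv)
        K, V = np.stack(self.keys), np.stack(self.values)  # (t + 1, d) each
        q = x_t @ wq
        # No mask needed: the cache contains only the current and earlier positions.
        weights = softmax(K @ q / np.sqrt(q.shape[-1]))
        return weights @ V

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(5, d))
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

cache = KVCache()
stepped = np.stack([cache.step(x_t, wq, wk, wv) for x_t in x])
print(np.allclose(stepped, full_causal_attention(x, wq, wk, wv)))  # True
```

In the flowchart, this would appear as a cache node beside MultiHeadSelfAttention that is written once per generated token and read by every later step.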