MicroLM Project Global Overview: Self-Developed & Qwen Pipelines

ML & AI · flowchart diagram · MIT

Detailed overview of the MicroLM project, showcasing two parallel LLM development pipelines: a self-developed TransformerLM and a Qwen-based fine-tuning an

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
LLM Fine-tuning LoRA Pretraining Data Processing vLLM Qwen

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b", "clusterBkg": "#f8fafc", "clusterBorder": "#cbd5e1"}}}%%
flowchart TB
    subgraph A["Track A+B: Self-developed MicroLM"]
      direction TB
      A1["Raw Chinese corpus<br>~1.41M entries"] --> A2["prepare_pretrain_jsonl.py<br>clean · deduplicate · split · EOS"]
      A2 --> A3["bpe.py<br>byte-level BPE · vocab=6400"]
      A3 --> A4["tokenizer.py / tokenization<br>text → token ids → memmap"]
      A4 --> A5["train_pretrain.py<br>TransformerLM pretraining · 31.7M"]
      A5 --> A6["train_sft.py<br>full-parameter SFT"]
      A5 --> A7["lora.py + LoRA training<br>0.83% trainable parameters"]
      A6 --> A8["Inference and system integration<br>generate_text · chat.py · KV Cache"]
      A7 --> A8
    end

    subgraph B["Track C+D: Qwen transfer and structured output"]
      direction TB
      B1["InstructIE raw data<br>171K · 12 topics"] --> B2["01~06 data pipeline<br>normalize → filter → stratify → derive → sample → convert"]
      B2 --> B3["train_qwen_lora.py<br>Qwen2.5-1.5B + PEFT LoRA"]
      B3 --> B4["run_instructie_eval.py<br>4 models × 40 prompts × 4 metrics"]
      B4 --> B5["export_final_model.py<br>merge adapter → HF directory"]
      B5 --> B6["vLLM deployment and validation<br>serve · smoke · benchmark · stability"]
    end

    A8 -. "converge at evaluation stage" .-> B4

    classDef self fill:#eff6ff,stroke:#60a5fa,stroke-width:1.4px,color:#0f172a;
    classDef qwen fill:#f0fdf4,stroke:#22c55e,stroke-width:1.4px,color:#0f172a;
    class A1,A2,A3,A4,A5,A6,A7,A8 self;
    class B1,B2,B3,B4,B5,B6 qwen;

What this diagram shows

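This diagram shows two related LLM development pipelines. The self-developed MicroLM track starts from a raw Chinese corpus of about 1.41 million entries, which is cleaned, deduplicated, split, and given EOS markers. It then trains a byte-level BPE tokenizer (vocabulary size 6400), converts text to memory-mapped token ids, pretrains a 31.7M-parameter TransformerLM, and fine-tunes it with either full-parameter SFT or LoRA (0.83% trainable parameters), ending in inference via generate_text, chat.py, and a KV cache. The Qwen track passes the InstructIE raw data (171K, 12 topics) through a six-step data pipeline, fine-tunes Qwen2.5-1.5B with PEFT LoRA, evaluates 4 models on 40 prompts with 4 metrics, merges the adapter into a Hugging Face model directory, and deploys and validates the result with vLLM. The dotted edge marks where the self-developed track joins the Qwen track at the evaluation stage, so model variants from both are compared there.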
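The byte-level BPE step can be illustrated with a minimal sketch. This is not the project's bpe.py; the function names (`train_byte_bpe`, `merge_pair`, `encode`, `decode`) are hypothetical, and special tokens and pre-tokenization are left out. It only shows the core idea: start from the 256 byte values and repeatedly merge the most frequent adjacent pair into a new token id.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_byte_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`."""
    ids = list(text.encode("utf-8"))  # base vocabulary: byte values 0..255
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    for step in range(num_merges):
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break  # fewer than two tokens left, nothing to merge
        best = max(pairs, key=pairs.get)  # ties go to the pair seen first
        new_id = 256 + step
        merges[best] = new_id
        ids = merge_pair(ids, best, new_id)
    return merges, ids


def encode(text, merges):
    """Apply the learned merges, in learned order, to new text."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token ids back to bytes, then to text."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8")


if __name__ == "__main__":
    merges, ids = train_byte_bpe("aaabdaaabac", num_merges=3)
    print(merges)
    print(ids)
    print(decode(ids, merges))
```

Under this scheme a vocabulary of 6400 would correspond to 6144 merges on top of the 256 byte values, assuming no ids are reserved for special tokens; the diagram does not say how the project allocates them. Because the base vocabulary is bytes, any UTF-8 text (including Chinese) can be encoded without an unknown token.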

When to use it

Use this diagram when planning or reviewing an end-to-end LLM development workflow, particularly a project that both trains a small model from scratch and fine-tunes an existing open-source model. It is useful for comparing the two routes stage by stage: data preparation, training, fine-tuning, evaluation, and deployment.

How to adapt it for your project

To adapt this diagram, change the initial data sources, replace the self-developed TransformerLM with a different custom architecture, or substitute another base LLM (e.g., Llama, Mistral) for Qwen. Other fine-tuning techniques (e.g., QLoRA, P-tuning) can take the place of LoRA. The evaluation metrics and deployment targets (e.g., TGI, OpenAI API) can also be changed to fit the project's requirements.
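When the base model or the LoRA rank changes, the share of trainable parameters changes with it. The sketch below shows that arithmetic under stated assumptions: each adapted weight matrix of shape (d_out, d_in) gets two low-rank factors with rank * (d_in + d_out) parameters in total, and the fraction is measured against base plus adapter parameters. The layer shapes in the example are hypothetical and are not taken from MicroLM, so the result is not expected to match the 0.83% shown in the diagram.

```python
def lora_param_count(d_in, d_out, rank):
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def trainable_fraction(base_params, adapted_shapes, rank):
    """Share of trainable parameters when only the LoRA factors are trained.

    base_params: number of frozen parameters in the base model.
    adapted_shapes: list of (d_in, d_out) for each adapted weight matrix.
    """
    lora_params = sum(lora_param_count(d_in, d_out, rank)
                      for d_in, d_out in adapted_shapes)
    return lora_params / (base_params + lora_params)


if __name__ == "__main__":
    # Hypothetical setup: 8 layers, two 512x512 projections adapted per layer.
    shapes = [(512, 512)] * (8 * 2)
    for rank in (4, 8, 16):
        frac = trainable_fraction(31_700_000, shapes, rank)
        print(f"rank={rank}: {frac:.2%} trainable")
```

Doubling the rank roughly doubles the adapter size, so this kind of estimate is a quick way to pick a rank before committing to a training run on a new base model.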

Key concepts