Detailed overview of the MicroLM project, showcasing two parallel LLM development pipelines: a self-developed TransformerLM and a Qwen-based fine-tuning an
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b", "clusterBkg": "#f8fafc", "clusterBorder": "#cbd5e1"}}}%%
flowchart TB
subgraph A["Tracks A+B: Self-developed MicroLM"]
direction TB
A1["Raw Chinese corpus<br>~1.41M entries"] --> A2["prepare_pretrain_jsonl.py<br>clean · dedupe · split · EOS"]
A2 --> A3["bpe.py<br>byte-level BPE · vocab=6400"]
A3 --> A4["tokenizer.py / tokenization<br>text → token ids → memmap"]
A4 --> A5["train_pretrain.py<br>TransformerLM pretraining · 31.7M"]
A5 --> A6["train_sft.py<br>full-parameter SFT"]
A5 --> A7["lora.py + LoRA training<br>0.83% trainable parameters"]
A6 --> A8["Inference and system integration<br>generate_text · chat.py · KV Cache"]
A7 --> A8
end
subgraph B["Tracks C+D: Qwen transfer and structured output"]
direction TB
B1["InstructIE raw data<br>171K · 12 topics"] --> B2["01~06 data pipeline<br>normalize → filter → stratify → derive → sample → convert"]
B2 --> B3["train_qwen_lora.py<br>Qwen2.5-1.5B + PEFT LoRA"]
B3 --> B4["run_instructie_eval.py<br>4 models × 40 prompts × 4 metrics"]
B4 --> B5["export_final_model.py<br>merge adapter → HF directory"]
B5 --> B6["vLLM deployment and validation<br>serve · smoke · benchmark · stability"]
end
A8 -. "Converge at evaluation stage" .-> B4
classDef self fill:#eff6ff,stroke:#60a5fa,stroke-width:1.4px,color:#0f172a;
classDef qwen fill:#f0fdf4,stroke:#22c55e,stroke-width:1.4px,color:#0f172a;
class A1,A2,A3,A4,A5,A6,A7,A8 self;
class B1,B2,B3,B4,B5,B6 qwen;
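This diagram illustrates two distinct but related Large Language Model (LLM) development pipelines. The 'Self-developed MicroLM' line covers data preparation (cleaning, deduplication, splitting, EOS insertion), byte-level BPE tokenization with a 6,400-entry vocabulary, pre-training a 31.7M-parameter TransformerLM, and then either full-parameter SFT or LoRA fine-tuning (0.83% trainable parameters), ending in inference with a KV cache. The 'Qwen Migration' line covers a six-step data pipeline over the InstructIE raw data (171K, 12 topics), LoRA fine-tuning of Qwen2.5-1.5B with PEFT, an evaluation of 4 models on 40 prompts with 4 metrics, merging the adapter into a Hugging Face model directory, and vLLM deployment with smoke, benchmark, and stability checks. The self-developed line feeds into the Qwen line's evaluation stage, where the model variants are compared side by side.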
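To make the byte-level BPE step concrete, here is a minimal sketch of how such a tokenizer can be trained and applied. It is not the project's `bpe.py`; the function names (`train_byte_bpe`, `merge_pair`, `encode`, `decode`) and the small vocabulary size used in the usage note are illustrative assumptions, and the only detail taken from the diagram is that the base alphabet is raw bytes.

```python
from collections import Counter


def merge_pair(ids, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` in `ids` with `new_id`."""
    out = []
    i = 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_byte_bpe(text, vocab_size):
    """Learn merges on UTF-8 bytes until the vocabulary reaches `vocab_size`.

    Hypothetical helper: ids 0..255 are raw bytes, learned merges get ids >= 256.
    """
    ids = list(text.encode("utf-8"))
    merges = {}  # (left_id, right_id) -> new_id, in the order learned
    next_id = 256
    while next_id < vocab_size:
        pairs = Counter(zip(ids, ids[1:]))
        if not pairs:
            break
        pair, count = pairs.most_common(1)[0]
        if count < 2:
            break  # no adjacent pair repeats, so further merges would not help
        ids = merge_pair(ids, pair, next_id)
        merges[pair] = next_id
        next_id += 1
    return merges


def encode(text, merges):
    """Turn text into token ids by replaying the merges in learned order."""
    ids = list(text.encode("utf-8"))
    for pair, new_id in merges.items():
        ids = merge_pair(ids, pair, new_id)
    return ids


def decode(ids, merges):
    """Expand token ids back into bytes and decode them as UTF-8."""
    vocab = {i: bytes([i]) for i in range(256)}
    for (left, right), new_id in merges.items():
        vocab[new_id] = vocab[left] + vocab[right]
    return b"".join(vocab[i] for i in ids).decode("utf-8", errors="replace")
```

For example, `merges = train_byte_bpe(corpus_text, 300)` learns at most 44 merges on top of the 256 byte ids, and `decode(encode(s, merges), merges)` returns `s` unchanged for any string `s`. Because the base alphabet is bytes, Chinese text never produces an out-of-vocabulary token; unseen characters simply fall back to their raw UTF-8 bytes.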
Use this diagram when planning or understanding end-to-end LLM development workflows, particularly for projects involving both foundational model training from scratch and fine-tuning existing open-source models. It's useful for comparing different approaches to data preparation, model training, fine-tuning, and deployment strategies for LLMs.
This diagram can be adapted by changing the initial data sources, replacing the self-developed TransformerLM with a different custom architecture, or substituting Qwen with another base LLM (e.g., Llama, Mistral). Other fine-tuning techniques (e.g., QLoRA, P-tuning) can take the place of LoRA. The evaluation metrics and the deployment target (e.g., TGI, or an OpenAI-compatible API endpoint) can also be customized to fit specific project requirements.
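When the base model or the adapter setup changes, the share of trainable parameters (0.83% in the diagram's LoRA step) changes with it. The sketch below shows the arithmetic under stated assumptions: each adapted linear layer of shape `d_in x d_out` gains two low-rank matrices with `rank * (d_in + d_out)` parameters in total, and the fraction is taken over base plus adapter parameters. The helper name and the layer shapes in the usage note are hypothetical and are not meant to reproduce the project's 0.83% figure.

```python
def lora_trainable_fraction(layer_shapes, rank, base_params):
    """Return trainable / total parameters when only LoRA adapters are trained.

    layer_shapes: iterable of (d_in, d_out) for every adapted linear layer.
    rank:         LoRA rank r; each adapter holds A (r x d_in) and B (d_out x r).
    base_params:  number of frozen parameters in the base model.
    """
    lora_params = sum(rank * (d_in + d_out) for d_in, d_out in layer_shapes)
    return lora_params / (base_params + lora_params)
```

As a usage example with made-up dimensions: adapting two 512 x 512 projections in each of 8 layers at rank 8 on a 31.7M-parameter model, `lora_trainable_fraction([(512, 512)] * 16, 8, 31_700_000)` adds 131,072 adapter parameters, which is roughly 0.41% of the total. Raising the rank or adapting more layers increases the fraction linearly.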