This diagram illustrates two parallel workflows for developing and deploying language models: a self-developed MicroLM pipeline and a Qwen-based migration pipeline.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b", "clusterBkg": "#f8fafc", "clusterBorder": "#cbd5e1"}}}%%
flowchart TB
subgraph A["Track A+B: Self-developed MicroLM"]
direction TB
A1["Raw Chinese corpus<br>~1.41M entries"] --> A2["prepare_pretrain_jsonl.py<br>clean · dedupe · split · EOS"]
A2 --> A3["bpe.py<br>byte-level BPE · vocab=6400"]
A3 --> A4["tokenizer.py / tokenization<br>text → token ids → memmap"]
A4 --> A5["train_pretrain.py<br>TransformerLM pretraining · 31.7M"]
A5 --> A6["train_sft.py<br>full-parameter SFT"]
A5 --> A7["lora.py + LoRA training<br>0.83% trainable parameters"]
A6 --> A8["Inference and systematization<br>generate_text · chat.py · KV Cache"]
A7 --> A8
end
subgraph B["Track C+D: Qwen migration and structured output"]
direction TB
B1["InstructIE raw data<br>171K · 12 topics"] --> B2["01~06 data pipeline<br>normalize → filter → stratify → derive → sample → convert"]
B2 --> B3["train_qwen_lora.py<br>Qwen2.5-1.5B + PEFT LoRA"]
B3 --> B4["run_instructie_eval.py<br>4 models × 40 prompts × 4 metrics"]
B4 --> B5["export_final_model.py<br>merge adapter → HF directory"]
B5 --> B6["vLLM deployment and validation<br>serve · smoke · benchmark · stability"]
end
A8 -. "converge at evaluation stage" .-> B4
classDef self fill:#eff6ff,stroke:#60a5fa,stroke-width:1.4px,color:#0f172a;
classDef qwen fill:#f0fdf4,stroke:#22c55e,stroke-width:1.4px,color:#0f172a;
class A1,A2,A3,A4,A5,A6,A7,A8 self;
class B1,B2,B3,B4,B5,B6 qwen;
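The diagram presents two distinct but related pipelines for building and deploying language models. The first, "Self-developed MicroLM" (tracks A+B), starts from a raw Chinese corpus of about 1.41 million entries, which is cleaned, deduplicated, split, and terminated with EOS markers. It then trains a custom byte-level BPE tokenizer (vocabulary size 6400), tokenizes the text into a memmap of token ids, pre-trains the 31.7M TransformerLM, and branches into either full-parameter SFT or LoRA fine-tuning (0.83% trainable parameters). Both branches end in inference and systematization (generate_text, chat.py, KV cache). The second, "Qwen Migration and Structured Output" (tracks C+D), prepares InstructIE data (171K examples, 12 topics) with a six-step data pipeline, fine-tunes Qwen2.5-1.5B with PEFT LoRA, evaluates 4 models on 40 prompts with 4 metrics, merges the adapter into a Hugging Face model directory, and deploys and validates the result with vLLM. Both pipelines emphasize fine-tuning, structured evaluation, and deployment, and share methodologies such as configuration-driven processes and an evaluation-first approach. The dotted edge marks where they converge: the MicroLM inference stage feeds into the Qwen track's evaluation step, run_instructie_eval.py.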
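The stage dependencies described above can be written down as a small graph and sorted into a runnable order. This is a minimal sketch, not the project's orchestration code: the node ids are the ones used in the diagram, and treating the dotted A8 to B4 edge as a hard dependency is an assumption made for illustration.

```python
from graphlib import TopologicalSorter

# Edges copied from the diagram: each key depends on the stages in its set.
# The "A8" entry under "B4" is the dotted evaluation link between the two
# tracks; modelling it as a strict dependency is an assumption.
DEPENDS_ON = {
    "A2": {"A1"},
    "A3": {"A2"},
    "A4": {"A3"},
    "A5": {"A4"},
    "A6": {"A5"},
    "A7": {"A5"},
    "A8": {"A6", "A7"},
    "B2": {"B1"},
    "B3": {"B2"},
    "B4": {"B3", "A8"},
    "B5": {"B4"},
    "B6": {"B5"},
}


def execution_order(depends_on):
    """Return one valid order in which the stages could run."""
    return list(TopologicalSorter(depends_on).static_order())


order = execution_order(DEPENDS_ON)
print(order)
```

Several orders are valid because the two tracks are independent until the evaluation step. The only guarantee is that every stage appears after the stages it depends on.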
This diagram is useful for comparing two approaches: training a custom language model from scratch versus adapting an existing open-source model. It suits projects that involve custom tokenizer development, pre-training, SFT, LoRA fine-tuning, structured evaluation, and efficient model deployment (e.g., with vLLM). It can also serve as a reference for teams laying out an end-to-end LLM engineering workflow, from data preparation to serving.
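To make the "trainable parameters" figure on the LoRA node concrete, here is a rough sketch of the arithmetic behind such a percentage. The layer shapes, rank, and number of adapted layers below are hypothetical placeholders, not MicroLM's actual configuration, so the result is not expected to reproduce the 0.83% shown in the diagram. Counting the adapter weights in the denominator is one common convention and is also an assumption here.

```python
def lora_added_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter on a d_in x d_out linear layer.

    LoRA learns two low-rank matrices: A (rank x d_in) and B (d_out x rank).
    """
    return rank * (d_in + d_out)


def trainable_fraction(base_params: int, adapted_layers, rank: int) -> float:
    """Share of trainable (adapter) weights among all weights, base frozen."""
    added = sum(lora_added_params(d_in, d_out, rank) for d_in, d_out in adapted_layers)
    return added / (base_params + added)


# Hypothetical setup: 8 blocks, two 512 x 512 projections adapted per block.
layers = [(512, 512)] * 16
fraction = trainable_fraction(base_params=31_700_000, adapted_layers=layers, rank=8)
print(f"{fraction:.2%} of parameters are trainable")
```

Because the added count grows linearly with the rank, doubling the rank roughly doubles the trainable share while the base model stays frozen.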
This diagram can be adapted by swapping individual components for alternatives: different pre-training datasets, tokenizer tooling (e.g., SentencePiece), base models (e.g., Llama, Mistral), fine-tuning or alignment techniques (e.g., DPO, PPO), or deployment frameworks (e.g., TGI, Triton Inference Server). The evaluation framework can be extended with more metrics or datasets, and the data pipelines can be customized for different data sources and tasks.
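One way to keep the evaluation step easy to extend is a metric registry scored over every model and prompt, mirroring the "models × prompts × metrics" grid on the evaluation node. The sketch below is illustrative only: the diagram does not name its four metrics, so `valid_json` and `exact_match` are hypothetical stand-ins, and the stub models replace real inference calls.

```python
import json
from itertools import product


def valid_json(output: str, reference: str) -> float:
    """Hypothetical metric: 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


def exact_match(output: str, reference: str) -> float:
    """Hypothetical metric: 1.0 if output equals the reference, ignoring edge whitespace."""
    return 1.0 if output.strip() == reference.strip() else 0.0


# Adding a metric means adding one entry here.
METRICS = {"valid_json": valid_json, "exact_match": exact_match}


def evaluate(models, prompts, metrics):
    """Score every (model, prompt, metric) combination.

    models:  name -> callable(prompt) -> output string
    prompts: list of (prompt, reference) pairs
    """
    scores = {}
    for (name, generate), (idx, (prompt, reference)) in product(
        models.items(), enumerate(prompts)
    ):
        output = generate(prompt)  # one generation per model and prompt
        for metric_name, metric in metrics.items():
            scores[(name, idx, metric_name)] = metric(output, reference)
    return scores


# Stub models standing in for real inference backends.
models = {"echo": lambda p: p, "stub": lambda p: '{"entities": []}'}
prompts = [('{"entities": []}', '{"entities": []}'), ("not json", "{}")]
scores = evaluate(models, prompts, METRICS)
print(len(scores))  # 2 models x 2 prompts x 2 metrics = 8
```

Swapping a base model or serving backend then only changes the callables passed in as `models`, while the scoring loop stays the same.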