MicroLM Development & Qwen Migration Workflow

ML & AI · flowchart diagram · MIT

This diagram illustrates two parallel workflows for developing and deploying language models: a self-developed MicroLM pipeline and a Qwen-based migration pipeline.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
LLM Fine-tuning LoRA Qwen vLLM Machine Learning MLOps

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b", "clusterBkg": "#f8fafc", "clusterBorder": "#cbd5e1"}}}%%
flowchart TB
    subgraph A["Tracks A+B: Self-developed MicroLM"]
      direction TB
      A1["Raw Chinese corpus<br>~1.41M records"] --> A2["prepare_pretrain_jsonl.py<br>clean · dedupe · split · EOS"]
      A2 --> A3["bpe.py<br>byte-level BPE · vocab=6400"]
      A3 --> A4["tokenizer.py / tokenization<br>text → token ids → memmap"]
      A4 --> A5["train_pretrain.py<br>TransformerLM pretraining · 31.7M"]
      A5 --> A6["train_sft.py<br>full-parameter SFT"]
      A5 --> A7["lora.py + LoRA training<br>0.83% trainable parameters"]
      A6 --> A8["Inference and systematization<br>generate_text · chat.py · KV Cache"]
      A7 --> A8
    end

    subgraph B["Tracks C+D: Qwen migration and structured output"]
      direction TB
      B1["InstructIE raw data<br>171K · 12 topics"] --> B2["01~06 data pipeline<br>normalize → filter → stratify → derive → sample → convert"]
      B2 --> B3["train_qwen_lora.py<br>Qwen2.5-1.5B + PEFT LoRA"]
      B3 --> B4["run_instructie_eval.py<br>4 models × 40 prompts × 4 metrics"]
      B4 --> B5["export_final_model.py<br>merge adapter → HF directory"]
      B5 --> B6["vLLM deployment and validation<br>serve · smoke · benchmark · stability"]
    end

    A8 -. "converge at evaluation stage" .-> B4

    classDef self fill:#eff6ff,stroke:#60a5fa,stroke-width:1.4px,color:#0f172a;
    classDef qwen fill:#f0fdf4,stroke:#22c55e,stroke-width:1.4px,color:#0f172a;
    class A1,A2,A3,A4,A5,A6,A7,A8 self;
    class B1,B2,B3,B4,B5,B6 qwen;

What this diagram shows

The diagram presents two distinct but related pipelines for building and deploying language models. The first, "Self-developed MicroLM," starts from a raw Chinese corpus (about 1.41 million records), cleans it, trains a custom byte-level BPE tokenizer (vocabulary size 6,400), pre-trains a 31.7M Transformer LM, and then branches into either full-parameter SFT or LoRA fine-tuning (0.83% of parameters trainable). Both branches feed an inference stage built around generate_text, chat.py, and a KV cache. The second, "Qwen Migration and Structured Output," runs the InstructIE raw data (171K, 12 topics) through a six-step data pipeline, fine-tunes Qwen2.5-1.5B with PEFT LoRA, evaluates 4 models on 40 prompts with 4 metrics, merges the adapter into a Hugging Face model directory, and deploys the result with vLLM. Both pipelines share a configuration-driven, evaluation-first approach, and a dotted edge shows the MicroLM inference stage joining the Qwen track at the evaluation step, so the two tracks are compared in a single evaluation.
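
Both tracks rely on LoRA, and the Qwen track later merges the adapter back into the base weights. The following is a minimal sketch of those two ideas in plain Python, not the repository's lora.py or export_final_model.py. The layer size, rank, and alpha are hypothetical, so the resulting fraction is not meant to reproduce the 0.83% figure in the diagram.

```python
# Illustrative only: dimensions, rank, and alpha are hypothetical values,
# not the actual MicroLM or Qwen2.5-1.5B configuration.

def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (frozen base params, trainable LoRA params) for one weight matrix."""
    base = d_out * d_in
    adapter = rank * (d_in + d_out)  # A is rank x d_in, B is d_out x rank
    return base, adapter


def matmul(x, y):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(w, a, b, alpha: float, rank: int):
    """Fold the adapter into the base weight: W' = W + (alpha / rank) * B @ A."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Parameter arithmetic for a single hypothetical 512 x 512 projection.
base, adapter = lora_param_counts(d_out=512, d_in=512, rank=8)
fraction = adapter / (base + adapter)
print(f"trainable fraction for this layer: {fraction:.2%}")

# Tiny rank-1 merge on a 2 x 2 weight.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]    # 1 x 2
b = [[0.5], [0.0]]  # 2 x 1
merged = merge_lora(w, a, b, alpha=2.0, rank=1)
print(merged)
```

After merging, the model has the same shape as the base model and no longer needs the adapter at inference time, which is why a merged Hugging Face directory can be served directly.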

When to use it

This diagram is useful for comparing two routes to a task-specific language model: training a small model from scratch versus adapting an existing open-source model. It fits projects that involve custom tokenizer development, pre-training, SFT, LoRA fine-tuning, structured evaluation, or model serving (e.g., with vLLM), and it can serve as a template for teams setting up an end-to-end engineering workflow for LLMs.
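
The structured evaluation step in the diagram crosses models, prompts, and metrics. A rough sketch of such a grid is shown below. It is not the repository's run_instructie_eval.py: the two metrics (JSON validity and required keys), the key names, and the stand-in models are all hypothetical.

```python
import json
from itertools import product


def json_valid(output: str) -> float:
    """Hypothetical metric: 1.0 if the output parses as JSON, else 0.0."""
    try:
        json.loads(output)
        return 1.0
    except json.JSONDecodeError:
        return 0.0


def has_required_keys(output: str, keys=("entity", "relation")) -> float:
    """Hypothetical metric: 1.0 if the output is a JSON object with all keys."""
    try:
        obj = json.loads(output)
    except json.JSONDecodeError:
        return 0.0
    return 1.0 if isinstance(obj, dict) and all(k in obj for k in keys) else 0.0


METRICS = {"json_valid": json_valid, "required_keys": has_required_keys}


def evaluate(models, prompts, metrics=METRICS):
    """models maps a name to a callable(prompt) -> str.

    Returns the mean score for every (model, metric) pair.
    """
    totals = {(m, k): 0.0 for m, k in product(models, metrics)}
    for name, generate in models.items():
        for prompt in prompts:
            output = generate(prompt)
            for metric_name, metric in metrics.items():
                totals[(name, metric_name)] += metric(output)
    return {key: total / len(prompts) for key, total in totals.items()}


# Stand-in "models" so the sketch runs without any real checkpoint.
models = {
    "always_json": lambda p: json.dumps({"entity": p, "relation": "none"}),
    "free_text": lambda p: f"The answer to {p} is unclear.",
}
prompts = ["prompt-1", "prompt-2", "prompt-3"]
scores = evaluate(models, prompts)
for (model_name, metric_name), value in sorted(scores.items()):
    print(model_name, metric_name, value)
```

Keeping each metric as a plain function over the output string makes it straightforward to add a metric or a model without changing the loop.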

How to adapt it for your project

This diagram can be adapted by swapping individual components: a different pre-training dataset, tokenizer (e.g., SentencePiece), base model (e.g., Llama, Mistral), post-training method (e.g., DPO, PPO), or serving framework (e.g., TGI, Triton Inference Server). The evaluation stage can be extended with more metrics or datasets, and the data pipelines can be customized for other data sources and tasks.
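
One way to make such swaps explicit is to describe a track as a small configuration object and derive variants from it. The sketch below assumes a hypothetical PipelineConfig class; its field names are illustrative and do not come from the repository.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical description of one track; field names are illustrative."""

    data_source: str
    base_model: str
    tuning_method: str
    serving: str

    def summary(self) -> str:
        """Render the track as a left-to-right chain of stages."""
        return " -> ".join(
            [self.data_source, self.base_model, self.tuning_method, self.serving]
        )


# The Qwen track as drawn in the diagram.
qwen_track = PipelineConfig(
    data_source="InstructIE",
    base_model="Qwen2.5-1.5B",
    tuning_method="PEFT LoRA",
    serving="vLLM",
)

# A variant that swaps three components and keeps the data source.
variant = replace(qwen_track, base_model="Mistral", tuning_method="DPO", serving="TGI")

print(qwen_track.summary())
print(variant.summary())
```

Because the dataclass is frozen, each variant is a new object and the original track definition stays unchanged, which keeps side-by-side comparisons unambiguous.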

Key concepts