Flowchart detailing the MicroLM evaluation process from model checkpoints and prompt sets to human scoring and capability assessment, focusing on dialogue,
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
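A1["pretrain / baseline / lora checkpoint"] --> A2["fixed prompt set"]
A2 --> A3["generate outputs"]
A3 --> A4["human scoring / quality comparison"]
A4 --> A5["capability boundary judgment<br>dialogue awareness · Chinese output · long-output stability · JSON capability"]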
classDef eval fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef out fill:#fff7ed,stroke:#fb923c,color:#0f172a;
class A1,A2,A3,A4 eval;
class A5 out;
The self-developed MicroLM evaluation pipeline runs pretrain, baseline, and LoRA checkpoints on a fixed prompt set, generates outputs, has them scored and compared by human reviewers, and ends in a capability-boundary judgment covering dialogue awareness, Chinese output, long-output stability, and JSON generation.
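The flow in the diagram can be sketched in Python roughly as follows. This is a minimal illustration under stated assumptions, not MicroLM's actual code: the `EvalRecord` structure, the function names, and the idea of passing each checkpoint as a plain `generate(prompt) -> str` callable are all hypothetical. Human scores are left empty so a reviewer can fill them in afterwards.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class EvalRecord:
    """One (checkpoint, prompt, output) triple awaiting human review (hypothetical structure)."""
    checkpoint: str
    prompt: str
    output: str
    # Filled in later by a human reviewer; None means "not yet scored".
    human_score: Optional[int] = None


def run_fixed_prompts(
    checkpoints: Dict[str, Callable[[str], str]],
    prompts: List[str],
) -> List[EvalRecord]:
    """Run every checkpoint on the same fixed prompt set so outputs are directly comparable."""
    records = []
    for name, generate in checkpoints.items():
        for prompt in prompts:
            records.append(EvalRecord(name, prompt, generate(prompt)))
    return records


def mean_score_by_checkpoint(records: List[EvalRecord]) -> Dict[str, float]:
    """Average the human scores per checkpoint, ignoring records that are still unscored."""
    scores: Dict[str, List[int]] = {}
    for record in records:
        if record.human_score is not None:
            scores.setdefault(record.checkpoint, []).append(record.human_score)
    return {name: sum(values) / len(values) for name, values in scores.items()}
```

Keeping the prompt set fixed across checkpoints is what makes the later quality comparison meaningful: any difference in the averaged scores then reflects the checkpoint rather than the prompts.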
To assess fine-tuned or LoRA-based checkpoints against the pretrain and baseline checkpoints, and to identify specific strengths and weaknesses in dialogue, Chinese-language generation, long-output stability, and structured (JSON) output.
Customize prompt sets for different evaluation tasks, integrate automated metrics alongside human scoring, or expand the capability judgment criteria to include more specific use cases or performance indicators relevant to the LLM's application.
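As one example of an automated metric that could sit alongside human scoring, the JSON capability dimension lends itself to a mechanical check. The sketch below is an assumption about how such a check might look, not part of the original pipeline: the function names are hypothetical, and it treats an output as valid only if the whole string parses with Python's standard `json` module.

```python
import json
from typing import Iterable


def is_valid_json(text: str) -> bool:
    """Return True if the entire string parses as JSON (any JSON value, not only objects)."""
    try:
        json.loads(text)
        return True
    except (json.JSONDecodeError, TypeError):
        return False


def json_validity_rate(outputs: Iterable[str]) -> float:
    """Fraction of model outputs that parse as JSON; 0.0 for an empty batch."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(is_valid_json(o) for o in outputs) / len(outputs)
```

A check like this only measures syntactic validity. Whether the JSON has the right fields or sensible values would still need a schema check or human review.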