A flowchart illustrating the self-developed evaluation path for LLMs, comparing pretrain, baseline, and LoRA checkpoints using fixed prompts and human scoring.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
A1["pretrain / baseline / LoRA checkpoint"] --> A2["Fixed prompt set"]
A2 --> A3["Generate outputs"]
A3 --> A4["Human scoring / quality comparison"]
A4 --> A5["Capability boundary assessment<br>Conversational awareness · Chinese output · Long-output stability · JSON capability"]
classDef eval fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef out fill:#fff7ed,stroke:#fb923c,color:#0f172a;
class A1,A2,A3,A4 eval;
class A5 out;
This diagram shows a self-developed evaluation process for large language models (LLMs) that compares three model checkpoints: pretrain, baseline, and LoRA. Every checkpoint is run against the same fixed prompt set, the generated outputs are scored and compared by human reviewers, and the scores feed a capability boundary assessment covering conversational awareness, Chinese output, long-output stability, and JSON capability.
Use this diagram when designing an evaluation pipeline for fine-tuned or smaller language models, especially when comparing training strategies such as LoRA or SFT. It suits assessing specific capabilities and identifying where a model's limits lie.
To adapt it, replace the contents of 'Fixed prompt set' with prompts relevant to your use case or domain. 'Human scoring / quality comparison' can be augmented with automated metrics where applicable. 'Capability boundary assessment' can be expanded to include other capabilities critical to the model's application.
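The first three steps of the flow (checkpoints, fixed prompts, generated outputs) can be sketched as a small harness that produces a side-by-side sheet for human reviewers. This is a minimal sketch, not the original pipeline: `build_review_sheet`, `to_csv`, the column names, and the idea that each checkpoint is wrapped as a plain `prompt -> text` callable are all assumptions made for illustration.

```python
import csv
import io
from typing import Callable, Dict, List

# Assumption: each checkpoint is wrapped as a function mapping a prompt to text.
Generator = Callable[[str], str]

FIELDS = ["prompt_id", "checkpoint", "prompt", "output", "score"]


def build_review_sheet(checkpoints: Dict[str, Generator],
                       prompts: List[str]) -> List[dict]:
    """Run every checkpoint on the same fixed prompts, one row per pair."""
    rows = []
    for prompt_id, prompt in enumerate(prompts):
        for name, generate in checkpoints.items():
            rows.append({
                "prompt_id": prompt_id,
                "checkpoint": name,
                "prompt": prompt,
                "output": generate(prompt),
                "score": "",  # left blank for the human reviewer
            })
    return rows


def to_csv(rows: List[dict]) -> str:
    """Serialize the sheet so reviewers can score it in a spreadsheet."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    # Placeholder generators stand in for real model calls.
    demo = {
        "pretrain": lambda p: p + " ...",
        "baseline": lambda p: "Answer: " + p,
        "lora": lambda p: "Answer (LoRA): " + p,
    }
    print(to_csv(build_review_sheet(demo, ["Say hello.", "Return a JSON object."])))
```

Grouping rows by prompt rather than by checkpoint keeps the outputs for one prompt adjacent, which makes the quality comparison step easier to do by eye.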
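As one possible way to add automated metrics next to the human scores, the sketch below computes three cheap signals per output: whether it parses as JSON, what share of its characters are Chinese, and how long it is. The function names and the choice of signals are assumptions for illustration; they are rough proxies and not a replacement for human judgment on conversational quality.

```python
import json


def is_valid_json(text: str) -> bool:
    """True if the whole output parses as JSON (proxy for JSON capability)."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def cjk_ratio(text: str) -> float:
    """Share of non-whitespace characters in the CJK Unified Ideographs
    block (U+4E00 to U+9FFF), a rough proxy for Chinese output."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def auto_checks(output: str) -> dict:
    """Bundle the automated signals to store alongside a human score."""
    return {
        "valid_json": is_valid_json(output),
        "cjk_ratio": round(cjk_ratio(output), 3),
        "length": len(output),  # useful when inspecting long-output behavior
    }


if __name__ == "__main__":
    print(auto_checks('{"answer": "ok"}'))
    print(auto_checks("你好，世界"))
```

Note that `is_valid_json` accepts any JSON value, including a bare number or string; if the prompts require an object, an extra `isinstance(..., dict)` check on the parsed result would be needed.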