LLM Evaluation Path for MicroLM

ML & AI · flowchart diagram · MIT

A flowchart illustrating the self-developed evaluation path for LLMs, comparing pretrain, baseline, and LoRA checkpoints using fixed prompts and human scoring.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
LLM AI Machine Learning Evaluation Fine-tuning LoRA Prompt

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
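    A1["pretrain / baseline / lora checkpoint"] --> A2["Fixed prompt set"]
    A2 --> A3["Generate outputs"]
    A3 --> A4["Human scoring / quality comparison"]
    A4 --> A5["Capability boundary assessment<br>Conversational awareness · Chinese output · Long-output stability · JSON capability"]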

    classDef eval fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef out fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    class A1,A2,A3,A4 eval;
    class A5 out;

What this diagram shows

This diagram shows a self-developed evaluation process for large language models (LLMs) that compares three kinds of model checkpoint: pretrain, baseline, and LoRA. Each checkpoint is run on the same fixed prompt set, its outputs are scored and compared by human reviewers, and the results are used to judge where the model's capabilities end in four areas: conversational awareness, Chinese output, long-output stability, and JSON generation.
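
The flow from checkpoints to a human review sheet can be sketched in Python. This is a minimal sketch under stated assumptions, not MicroLM's actual code: `run_fixed_prompts`, the `GenerateFn` type, the row fields, and the toy generator functions are all hypothetical names introduced for illustration. In practice each generator would wrap a real checkpoint.

```python
from typing import Callable, Dict, List

# Hypothetical stand-in for a model call: maps a prompt to generated text.
GenerateFn = Callable[[str], str]


def run_fixed_prompts(
    checkpoints: Dict[str, GenerateFn],
    prompts: List[str],
) -> List[dict]:
    """Run every checkpoint on the same prompts and return rows for human review."""
    rows = []
    for prompt_id, prompt in enumerate(prompts):
        for name, generate in checkpoints.items():
            rows.append({
                "prompt_id": prompt_id,
                "checkpoint": name,
                "prompt": prompt,
                "output": generate(prompt),
                "human_score": None,  # left empty; a reviewer fills this in later
            })
    return rows


# Toy generators (assumptions) so the sketch runs without any model weights.
checkpoints = {
    "pretrain": lambda p: p + " ...",
    "baseline": lambda p: "Answer: " + p,
    "lora": lambda p: '{"answer": "' + p + '"}',
}
prompts = [
    "你好，请介绍一下你自己。",
    "Return a JSON object with a key named answer.",
]
rows = run_fixed_prompts(checkpoints, prompts)
```

Grouping rows by `prompt_id` puts the outputs of all checkpoints for one prompt next to each other, which is what makes side-by-side quality comparison practical.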

When to use it

Use this diagram when designing an evaluation pipeline for small or fine-tuned language models, especially when comparing training strategies such as LoRA or SFT. It suits cases where you need to assess specific capabilities side by side and identify where a model falls short.
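
One way to turn human scores into a side-by-side comparison is to average them per checkpoint and capability. The sketch below assumes a 1 to 5 rating scale and a `(checkpoint, capability, score)` tuple layout. Both are assumptions, and the ratings are made-up values that only show the shape of the data.

```python
from collections import defaultdict
from statistics import mean


def mean_scores(ratings):
    """Average scores per (checkpoint, capability) pair.

    ratings: iterable of (checkpoint, capability, score) tuples.
    """
    buckets = defaultdict(list)
    for checkpoint, capability, score in ratings:
        buckets[(checkpoint, capability)].append(score)
    return {key: mean(values) for key, values in buckets.items()}


# Made-up illustrative ratings on an assumed 1-5 scale, not real results.
ratings = [
    ("baseline", "json", 2), ("baseline", "json", 3),
    ("lora", "json", 4), ("lora", "json", 5),
    ("baseline", "chinese_output", 4), ("lora", "chinese_output", 4),
]
summary = mean_scores(ratings)
```

A capability where every checkpoint scores low points to a limit of the model itself, while a gap between checkpoints points to the effect of the training strategy.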

How to adapt it for your project

To adapt it, replace the fixed prompt set (node A2) with prompts relevant to your use case or domain. The human scoring and quality comparison step (node A4) can be augmented with automated metrics where applicable. The capability boundary assessment (node A5) can be extended with other capabilities that matter for the model's application.
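
As an example of automated metrics that could sit alongside human scoring, the sketch below checks two of the capabilities named in the diagram: whether an output parses as JSON, and what share of its characters are Chinese. The function names and the choice of the CJK Unified Ideographs range (U+4E00 to U+9FFF) as a proxy for Chinese output are assumptions made for illustration.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as JSON (any JSON value, not only objects)."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def auto_checks(output: str) -> dict:
    """Collect cheap automated signals for one generated output."""
    return {
        "valid_json": is_valid_json(output.strip()),
        "cjk_ratio": cjk_ratio(output),
        "length": len(output),
    }
```

Checks like these are coarse filters. They can flag outputs that clearly fail, but judging answer quality still falls to the human scoring step.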

Key concepts