MicroLM Self-Developed Evaluation Path

ML & AI · flowchart diagram · MIT

Flowchart detailing the MicroLM evaluation process from model checkpoints and prompt sets to human scoring and capability assessment, focusing on dialogue awareness, Chinese output, long-output stability, and JSON capability.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
LLM AI Evaluation Fine-tuning LoRA Prompt MicroLM

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#eef6ff", "primaryBorderColor": "#60a5fa", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
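    A1["pretrain / baseline / lora checkpoint"] --> A2["fixed prompt set"]
    A2 --> A3["generate outputs"]
    A3 --> A4["human scoring / quality comparison"]
    A4 --> A5["capability-boundary judgment<br>dialogue awareness · Chinese output · long-output stability · JSON capability"]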

    classDef eval fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef out fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    class A1,A2,A3,A4 eval;
    class A5 out;

What this diagram shows

An in-house evaluation pipeline for MicroLM. Each pretrain, baseline, or LoRA checkpoint is run on the same fixed prompt set; the generated outputs are scored by hand and compared for quality; and the results feed a capability-boundary judgment covering dialogue awareness, Chinese output, long-output stability, and JSON capability.
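The flow in the diagram could be sketched roughly as follows. This is a minimal illustration, not MicroLM's actual code: `GenerateFn`, `build_eval_sheet`, and `mean_score_by_checkpoint` are hypothetical names, and the lambda "checkpoints" are toy stand-ins for real model inference.

```python
from collections import defaultdict
from statistics import mean
from typing import Callable, Dict, List

# Hypothetical stand-in: any callable that maps a prompt to generated text.
GenerateFn = Callable[[str], str]


def build_eval_sheet(checkpoints: Dict[str, GenerateFn], prompts: List[str]) -> List[dict]:
    """Run every checkpoint on the same fixed prompts; leave scores blank for raters."""
    rows = []
    for name, generate in checkpoints.items():
        for prompt in prompts:
            rows.append({
                "checkpoint": name,
                "prompt": prompt,
                "output": generate(prompt),
                "score": None,  # filled in later by a human rater
            })
    return rows


def mean_score_by_checkpoint(rows: List[dict]) -> Dict[str, float]:
    """Average the human scores per checkpoint, skipping unscored rows."""
    scores = defaultdict(list)
    for row in rows:
        if row["score"] is not None:
            scores[row["checkpoint"]].append(row["score"])
    return {name: mean(vals) for name, vals in scores.items()}


# Toy generators standing in for real pretrain / baseline / LoRA checkpoints.
checkpoints = {
    "baseline": lambda p: "echo: " + p,
    "lora": lambda p: "answer: " + p,
}
prompts = ["Introduce yourself.", "Return a JSON object with a name field."]
sheet = build_eval_sheet(checkpoints, prompts)
```

Keeping the prompt list identical across checkpoints is what makes the side-by-side quality comparison meaningful; once raters fill in `score`, `mean_score_by_checkpoint(sheet)` gives a per-checkpoint summary.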

When to use it

Use it to evaluate fine-tuned or LoRA-adapted language models against a baseline on identical prompts, and to pinpoint strengths and weaknesses in dialogue, language generation, and structured output.

How to adapt it for your project

Swap in prompt sets tailored to your evaluation tasks, add automated metrics alongside human scoring, or extend the capability criteria with use cases and performance indicators specific to your model's application.
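As one way to add automated metrics next to human scoring, the sketch below gives cheap checks for three of the diagram's criteria: JSON capability, Chinese output, and long-output stability. The function names, the CJK range used, and the thresholds are all assumptions chosen for illustration; they are heuristics, not metrics taken from the MicroLM repository.

```python
import json


def is_valid_json(text: str) -> bool:
    """True if the whole output parses as JSON."""
    try:
        json.loads(text)
        return True
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False


def chinese_ratio(text: str) -> float:
    """Share of non-whitespace characters in the CJK Unified Ideographs block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    cjk = sum(1 for c in chars if "\u4e00" <= c <= "\u9fff")
    return cjk / len(chars)


def has_repeated_ngram(text: str, n: int = 8, max_repeats: int = 3) -> bool:
    """Crude loop detector: does any n-character window occur more than max_repeats times?"""
    counts = {}
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        counts[window] = counts.get(window, 0) + 1
        if counts[window] > max_repeats:
            return True
    return False
```

Checks like these can pre-filter obvious failures (unparseable JSON, output in the wrong language, degenerate repetition) so that human raters spend their time on judgments that need a person, such as dialogue awareness.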

Key concepts

LLM Evaluation
Prompt Engineering
Human Scoring
Model Capability
Fine-tuning