This flowchart illustrates a structured evaluation process for comparing Large Language Models (LLMs) on their ability to produce accurate and usable JSON.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
B1["4 models<br>qwen_base / qwen_lora / microlm_sft / microlm_lora"] --> B2["40 structured prompts"]
B2 --> B3["Unified automated checks"]
B3 --> B4["Parse% / Strict% / Alias-Strict%"]
B3 --> B5["Missing fields / Hallucinated fields / Entity used as key / Field-name overlap"]
B4 --> B6["Side-by-side comparison report"]
B5 --> B6
B6 --> B7["Deployment decision<br>Recommend qwen_lora"]
classDef eval fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
classDef metric fill:#fff7ed,stroke:#fb923c,color:#0f172a;
classDef out fill:#f5f3ff,stroke:#8b5cf6,color:#0f172a;
class B1,B2,B3 eval;
class B4,B5 metric;
class B6,B7 out;
The diagram outlines an evaluation pipeline that compares four models (qwen_base, qwen_lora, microlm_sft, microlm_lora). Each model receives the same 40 structured prompts, and every output passes through the same automated checks. The checks yield two kinds of results: quantitative rates (Parse%, Strict%, and Alias-Strict%) and error categories (missing fields, hallucinated fields, entities used as keys, and field-name overlap). Both feed a side-by-side comparison report, which informs the deployment decision; in this case the recommendation is `qwen_lora`.
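The automated checks and the three rates could be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual implementation: the diagram does not define the metrics, so the sketch assumes Parse% means "the output is valid JSON", Strict% means "the keys exactly match the expected fields", and Alias-Strict% means "the keys match after mapping known aliases to canonical names". `EXPECTED_FIELDS`, `ALIASES`, `check_output`, and `summarize` are hypothetical names, and the field-name overlap check is left out because its definition is not given.

```python
import json

# Hypothetical schema: the source does not list the expected fields.
EXPECTED_FIELDS = {"name", "date", "amount"}
# Hypothetical alias map (alias -> canonical field name).
ALIASES = {"title": "name", "total": "amount"}


def check_output(raw: str, entities: set) -> dict:
    """Run the assumed per-output checks on one raw model response."""
    failed = {"strict": False, "alias_strict": False,
              "missing": set(), "hallucinated": set(), "entity_keys": set()}
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {"parsed": False, **failed}
    if not isinstance(obj, dict):
        # Valid JSON but not an object: parsed, yet every field is missing.
        return {"parsed": True, **failed, "missing": set(EXPECTED_FIELDS)}

    keys = set(obj)
    canonical = {ALIASES.get(k, k) for k in keys}
    # Keys that are really entity values from the prompt, not field names.
    entity_keys = keys & entities
    return {
        "parsed": True,
        "strict": keys == EXPECTED_FIELDS,
        "alias_strict": canonical == EXPECTED_FIELDS,
        "missing": EXPECTED_FIELDS - canonical,
        "hallucinated": canonical - EXPECTED_FIELDS - entity_keys,
        "entity_keys": entity_keys,
    }


def summarize(results: list) -> dict:
    """Aggregate per-output results into the three percentage metrics."""
    n = len(results)
    if n == 0:
        return {"Parse%": 0.0, "Strict%": 0.0, "Alias-Strict%": 0.0}

    def pct(key: str) -> float:
        return 100.0 * sum(r[key] for r in results) / n

    return {"Parse%": pct("parsed"),
            "Strict%": pct("strict"),
            "Alias-Strict%": pct("alias_strict")}


# Four made-up outputs standing in for one model's responses.
outputs = [
    '{"name": "A", "date": "2024-01-01", "amount": 3}',  # strict match
    '{"title": "A", "date": "2024-01-01", "total": 3}',  # matches via aliases
    '{"name": "A", "Acme Corp": 3}',                      # entity used as key
    'not json',                                           # parse failure
]
results = [check_output(o, entities={"Acme Corp"}) for o in outputs]
summary = summarize(results)
print(summary)  # {'Parse%': 75.0, 'Strict%': 25.0, 'Alias-Strict%': 50.0}
```

Running the same two functions over each model's 40 outputs would give one summary per model, which is the raw material for the comparison report.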
Use this diagram when comparing LLMs or fine-tuned variants on structured output generation, especially JSON. It suits AI/ML projects where objective metrics must inform a deployment decision.
The process can be adapted by changing the set of models under evaluation, varying the number and complexity of the structured prompts, or introducing metrics specific to another output format (e.g., XML, YAML). The error-detection criteria (missing fields, hallucinated fields) can also be customized to fit specific application requirements.
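One way to make the format swappable is to pass the parser in as a parameter, so the parse-rate check stays the same while the format changes. The sketch below is an assumption about how such an adaptation might look, using only the standard library; `parse_rate` is a hypothetical helper, and the sample strings are made up.

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, Iterable


def parse_rate(outputs: Iterable[str], parser: Callable[[str], object]) -> float:
    """Percentage of outputs that the given parser accepts without raising."""
    items = list(outputs)
    if not items:
        return 0.0
    ok = 0
    for raw in items:
        try:
            parser(raw)
            ok += 1
        except (ValueError, ET.ParseError):
            # json.JSONDecodeError subclasses ValueError;
            # ElementTree raises ET.ParseError on malformed XML.
            pass
    return 100.0 * ok / len(items)


# Same check, two formats: one valid and one malformed sample each.
json_rate = parse_rate(['{"a": 1}', '{a: 1}'], json.loads)
xml_rate = parse_rate(['<r><a>1</a></r>', '<r><a>1</r>'], ET.fromstring)
print(json_rate, xml_rate)  # 50.0 50.0
```

A YAML parser from a third-party library could be passed in the same way, provided its exception type is added to the `except` clause.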