This flowchart illustrates a structured evaluation process for Large Language Models (LLMs) to assess their ability to produce accurate and well-formed JSON.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
B1["4 models<br>qwen_base / qwen_lora / microlm_sft / microlm_lora"] --> B2["40 structured prompts"]
B2 --> B3["Unified automatic detection"]
B3 --> B4["Parse% / Strict% / Alias-Strict%"]
B3 --> B5["Missing fields / Hallucinated fields / Entity used as key / Field-name overlap"]
B4 --> B6["Cross-model comparison report"]
B5 --> B6
B6 --> B7["Deployment decision<br>Recommend qwen_lora"]
classDef eval fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
classDef metric fill:#fff7ed,stroke:#fb923c,color:#0f172a;
classDef out fill:#f5f3ff,stroke:#8b5cf6,color:#0f172a;
class B1,B2,B3 eval;
class B4,B5 metric;
class B6,B7 out;
This diagram details a structured evaluation pipeline that compares four model variants (qwen_base, qwen_lora, microlm_sft, microlm_lora). The models are evaluated on 40 structured prompts, and their outputs pass through a single automated checker. The checker reports three rates (Parse%, Strict%, and Alias-Strict%) and flags four failure modes: missing fields, hallucinated fields, entities used as keys, and overlapping field names. Both sets of results feed a cross-model comparison report, which leads to the deployment decision, here a recommendation of qwen_lora.
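The diagram names the metrics but does not define them, so the following is only a minimal sketch under stated assumptions: Parse% is taken to be the share of outputs that parse as JSON, Strict% the share whose keys exactly match the expected schema, and Alias-Strict% the same check after mapping known aliases to canonical field names. `EXPECTED_KEYS`, `ALIASES`, `classify`, and `summarize` are hypothetical names introduced for illustration, and the entity-as-key and field-name-overlap checks are left out.

```python
import json
from collections import Counter

# Hypothetical schema and alias table; the real field names are not given in the source.
EXPECTED_KEYS = {"name", "date", "amount"}
ALIASES = {"title": "name", "total": "amount"}


def classify(raw: str) -> dict:
    """Classify one model output under the assumed metric definitions."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return {"parsed": False, "strict": False, "alias_strict": False,
                "missing": set(), "hallucinated": set()}
    if not isinstance(obj, dict):
        # Valid JSON but not an object, so every expected field counts as missing.
        return {"parsed": True, "strict": False, "alias_strict": False,
                "missing": set(EXPECTED_KEYS), "hallucinated": set()}
    keys = set(obj)
    # Map alias keys to their canonical names before the relaxed comparison.
    canonical = {ALIASES.get(k, k) for k in keys}
    return {
        "parsed": True,
        "strict": keys == EXPECTED_KEYS,
        "alias_strict": canonical == EXPECTED_KEYS,
        "missing": EXPECTED_KEYS - canonical,
        "hallucinated": canonical - EXPECTED_KEYS,
    }


def summarize(outputs: list[str]) -> dict:
    """Aggregate per-output results into percentages and failure-mode counts."""
    results = [classify(o) for o in outputs]
    n = len(results)

    def pct(key: str) -> float:
        return 100.0 * sum(r[key] for r in results) / n if n else 0.0

    return {
        "Parse%": pct("parsed"),
        "Strict%": pct("strict"),
        "Alias-Strict%": pct("alias_strict"),
        "missing": Counter(f for r in results for f in r["missing"]),
        "hallucinated": Counter(f for r in results for f in r["hallucinated"]),
    }
```

Calling `summarize` once per model on that model's outputs would give one row of the comparison report. Keeping the strict and alias-tolerant checks separate shows how much of a model's error is naming drift rather than structural failure.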
Use this diagram when evaluating and comparing Large Language Models for tasks requiring precise, structured output, such as JSON generation. It's ideal for selecting the best-performing model for deployment based on objective, quantifiable metrics rather than subjective assessment.
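As a rough illustration of turning those metrics into a selection, the sketch below ranks models and prints a small comparison table. The priority order (Strict% first, then Alias-Strict%, then Parse%) is an assumption, not something the diagram specifies, and `ModelScore`, `rank`, and `report` are hypothetical names. The model names and numbers in the usage lines are placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelScore:
    model: str
    parse_pct: float
    strict_pct: float
    alias_strict_pct: float


def rank(scores: list[ModelScore]) -> list[ModelScore]:
    """Sort best-first by Strict%, then Alias-Strict%, then Parse% (assumed priority)."""
    return sorted(
        scores,
        key=lambda s: (s.strict_pct, s.alias_strict_pct, s.parse_pct),
        reverse=True,
    )


def report(scores: list[ModelScore]) -> str:
    """Render a plain-text comparison table with the top-ranked model first."""
    header = f"{'model':<14}{'Parse%':>8}{'Strict%':>9}{'Alias-Strict%':>15}"
    rows = [
        f"{s.model:<14}{s.parse_pct:>8.1f}{s.strict_pct:>9.1f}{s.alias_strict_pct:>15.1f}"
        for s in rank(scores)
    ]
    return "\n".join([header, *rows])


# Placeholder values for illustration only.
scores = [
    ModelScore("model_a", 90.0, 50.0, 60.0),
    ModelScore("model_b", 80.0, 70.0, 75.0),
]
print(report(scores))
```

A fixed, explicit sort key keeps the deployment decision reproducible. If parse failures are more costly than schema drift for a given use case, the key order can be changed accordingly.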
This evaluation flow can be adapted by changing the number and types of models under test, modifying the structured prompts to target specific use cases or output formats, or introducing additional evaluation metrics. The detection criteria for errors can be refined, and the reporting format can be customized to highlight different aspects of model performance.