LLM Structured Output Evaluation Flow for JSON Generation

ML & AI · flowchart diagram · MIT

This flowchart illustrates a structured evaluation process for comparing Large Language Models (LLMs) on their ability to produce accurate and usable JSON.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
Tags: LLM, AI Evaluation, Structured Data, JSON Output, Machine Learning, Model Benchmarking

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    B1["4 models<br>qwen_base / qwen_lora / microlm_sft / microlm_lora"] --> B2["40 structured prompts"]
    B2 --> B3["Unified automatic checks"]
    B3 --> B4["Parse% / Strict% / Alias-Strict%"]
    B3 --> B5["Missing fields / Hallucinated fields / Entity used as key / Field-name overlap"]
    B4 --> B6["Side-by-side comparison report"]
    B5 --> B6
    B6 --> B7["Deployment decision<br>Recommend qwen_lora"]

    classDef eval fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    classDef metric fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    classDef out fill:#f5f3ff,stroke:#8b5cf6,color:#0f172a;
    class B1,B2,B3 eval;
    class B4,B5 metric;
    class B6,B7 out;

What this diagram shows

The diagram outlines an evaluation pipeline that compares four models (qwen_base, qwen_lora, microlm_sft, microlm_lora). Each model is run on the same 40 structured prompts, and every output passes through one unified automatic check. The check produces two kinds of results: rate metrics (Parse%, Strict%, Alias-Strict%) and error categories (missing fields, hallucinated fields, entities used as keys, and field-name overlap). Both feed a side-by-side comparison report, which drives the deployment decision; in the source diagram the recommendation is `qwen_lora`.
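The diagram names the metrics but does not define them, so the following is only a minimal sketch under stated assumptions: Parse% is taken to mean "the output parses to a JSON object", Strict% to mean "the keys exactly match the expected schema", and Alias-Strict% to mean "the keys match after known aliases are mapped to canonical names". `EXPECTED_KEYS`, `ALIASES`, `check_output`, and `summarize` are hypothetical names introduced for illustration, not part of the original project.

```python
import json

# Hypothetical schema and alias table; the source diagram does not specify them.
EXPECTED_KEYS = {"name", "city", "age"}
ALIASES = {"full_name": "name", "location": "city"}


def check_output(raw: str) -> dict:
    """Classify one model output against the assumed flat schema."""
    result = {"parsed": False, "strict": False, "alias_strict": False,
              "missing": set(), "extra": set()}
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return result          # not valid JSON at all
    if not isinstance(obj, dict):
        return result          # valid JSON, but not an object
    result["parsed"] = True
    keys = set(obj)
    # Assumed definition of Strict: exact key match, no aliases allowed.
    result["strict"] = keys == EXPECTED_KEYS
    # Assumed definition of Alias-Strict: match after alias normalization.
    canonical = {ALIASES.get(k, k) for k in keys}
    result["alias_strict"] = canonical == EXPECTED_KEYS
    result["missing"] = EXPECTED_KEYS - canonical   # missing fields
    result["extra"] = canonical - EXPECTED_KEYS     # candidate hallucinated fields
    return result


def summarize(outputs: list[str]) -> dict:
    """Aggregate per-output checks into percentage metrics for one model."""
    checks = [check_output(o) for o in outputs]
    n = len(checks)

    def pct(field: str) -> float:
        return 100.0 * sum(c[field] for c in checks) / n if n else 0.0

    return {"Parse%": pct("parsed"),
            "Strict%": pct("strict"),
            "Alias-Strict%": pct("alias_strict")}
```

Running `summarize` once per model over the same prompt set would give one row of the comparison report. The "entity used as key" and "field-name overlap" checks are not sketched here because the diagram gives no hint of how they are detected.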

When to use it

Use this diagram when you need to compare several LLMs, or several fine-tuned variants of one model, on how reliably they generate structured output such as JSON. It fits projects where the deployment decision should rest on measured rates and categorized errors rather than on spot checks.
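As one possible way to turn per-model metrics into a recommendation, the sketch below sorts models by a priority order of metrics. The priority order (Strict% first, then Alias-Strict%, then Parse%) is an assumption, and `rank_models` is a hypothetical helper; the source does not say how its recommendation was reached.

```python
def rank_models(report: dict[str, dict[str, float]],
                priority: tuple[str, ...] = ("Strict%", "Alias-Strict%", "Parse%"),
                ) -> list[str]:
    """Return model names best-first, comparing metrics in priority order.

    `report` maps a model name to its metric values. The default priority
    is an illustrative assumption, not the source project's rule.
    """
    return sorted(report,
                  key=lambda name: tuple(report[name][m] for m in priority),
                  reverse=True)
```

Ties on the first metric fall through to the next one, so two models with equal Strict% are separated by Alias-Strict%. A real decision would likely also weigh the error categories, latency, and cost.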

How to adapt it for your project

Adapt the process by changing the set of models under evaluation, varying the number and complexity of the structured prompts, or swapping in metrics that match your target output format (e.g., XML or YAML instead of JSON). The error-detection criteria (missing fields, hallucinated fields, and so on) can also be tuned to your application's requirements.
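One way to support another output format is to keep the scoring code fixed and swap only the parser. The sketch below is illustrative: `PARSERS`, `parse_json_object`, `parse_flat_xml`, and `parse_rate` are hypothetical names, and the XML parser assumes flat records where each child of the root element is one field.

```python
import json
import xml.etree.ElementTree as ET
from typing import Callable, Optional


def parse_json_object(raw: str) -> Optional[dict]:
    """Return the parsed object, or None if raw is not a JSON object."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    return obj if isinstance(obj, dict) else None


def parse_flat_xml(raw: str) -> Optional[dict]:
    """Return {tag: text} for the root's children, or None on malformed XML."""
    try:
        root = ET.fromstring(raw)
    except ET.ParseError:
        return None
    return {child.tag: (child.text or "") for child in root}


# Registry of format name -> parser; add entries to cover more formats.
PARSERS: dict[str, Callable[[str], Optional[dict]]] = {
    "json": parse_json_object,
    "xml": parse_flat_xml,
}


def parse_rate(outputs: list[str], fmt: str) -> float:
    """Percentage of outputs that the parser for `fmt` accepts."""
    if not outputs:
        return 0.0
    parse = PARSERS[fmt]
    return 100.0 * sum(parse(o) is not None for o in outputs) / len(outputs)
```

Because every parser returns a plain dict or `None`, the field-level checks (missing or extra keys) can stay format-agnostic. YAML is left out here only because the Python standard library has no YAML parser.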

Key concepts