Large Language Model Structured Output Evaluation Flow

ML & AI · flowchart diagram · MIT

This flowchart illustrates a structured evaluation process for Large Language Models (LLMs) to assess their ability to produce accurate and well-formed JSO

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
LLM Evaluation, Structured Data, JSON Output, Model Selection, AI Testing, Machine Learning

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    B1["4 models<br>qwen_base / qwen_lora / microlm_sft / microlm_lora"] --> B2["40 structured prompts"]
    B2 --> B3["Unified automatic checks"]
    B3 --> B4["Parse% / Strict% / Alias-Strict%"]
    B3 --> B5["Missing fields / hallucinated fields / entity used as key / overlapping field names"]
    B4 --> B6["Cross-model comparison report"]
    B5 --> B6
    B6 --> B7["Deployment decision<br>Recommend qwen_lora"]

    classDef eval fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    classDef metric fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    classDef out fill:#f5f3ff,stroke:#8b5cf6,color:#0f172a;
    class B1,B2,B3 eval;
    class B4,B5 metric;
    class B6,B7 out;

What this diagram shows

This diagram shows a structured-output evaluation pipeline that compares four model variants (qwen_base, qwen_lora, microlm_sft, microlm_lora). Each model is run on the same 40 structured prompts, and every output passes through one unified set of automatic checks. The checks produce two kinds of results: aggregate metrics (Parse%, Strict%, and Alias-Strict%) and per-output failure modes (missing fields, hallucinated fields, entities used as keys, and overlapping field names). Both feed a cross-model comparison report, which leads to the deployment decision; in this case the recommendation is qwen_lora.
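The diagram names the metrics and failure modes but does not define them, so the following is only a minimal sketch under assumed definitions: "parse" means the output is a JSON object, "strict" means its keys exactly match the expected schema, and "alias-strict" means they match after mapping known aliases to canonical names. The `ALIASES` table, the `check_output` and `rates` helpers, and the interpretation of "overlapping field names" as two keys resolving to the same canonical field are all hypothetical and may differ from the original project.

```python
import json

# Hypothetical alias table: alternative key names -> canonical field names.
ALIASES = {"full_name": "name", "years": "age"}


def check_output(raw, expected_keys, entities=()):
    """Classify one model output. Every definition here is an assumption."""
    result = {"parse": False, "strict": False, "alias_strict": False,
              "missing": [], "hallucinated": [], "entity_as_key": [], "overlap": []}
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return result              # not valid JSON: nothing else can be checked
    if not isinstance(data, dict):
        return result              # valid JSON, but not an object
    result["parse"] = True

    expected = set(expected_keys)
    entity_set = set(entities)
    result["strict"] = set(data) == expected

    # Map each key to its canonical name before the lenient comparison.
    canonical = [ALIASES.get(key, key) for key in data]
    canon_set = set(canonical)
    no_duplicates = len(canonical) == len(canon_set)
    result["alias_strict"] = canon_set == expected and no_duplicates

    extra = canon_set - expected
    result["missing"] = sorted(expected - canon_set)
    # An unexpected key that is really an entity value from the prompt.
    result["entity_as_key"] = sorted(k for k in extra if k in entity_set)
    result["hallucinated"] = sorted(k for k in extra if k not in entity_set)
    # Assumed meaning of "overlap": several keys resolve to one canonical field.
    result["overlap"] = sorted({c for c in canonical if canonical.count(c) > 1})
    return result


def rates(results):
    """Aggregate per-output booleans into percentage metrics."""
    total = len(results)
    return {metric: 100.0 * sum(r[metric] for r in results) / total
            for metric in ("parse", "strict", "alias_strict")}
```

Under these assumptions, an output such as `{"Ada": {"age": 36}}` checked against the expected keys `name` and `age` with `Ada` listed as an entity would parse successfully, fail both strictness checks, and be tagged with two missing fields plus one entity used as a key.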

When to use it

Use this diagram when evaluating and comparing Large Language Models for tasks requiring precise, structured output, such as JSON generation. It's ideal for selecting the best-performing model for deployment based on objective, quantifiable metrics rather than subjective assessment.

How to adapt it for your project

This evaluation flow can be adapted by changing the number and types of models under test, modifying the structured prompts to target specific use cases or output formats, or introducing additional evaluation metrics. The detection criteria for errors can be refined, and the reporting format can be customized to highlight different aspects of model performance.
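One way to keep those adaptation points independent is to pass the models, prompts, and metrics in as plain data, so each can be swapped without touching the loop. The sketch below is illustrative only: `evaluate`, `recommend`, `parses`, and the stub models are hypothetical names, the models are stand-in callables rather than real inference code, and picking the winner by a single metric is a simplifying assumption.

```python
import json
from typing import Callable, Dict, List

Model = Callable[[str], str]    # prompt -> raw model output
Metric = Callable[[str], bool]  # raw output -> pass/fail


def parses(raw: str) -> bool:
    """Example metric: the output is a JSON object."""
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False


def evaluate(models: Dict[str, Model], prompts: List[str],
             metrics: Dict[str, Metric]) -> Dict[str, Dict[str, float]]:
    """Run every model on every prompt and report each metric as a percentage."""
    report = {}
    for name, generate in models.items():
        outputs = [generate(prompt) for prompt in prompts]
        report[name] = {
            metric_name: 100.0 * sum(check(o) for o in outputs) / len(outputs)
            for metric_name, check in metrics.items()
        }
    return report


def recommend(report: Dict[str, Dict[str, float]], metric: str) -> str:
    """Pick the model with the highest score on one chosen metric."""
    return max(report, key=lambda name: report[name][metric])


# Stub models standing in for real inference calls (assumption).
stub_models = {
    "model_a": lambda prompt: '{"answer": 1}',
    "model_b": lambda prompt: "sorry, I cannot do that",
}
stub_report = evaluate(stub_models, ["prompt 1", "prompt 2"], {"Parse%": parses})
```

Adding a model means adding a dictionary entry, adding a metric means adding one function, and a different output format only requires replacing the metric functions.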

Key concepts