This diagram illustrates the InstructIE six-step data pipeline, transforming 171K raw data into 28.5K structured training data through an auditable, engine
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
Q1["InstructIE raw data<br>171K"] --> Q2["01_normalize.py<br>Field normalization"]
Q2 --> Q3["02_filter.py<br>Hard filtering + per-topic P99"]
Q3 --> Q4["03_quality_tier.py<br>high / medium / low"]
Q4 --> Q5["04_derive_tasks.py<br>Derive 4 task types"]
Q5 --> Q6["05_stratified_sample.py<br>Stratified sampling by task/topic/quality"]
Q6 --> Q7["06_to_chat_jsonl.py<br>Convert to chat JSONL"]
Q7 --> Q8["Outputs<br>train.jsonl / valid.jsonl / metadata.json"]
classDef qdata fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef qpipe fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
classDef out fill:#fff7ed,stroke:#fb923c,color:#0f172a;
class Q1 qdata;
class Q2,Q3,Q4,Q5,Q6,Q7 qpipe;
class Q8 out;
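The diagram shows the six-step InstructIE data processing pipeline. Starting from 171K raw data, it applies field normalization (01_normalize.py), hard filtering plus per-topic P99 filtering (02_filter.py), quality tiering into high, medium, and low (03_quality_tier.py), derivation of four task types (04_derive_tasks.py), stratified sampling by task, topic, and quality (05_stratified_sample.py), and conversion to chat JSONL (06_to_chat_jsonl.py). The outputs are train.jsonl, valid.jsonl, and metadata.json.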
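The diagram does not say which quantity the per-topic P99 in 02_filter.py is measured on. The following is a minimal sketch, assuming the quantity is text length and assuming hypothetical `topic` and `text` fields on each record. It is not the pipeline's actual code. It computes a nearest-rank 99th percentile per topic and drops records above that cutoff.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list; q is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * q / 100))
    return ordered[rank - 1]


def filter_per_topic_p99(records, topic_key="topic", text_key="text"):
    """Keep records whose text length is at or below their topic's P99.

    The field names and the choice of text length are assumptions made
    for illustration; the source diagram only names "per-topic P99".
    """
    lengths = defaultdict(list)
    for rec in records:
        lengths[rec[topic_key]].append(len(rec[text_key]))

    # One cutoff per topic, so long-form topics are not penalized
    # by the length distribution of short-form topics.
    cutoffs = {topic: percentile(vals, 99) for topic, vals in lengths.items()}

    kept = [r for r in records if len(r[text_key]) <= cutoffs[r[topic_key]]]
    return kept, cutoffs
```

Returning the cutoffs alongside the kept records makes the thresholds easy to log, which fits the auditable intent of the pipeline.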
Use this diagram to understand or design an auditable, reproducible data pipeline for machine learning model training, particularly for structured-output tasks or when preparing large datasets.
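One way to make such a pipeline auditable is to record how many records survive each stage. The sketch below is illustrative only: `run_stages` and `build_metadata` are hypothetical helpers, and the layout of the resulting JSON is an assumption, since the diagram names metadata.json but does not describe its contents.

```python
import json


def run_stages(records, stages):
    """Apply (name, function) stages in order and log the record count after each."""
    audit = [{"stage": "input", "count": len(records)}]
    for name, fn in stages:
        records = fn(records)
        audit.append({"stage": name, "count": len(records)})
    return records, audit


def build_metadata(audit, seed):
    """Serialize the audit trail and the sampling seed as a JSON string.

    The "seed" and "stages" keys are assumed names, not a documented schema.
    """
    return json.dumps({"seed": seed, "stages": audit}, ensure_ascii=False, indent=2)
```

With per-stage counts and the seed stored next to the training files, a later reader can check where records were dropped and rerun the same sampling.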
This pipeline can be adapted by modifying filtering thresholds, adding new quality tiers, deriving different types of tasks, or adjusting sampling strategies based on specific dataset characteristics and model training requirements.
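As an example of adjusting the sampling strategy, the sketch below groups records by a configurable tuple of keys and draws up to a fixed number from each group. It assumes hypothetical `task`, `topic`, and `quality` fields and a flat per-stratum quota. The actual logic in 05_stratified_sample.py is not shown in the diagram and may differ.

```python
import random
from collections import defaultdict


def stratified_sample(records, keys=("task", "topic", "quality"),
                      per_stratum=2, seed=0):
    """Draw up to `per_stratum` records from each stratum defined by `keys`."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in keys)].append(rec)

    sample = []
    # Iterate strata in sorted order so the result does not depend on input order
    # of first appearance of each stratum.
    for key in sorted(strata):
        group = strata[key]
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample
```

Changing `keys` alters what counts as a stratum, and changing `per_stratum` alters the balance between strata. A proportional quota could replace the flat one if the original distribution should be preserved.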