Illustrates the six-step InstructIE data pipeline, which transforms 171K raw samples into 28.5K structured, auditable training samples for LLMs.
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
Q1["InstructIE raw data<br>171K"] --> Q2["01_normalize.py<br>field normalization"]
Q2 --> Q3["02_filter.py<br>hard filters + per-topic P99"]
Q3 --> Q4["03_quality_tier.py<br>high / medium / low"]
Q4 --> Q5["04_derive_tasks.py<br>derive 4 task types"]
Q5 --> Q6["05_stratified_sample.py<br>stratified sampling by task/topic/quality"]
Q6 --> Q7["06_to_chat_jsonl.py<br>convert to chat JSONL"]
Q7 --> Q8["Outputs<br>train.jsonl / valid.jsonl / metadata.json"]
classDef qdata fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef qpipe fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
classDef out fill:#fff7ed,stroke:#fb923c,color:#0f172a;
class Q1 qdata;
class Q2,Q3,Q4,Q5,Q6,Q7 qpipe;
class Q8 out;
This flowchart traces the InstructIE data pipeline from 171K raw samples through six processing steps: field normalization, hard filtering with a per-topic P99 cutoff, quality tier assignment (high/medium/low), derivation of four task types, stratified sampling by task, topic, and quality, and conversion to chat JSONL format. The pipeline outputs train.jsonl, valid.jsonl, and metadata.json.
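The filtering step is the least self-explanatory node in the diagram, so here is a minimal sketch of what "hard filters + per-topic P99" could look like. It is not the actual contents of `02_filter.py`. It assumes that each record is a dict with hypothetical `text` and `topic` fields, that the hard filter is a minimum character count (`min_chars`, an invented parameter), and that P99 refers to text length computed separately per topic with a nearest-rank percentile.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def filter_per_topic_p99(records, min_chars=10):
    """Apply a hard length filter, then drop records above their topic's P99 length.

    The field names and the min_chars threshold are assumptions for illustration.
    """
    # Hard filter: drop empty or very short texts outright.
    kept = [r for r in records if len(r["text"].strip()) >= min_chars]

    # Collect text lengths per topic so each topic gets its own cutoff.
    lengths = defaultdict(list)
    for r in kept:
        lengths[r["topic"]].append(len(r["text"]))
    cutoffs = {topic: percentile(v, 99) for topic, v in lengths.items()}

    # Keep only records at or below the P99 length of their own topic.
    return [r for r in kept if len(r["text"]) <= cutoffs[r["topic"]]]
```

Computing the cutoff per topic rather than globally keeps topics with naturally long texts from losing a disproportionate share of their records.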
Use this diagram when designing or explaining an auditable data processing pipeline for large language model (LLM) training, particularly for tasks that require structured output. It suits scenarios where data quality, traceability, and systematic transformation are critical.
This pipeline can be adapted by modifying filtering thresholds in `conf.py`, adjusting quality tiering logic, defining different task derivation strategies, or implementing alternative stratified sampling methods. The modular design allows for easy swapping or modification of individual steps to suit specific dataset characteristics or model training requirements.
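As one illustration of a swappable step, the sketch below shows a proportional stratified sampler keyed on task, topic, and quality. It is an assumption about how such a step might work, not the logic of `05_stratified_sample.py`: the record fields (`task`, `topic`, `quality`), the `fraction` parameter, and the rule of keeping at least one record per stratum are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, fraction, seed=0, keys=("task", "topic", "quality")):
    """Sample the same fraction from every stratum, keeping at least one record each.

    Field names and the minimum-of-one rule are assumptions for illustration.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    # Group records by their (task, topic, quality) combination.
    strata = defaultdict(list)
    for r in records:
        strata[tuple(r[k] for k in keys)].append(r)

    sample = []
    # Iterate in sorted key order so results do not depend on input order of strata.
    for key in sorted(strata):
        group = strata[key]
        n = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, n))
    return sample
```

An alternative strategy, such as a fixed quota per stratum or oversampling the high-quality tier, would only need to change how `n` is computed.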