InstructIE Six-Step Data Pipeline

Data Pipelines · flowchart diagram · MIT

This diagram illustrates the InstructIE six-step data pipeline, transforming 171K raw data into 28.5K structured training data through an auditable, engine

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
Data Processing · Machine Learning · LLM Training · Data Engineering · InstructIE · JSONL · Pipeline

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    Q1["InstructIE raw data<br>171K"] --> Q2["01_normalize.py<br>Field normalization"]
    Q2 --> Q3["02_filter.py<br>Hard filtering + per-topic P99"]
    Q3 --> Q4["03_quality_tier.py<br>high / medium / low"]
    Q4 --> Q5["04_derive_tasks.py<br>Derive 4 task types"]
    Q5 --> Q6["05_stratified_sample.py<br>Stratified sampling by task/topic/quality"]
    Q6 --> Q7["06_to_chat_jsonl.py<br>Convert to chat JSONL"]
    Q7 --> Q8["Outputs<br>train.jsonl / valid.jsonl / metadata.json"]

    classDef qdata fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef qpipe fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    classDef out fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    class Q1 qdata;
    class Q2,Q3,Q4,Q5,Q6,Q7 qpipe;
    class Q8 out;

What this diagram shows

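The diagram traces the six-step InstructIE data processing pipeline. Starting from the 171K raw dataset, each script feeds the next: field normalization (01_normalize.py), hard filtering with a per-topic P99 threshold (02_filter.py), quality tiering into high, medium, and low (03_quality_tier.py), derivation of four task types (04_derive_tasks.py), stratified sampling by task, topic, and quality (05_stratified_sample.py), and conversion to chat JSONL (06_to_chat_jsonl.py). The final outputs are train.jsonl, valid.jsonl, and metadata.json.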
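
To make the filtering step more concrete, here is a minimal sketch of a hard filter followed by a per-topic P99 cutoff. It is not the code from 02_filter.py: the record shape (`topic` and `text` keys), the choice of text length as the filtered quantity, the `min_chars` parameter, and both function names are assumptions made for illustration.

```python
from collections import defaultdict


def percentile_nearest_rank(values, pct):
    """Return the nearest-rank percentile (pct is an integer 1..100)."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 avoids floating-point rounding issues.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def filter_per_topic_p99(records, min_chars=1):
    """Apply a hard filter, then drop records above their topic's P99 length.

    Assumed (hypothetical) record shape: {"topic": str, "text": str}.
    Returns the kept records and the per-topic cutoffs for auditing.
    """
    # Hard filter: discard records whose text is missing or too short.
    kept = [r for r in records if len(r.get("text") or "") >= min_chars]

    # Group text lengths by topic so that each topic gets its own cutoff.
    lengths = defaultdict(list)
    for r in kept:
        lengths[r["topic"]].append(len(r["text"]))
    cutoffs = {t: percentile_nearest_rank(v, 99) for t, v in lengths.items()}

    # Keep only records at or below the P99 length of their own topic.
    kept = [r for r in kept if len(r["text"]) <= cutoffs[r["topic"]]]
    return kept, cutoffs
```

Computing the cutoff per topic rather than globally means a topic with naturally long texts is not penalized by shorter topics. Returning the cutoffs alongside the data lets them be logged, which supports the auditability goal.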

When to use it

Use this diagram to understand or design a robust, auditable, and reproducible data pipeline for machine learning model training, particularly for structured output tasks or when preparing large datasets.

How to adapt it for your project

This pipeline can be adapted by modifying filtering thresholds, adding new quality tiers, deriving different types of tasks, or adjusting sampling strategies based on specific dataset characteristics and model training requirements.
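
As one way to picture an adjustable sampling strategy, the sketch below draws a quota from each (task, topic, quality) stratum and scales that quota by quality tier. It is an illustration under stated assumptions, not the logic of 05_stratified_sample.py: the record keys (`task`, `topic`, `quality`), the `per_stratum` and `tier_weights` parameters, and the function name are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_stratum, seed=0, tier_weights=None):
    """Sample up to a quota from each (task, topic, quality) stratum.

    Assumed (hypothetical) record shape:
    {"task": str, "topic": str, "quality": "high" | "medium" | "low", ...}
    tier_weights scales the per-stratum quota by quality tier.
    """
    tier_weights = tier_weights or {"high": 1.0, "medium": 1.0, "low": 1.0}
    rng = random.Random(seed)  # a fixed seed keeps the sample reproducible

    # Bucket records by their stratum key.
    strata = defaultdict(list)
    for r in records:
        strata[(r["task"], r["topic"], r["quality"])].append(r)

    sample = []
    for key in sorted(strata):  # sorted keys make the draw order deterministic
        quota = int(per_stratum * tier_weights.get(key[2], 0.0))
        bucket = strata[key]
        # Never request more items than the stratum contains.
        sample.extend(rng.sample(bucket, min(quota, len(bucket))))
    return sample
```

Adapting the strategy then amounts to changing `per_stratum` or `tier_weights`, for example `{"high": 1.0, "medium": 0.5, "low": 0.0}` to exclude the low tier entirely. An unknown tier name receives a weight of 0.0, so adding a new tier requires adding a weight for it.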

Key concepts