InstructIE Six-Step Data Pipeline

Data Pipelines · flowchart diagram · MIT

Illustrates the six-step InstructIE data pipeline, which turns 171K raw records into a structured, auditable training set of 28.5K examples for LLMs.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
Data Processing · LLM Training · Data Pipeline · JSONL · Normalization · Sampling

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f0fdf4", "primaryBorderColor": "#22c55e", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    Q1["InstructIE raw data<br>171K"] --> Q2["01_normalize.py<br>Field normalization"]
    Q2 --> Q3["02_filter.py<br>Hard filter + per-topic P99"]
    Q3 --> Q4["03_quality_tier.py<br>high / medium / low"]
    Q4 --> Q5["04_derive_tasks.py<br>Derive 4 task types"]
    Q5 --> Q6["05_stratified_sample.py<br>Stratified sampling by task/topic/quality"]
    Q6 --> Q7["06_to_chat_jsonl.py<br>Convert to chat JSONL"]
    Q7 --> Q8["Outputs<br>train.jsonl / valid.jsonl / metadata.json"]

    classDef qdata fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef qpipe fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    classDef out fill:#fff7ed,stroke:#fb923c,color:#0f172a;
    class Q1 qdata;
    class Q2,Q3,Q4,Q5,Q6,Q7 qpipe;
    class Q8 out;

What this diagram shows

This flowchart traces the InstructIE data pipeline from 171K raw records through six sequential processing steps: field normalization, hard filtering with a per-topic P99 threshold, quality-tier assignment (high/medium/low), derivation of four task types, stratified sampling by task, topic, and quality, and conversion to chat JSONL format. The pipeline produces three artifacts: train.jsonl, valid.jsonl, and metadata.json.
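
The per-topic P99 step can be illustrated with a minimal Python sketch. The diagram does not say which quantity the percentile is taken over, so this example assumes it is text length and that records above their own topic's 99th percentile are dropped. The field names `topic` and `text` and the function names are hypothetical, not taken from `02_filter.py`.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Nearest-rank percentile: the smallest value with at least q% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def filter_per_topic_p99(records, q=99):
    """Drop records whose text length exceeds the q-th percentile of their own topic.

    Assumption: the percentile is computed over len(record["text"]);
    the real pipeline may use a different measure.
    """
    lengths_by_topic = defaultdict(list)
    for rec in records:
        lengths_by_topic[rec["topic"]].append(len(rec["text"]))

    # One cutoff per topic, so a topic with long texts is not penalized
    # by the length distribution of a topic with short texts.
    cutoffs = {topic: percentile(lengths, q) for topic, lengths in lengths_by_topic.items()}

    kept = [rec for rec in records if len(rec["text"]) <= cutoffs[rec["topic"]]]
    return kept, cutoffs
```

Computing the cutoff separately for each topic keeps outlier removal proportional within every topic, which a single global threshold would not guarantee.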

When to use it

Use this diagram when designing or explaining a robust, auditable data processing pipeline for large language model (LLM) training, particularly for tasks requiring structured output. It's suitable for scenarios where data quality, traceability, and systematic transformation are critical.

How to adapt it for your project

This pipeline can be adapted by modifying filtering thresholds in `conf.py`, adjusting quality tiering logic, defining different task derivation strategies, or implementing alternative stratified sampling methods. The modular design allows for easy swapping or modification of individual steps to suit specific dataset characteristics or model training requirements.
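
As one example of swapping in an alternative sampling method, the sketch below draws a fixed quota from each (task, topic, quality) stratum. It is an illustration under stated assumptions, not the logic of `05_stratified_sample.py`: the record fields `task`, `topic`, and `quality`, the function name, and the `per_stratum` and `seed` parameters are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, keys=("task", "topic", "quality"), per_stratum=2, seed=0):
    """Sample up to `per_stratum` records from each stratum defined by `keys`.

    Assumption: every record is a dict containing all fields named in `keys`.
    """
    rng = random.Random(seed)  # local RNG so results are reproducible

    # Group records by their (task, topic, quality) combination.
    strata = defaultdict(list)
    for rec in records:
        strata[tuple(rec[k] for k in keys)].append(rec)

    sample = []
    # Iterate strata in sorted order so the output does not depend on input order of groups.
    for key in sorted(strata):
        group = strata[key]
        k = min(per_stratum, len(group))  # small strata contribute everything they have
        sample.extend(rng.sample(group, k))
    return sample
```

A fixed quota favors balance across strata over preserving the original proportions; a proportional allocation would be the natural variant if the source distribution should be kept.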

Key concepts