Production RAG Pipeline (Ingest → Embed → Store → Retrieve → Generate)

ML & AI · flowchart diagram · MIT

Production-grade Retrieval-Augmented Generation pipeline showing offline ingest (chunk + embed + store) and online query (retrieve + rerank + generate) with a feedback loop for continuous evals.

Source: https://docs.langchain.com/docs/use-cases/question-answering
Curated by Archigram editorial
rag llm vector-database embeddings ai retrieval

Mermaid source

flowchart LR
    subgraph Ingest[Ingest Pipeline]
        D1[Docs / Wiki]
        D2[PDFs]
        D3[Code & Tickets]
        EXT[Document Loaders]
        CHUNK[Chunker + Cleaner]
        EMBED1[Embedding Model]
    end

    subgraph Store[Vector Store]
        VEC[(Vector DB: pgvector / Pinecone)]
        META[(Metadata Index)]
    end

    subgraph Query[Query Pipeline]
        Q[User Query]
        EMBED2[Embedding Model]
        RET[Hybrid Retriever]
        RR[Reranker]
        LLM[LLM with Context]
        ANS[Answer + Citations]
    end

    subgraph Eval[Evaluation Loop]
        FB[User Feedback]
        EVAL[Offline Evals]
    end

    D1 --> EXT
    D2 --> EXT
    D3 --> EXT
    EXT --> CHUNK --> EMBED1 --> VEC
    CHUNK --> META

    Q --> EMBED2 --> RET
    META --> RET
    VEC --> RET
    RET --> RR --> LLM --> ANS
    ANS --> FB --> EVAL
    EVAL -.-> CHUNK
    EVAL -.-> RR

What this diagram shows

A RAG architecture in two halves. The ingest pipeline pulls source documents (wikis, PDFs, code, tickets) through loaders, cleans and chunks them, computes embeddings, and writes vectors plus filterable metadata to a vector store. At query time, the user's question is embedded and the hybrid retriever pulls candidate chunks from both the vector index and the metadata index. A cross-encoder reranker reorders the candidates for precision, and the top chunks are passed to an LLM as grounding context. Answers are returned with citations. User feedback feeds an offline evaluation loop, and its results are used to tune the chunker and the reranker over time.
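The two halves described above can be sketched in a few lines of Python. This is a minimal illustration under loud assumptions, not the pipeline any particular framework implements: `embed` is a toy hashed bag-of-words function standing in for a real embedding model, `ToyVectorStore` is an in-memory list standing in for pgvector or Pinecone, and the names `chunk`, `Record`, and `build_prompt` are hypothetical. The reranker and the LLM call are left out.

```python
import hashlib
import math
import re
from dataclasses import dataclass, field
from typing import Optional

DIM = 256  # toy dimensionality (assumption); real embedding models define their own


def embed(text: str) -> list[float]:
    """Toy stand-in for an embedding model: hashed bag of words, L2-normalized."""
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", text.lower()):
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def chunk(text: str, size: int = 8) -> list[str]:
    """Naive chunker: split on whitespace and group `size` words per chunk."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


@dataclass
class Record:
    text: str
    vector: list[float]
    metadata: dict


@dataclass
class ToyVectorStore:
    """In-memory stand-in for the vector DB plus metadata index."""
    records: list[Record] = field(default_factory=list)

    def ingest(self, doc: str, metadata: dict) -> None:
        # Offline half: chunk, embed, and store each piece with its metadata.
        for piece in chunk(doc):
            self.records.append(Record(piece, embed(piece), dict(metadata)))

    def retrieve(self, query: str, k: int = 3,
                 where: Optional[dict] = None) -> list[Record]:
        # Online half: embed the query, filter on metadata, rank by similarity.
        qvec = embed(query)
        filters = where or {}
        candidates = [
            r for r in self.records
            if all(r.metadata.get(key) == value for key, value in filters.items())
        ]

        def score(record: Record) -> float:
            # Vectors are unit length, so the dot product is the cosine similarity.
            return sum(a * b for a, b in zip(qvec, record.vector))

        return sorted(candidates, key=score, reverse=True)[:k]


def build_prompt(query: str, records: list[Record]) -> str:
    """Number each retrieved chunk so the answer can cite it."""
    context = "\n".join(
        f"[{i}] ({r.metadata.get('source', 'unknown')}) {r.text}"
        for i, r in enumerate(records, start=1)
    )
    return (
        "Answer using only the context and cite sources by number.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    store = ToyVectorStore()
    store.ingest("Reset your password from the account settings page.", {"source": "wiki"})
    store.ingest("Invoices are emailed on the first day of each month.", {"source": "tickets"})
    hits = store.retrieve("how do I reset my password", k=2)
    print(build_prompt("How do I reset my password?", hits))
```

In a real deployment the string returned by `build_prompt` would be sent to the LLM, and a reranker would sit between `retrieve` and `build_prompt`.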

When to use it

Use RAG when the LLM needs to answer questions over a corpus that does not fit in its context window or that updates frequently: internal documentation Q&A, customer support copilots, code search, and compliance assistants. It is also the right starting point when fine-tuning is overkill or when answers must be grounded in citable sources for audit reasons.

How to adapt it for your project

If you are early, start with a single source, naive chunking, and pgvector: you can ship in days, and the architecture composes cleanly as you add stages later. Add the reranker once you can measure precision@k slipping below your bar. For multi-tenant SaaS, store a tenant_id on every chunk and filter on it at retrieval time. For long documents, switch from naive splitting to semantic or structured chunking. Replace the dense retriever with hybrid (BM25 + dense) when factual recall on exact terms (names, identifiers, error codes) matters, since keyword matching catches terms that dense embeddings can blur. For latency-sensitive UX, cache popular query embeddings and stream the LLM response.
