Production-grade Retrieval-Augmented Generation pipeline showing offline ingest (chunk + embed + store) and online query (retrieve + rerank + generate) with a feedback loop for continuous evals.
flowchart LR
subgraph Ingest[Ingest Pipeline]
D1[Docs / Wiki]
D2[PDFs]
D3[Code & Tickets]
EXT[Document Loaders]
CHUNK[Chunker + Cleaner]
EMBED1[Embedding Model]
end
subgraph Store[Vector Store]
VEC[(Vector DB: pgvector / Pinecone)]
META[(Metadata Index)]
end
subgraph Query[Query Pipeline]
Q[User Query]
EMBED2[Embedding Model]
RET[Hybrid Retriever]
RR[Reranker]
LLM[LLM with Context]
ANS[Answer + Citations]
end
subgraph Eval[Evaluation Loop]
FB[User Feedback]
EVAL[Offline Evals]
end
D1 --> EXT
D2 --> EXT
D3 --> EXT
EXT --> CHUNK --> EMBED1 --> VEC
CHUNK --> META
Q --> EMBED2 --> RET
META --> RET
VEC --> RET
RET --> RR --> LLM --> ANS
ANS --> FB --> EVAL
EVAL -.-> CHUNK
EVAL -.-> RR
A RAG architecture in two halves. The ingest pipeline pulls source documents (wikis, PDFs, code, tickets) through loaders, normalizes and chunks them, computes embeddings, and writes vectors plus filterable metadata to a vector store. At query time, the user's question is embedded, the hybrid retriever searches both the vector index and the metadata index, a cross-encoder reranks the candidates for precision, and the top results are passed to an LLM as grounded context. Answers are returned with citations. User feedback feeds an evaluation loop whose results are used to tune the chunker and the reranker over time.
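To make the two halves concrete, here is a minimal sketch in plain Python. It is not a production implementation: `embed` is a toy stand-in for a real embedding model (normalized term counts), the "vector store" is an in-memory list, and the reranker and the LLM call are left out, so the sketch stops at building the grounded prompt. All names (`Chunk`, `ingest`, `retrieve`, `build_prompt`) and the sample documents are hypothetical.

```python
from __future__ import annotations

import math
import re
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    source: str  # metadata kept for filtering and citations
    vector: dict = field(default_factory=dict)


def chunk_text(text: str, size: int = 12) -> list[str]:
    """Naive fixed-size chunking by word count, with no overlap."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> dict:
    """Toy stand-in for an embedding model: L2-normalized term counts."""
    counts = Counter(re.findall(r"[a-z0-9]+", text.lower()))
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {term: c / norm for term, c in counts.items()}


def cosine(a: dict, b: dict) -> float:
    """Dot product of two already-normalized sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def ingest(docs: dict[str, str]) -> list[Chunk]:
    """Offline half: chunk each document, embed each chunk, store it."""
    store = []
    for source, text in docs.items():
        for piece in chunk_text(text):
            store.append(Chunk(piece, source, embed(piece)))
    return store


def retrieve(store: list[Chunk], query: str, k: int = 2,
             source: str | None = None) -> list[Chunk]:
    """Online half: embed the query, apply a metadata filter, rank by similarity."""
    qv = embed(query)
    candidates = [c for c in store if source is None or c.source == source]
    return sorted(candidates, key=lambda c: cosine(qv, c.vector), reverse=True)[:k]


def build_prompt(query: str, chunks: list[Chunk]) -> str:
    """Number the retrieved chunks so the model can cite them."""
    context = "\n".join(
        f"[{i + 1}] ({c.source}) {c.text}" for i, c in enumerate(chunks)
    )
    return (
        "Answer using only the numbered context and cite it.\n"
        f"{context}\nQuestion: {query}"
    )


# Hypothetical sample corpus.
docs = {
    "wiki/vacation.md": "Employees accrue vacation days monthly. "
                        "Unused vacation days roll over to the next year.",
    "wiki/vpn.md": "Connect to the VPN before opening the internal dashboard. "
                   "VPN access requires a hardware token.",
}
store = ingest(docs)
top = retrieve(store, "How do I get VPN access?", k=1)
print(top[0].source)  # wiki/vpn.md
print(build_prompt("How do I get VPN access?", top))
```

In a real deployment, `embed` would call the same embedding model on both the ingest and query paths, and the list would be replaced by a vector database with a metadata index, as in the diagram.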
Use RAG whenever the LLM needs to answer questions over a corpus that does not fit in its context window or that updates frequently — internal documentation Q&A, customer support copilots, code search, and compliance assistants. It is also the right starting point when fine-tuning is overkill or when answers must be grounded in citable sources for audit reasons.
If you are early in a project, start with a single source, naive chunking, and pgvector; you can ship in days, and the architecture extends cleanly from there. Add the reranker once you can measure precision@k slipping below your bar. For multi-tenant SaaS, store a tenant_id on every chunk and filter at retrieval time. For long documents, switch from naive splitting to semantic or structured chunking. If you started with dense-only retrieval, move to hybrid (BM25 + dense) when recall on exact terms such as identifiers, error codes, and product names matters more than semantic similarity. For latency-sensitive UX, cache popular query embeddings and stream the LLM response.
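Two of these steps are easy to pin down in code. The sketch below shows one way to compute precision@k from labeled relevance judgments, and one common way to combine a BM25 ranking with a dense ranking: reciprocal rank fusion (RRF). RRF is an assumption here, not something this architecture prescribes; weighted score blending is another option. The function names, the chunk IDs, and the constant `k=60` (a commonly used default) are illustrative.

```python
from __future__ import annotations


def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunk IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for cid in retrieved[:k] if cid in relevant) / k


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over rankings of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, cid in enumerate(ranking, start=1):
            scores[cid] = scores.get(cid, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical rankings for one query, best match first.
bm25_ranking = ["c3", "c1", "c7"]
dense_ranking = ["c1", "c4", "c3"]
relevant = {"c1", "c4"}  # hand-labeled relevant chunks for this query

fused = rrf_fuse([bm25_ranking, dense_ranking])
print(fused)
print(precision_at_k(bm25_ranking, relevant, k=3))
print(precision_at_k(fused, relevant, k=3))
```

Tracking precision@k over a fixed set of labeled queries before and after a change (adding the reranker, switching to hybrid) gives a concrete number to compare against the bar mentioned above.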