System Architecture for Voice-Enabled RAG Application

System Design · flowchart diagram · unknown license

An end-to-end system architecture for a voice-enabled RAG application, detailing the flow from audio input through ASR, intent recognition, RAG, LLM generation, and TTS.

Source: https://github.com/hhhhhhjs/Qiniu-Project/blob/203717020cacecd5d17c5233e359898c9fb7f088/README/ProjectDraft.md
Curated by hhhhhhjs
Tags: Voice Assistant, RAG System, System Architecture, AI, ASR, TTS, LLM

Mermaid source

flowchart LR
subgraph Client["Frontend / Client"]
WAKE["Voice wake-up"]
MIC["Audio capture and upload"]
UI["Web or App UI"]
end


subgraph APIGW["Backend API Gateway"]
ROUTER["Routing, authentication, rate limiting"]
end


subgraph ASR["Whisper ASR Service"]
STREAM["Streaming decoding"]
POSTPROC["Text cleanup, disfluency removal, correction"]
end


subgraph INTENT["Intent Layer"]
ZERO["LLM zero-shot intent recognition"]
BERTCLF["BERT intent classifier (planned)"]
end


subgraph RAG["RAG Engine"]
RET["Milvus vector retrieval"]
KB[("Knowledge base and metadata tables")]
FUSION["Re-ranking and evidence fusion"]
PROMPTCFG["Prompt templates and persona settings (小美)"]
end


subgraph LLM["Qwen3 32B"]
GEN["Answer generation"]
end


subgraph TTS["Fish Speech"]
SYN["Speech synthesis"]
end


WAKE -->|trigger| UI
UI -->|audio| APIGW
APIGW --> ASR
ASR --> INTENT
INTENT --> RAG
RAG --> LLM
LLM -->|text answer| APIGW
APIGW --> TTS
TTS -->|audio stream| UI
RAG -.->|retrieval logs and telemetry| KB

What this diagram shows

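This diagram shows the system architecture of a voice-activated Retrieval Augmented Generation (RAG) application. On the client side, voice wake-up triggers the UI, which sends captured audio to the backend API gateway. The gateway passes the audio to the Whisper ASR service, and the transcript then flows through the intent layer (LLM zero-shot recognition, with a BERT classifier planned for later) into the RAG engine, which combines Milvus vector retrieval over the knowledge base with re-ranking and prompt templates. Qwen3 32B generates the text answer, which returns through the gateway to Fish Speech TTS and reaches the UI as an audio stream.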
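The request path described above can be sketched as a chain of stage functions, one per backend service in the diagram. This is a minimal illustration, not the project's implementation: the `Turn` dataclass, the `handle_request` entry point, the in-memory `KNOWLEDGE_BASE`, and every stage body are hypothetical stand-ins for the Whisper, Milvus, Qwen3, and Fish Speech services.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """State carried through one voice interaction (hypothetical structure)."""
    audio_in: bytes
    transcript: str = ""
    intent: str = ""
    evidence: list[str] = field(default_factory=list)
    answer: str = ""
    audio_out: bytes = b""
    trace: list[str] = field(default_factory=list)


# Made-up entries standing in for the knowledge base behind Milvus.
KNOWLEDGE_BASE = {
    "hours": "The store opens at 9:00.",
    "returns": "Returns are accepted within 30 days.",
}


def asr(turn: Turn) -> Turn:
    # Stand-in for Whisper: pretend the audio bytes are already UTF-8 text.
    turn.transcript = turn.audio_in.decode("utf-8")
    turn.trace.append("ASR")
    return turn


def intent(turn: Turn) -> Turn:
    # Stand-in for zero-shot intent recognition: a trivial punctuation rule.
    turn.intent = "question" if turn.transcript.endswith("?") else "chitchat"
    turn.trace.append("INTENT")
    return turn


def rag(turn: Turn) -> Turn:
    # Stand-in for vector retrieval and re-ranking: plain keyword lookup.
    words = turn.transcript.lower()
    turn.evidence = [text for key, text in KNOWLEDGE_BASE.items() if key in words]
    turn.trace.append("RAG")
    return turn


def llm(turn: Turn) -> Turn:
    # Stand-in for answer generation: concatenate the retrieved evidence.
    turn.answer = " ".join(turn.evidence) or "Sorry, I don't know."
    turn.trace.append("LLM")
    return turn


def tts(turn: Turn) -> Turn:
    # Stand-in for speech synthesis: encode the answer text as bytes.
    turn.audio_out = turn.answer.encode("utf-8")
    turn.trace.append("TTS")
    return turn


def handle_request(audio: bytes) -> Turn:
    """Gateway entry point: runs the stages in the order of the diagram's edges."""
    turn = Turn(audio_in=audio)
    for stage in (asr, intent, rag, llm, tts):
        turn = stage(turn)
    return turn


if __name__ == "__main__":
    result = handle_request(b"What are your hours?")
    print(result.trace)
    print(result.answer)
```

Keeping every stage behind the same `Turn -> Turn` signature means the gateway only needs to know the order of the stages, which mirrors how the diagram routes both the inbound audio and the outbound answer through the API gateway.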

When to use it

Use this diagram when designing or documenting a voice assistant or other conversational AI system that chains speech recognition, intent recognition, retrieval, LLM generation, and speech synthesis to answer spoken queries with retrieved context.

How to adapt it for your project

This architecture can be adapted by swapping out specific services (e.g., different ASR, LLM, or TTS providers), modifying the RAG components (e.g., using a different vector database or re-ranking strategy), or enhancing the intent recognition layer with more sophisticated models or domain-specific classifiers.
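One way to keep such swaps cheap is to have the orchestration code depend on small interfaces rather than on a specific vendor. The sketch below uses Python's `typing.Protocol` for this. All class and method names here (`SpeechToText`, `TextGenerator`, `TextToSpeech`, `VoiceAssistant`, and the toy implementations) are assumptions made for illustration and do not come from the source project.

```python
from typing import Protocol


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class TextGenerator(Protocol):
    def generate(self, prompt: str) -> str: ...


class TextToSpeech(Protocol):
    def synthesize(self, text: str) -> bytes: ...


class VoiceAssistant:
    """Depends only on the three interfaces above, not on any provider."""

    def __init__(self, stt: SpeechToText, llm: TextGenerator, tts: TextToSpeech):
        self.stt = stt
        self.llm = llm
        self.tts = tts

    def reply(self, audio: bytes) -> bytes:
        text = self.stt.transcribe(audio)
        answer = self.llm.generate(text)
        return self.tts.synthesize(answer)


# Toy implementations; real ones would wrap an ASR, LLM, or TTS client.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class UpperLLM:
    def generate(self, prompt: str) -> str:
        return prompt.upper()


class ReverseLLM:
    def generate(self, prompt: str) -> str:
        return prompt[::-1]


class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")


if __name__ == "__main__":
    assistant = VoiceAssistant(EchoSTT(), UpperLLM(), BytesTTS())
    print(assistant.reply(b"hi"))
    # Swapping the generator changes one constructor argument and nothing else.
    assistant = VoiceAssistant(EchoSTT(), ReverseLLM(), BytesTTS())
    print(assistant.reply(b"abc"))
```

Because `Protocol` uses structural typing, a replacement provider only has to expose a method with the matching signature; it does not need to inherit from a shared base class.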

Key concepts

Voice-Enabled AI
Retrieval Augmented Generation (RAG)
Speech Processing (ASR/TTS)
Large Language Models (LLM)
System Integration