An end-to-end system architecture for a voice-enabled RAG application, detailing the flow from audio input through ASR, intent recognition, RAG, LLM generation, and speech synthesis.
flowchart LR
subgraph Client["Frontend / Client"]
WAKE["Voice wake-up"]
MIC["Audio capture and upload"]
UI["Web or App UI"]
end
subgraph APIGW["Backend API Gateway"]
ROUTER["Routing, authentication, rate limiting"]
end
subgraph ASR["Whisper ASR Service"]
STREAM["Streaming decoding"]
POSTPROC["Text cleanup, disfluency removal, correction"]
end
subgraph INTENT["Intent Layer"]
ZERO["LLM zero-shot intent recognition"]
BERTCLF["BERT intent classifier (planned)"]
end
subgraph RAG["RAG Engine"]
RET["Milvus vector retrieval"]
KB[("Knowledge base and metadata tables")]
FUSION["Reranking and evidence fusion"]
PROMPTCFG["Prompt templates and persona settings (小美)"]
end
subgraph LLM["Qwen3 32B"]
GEN["Answer generation"]
end
subgraph TTS["Fish Speech"]
SYN["Speech synthesis"]
end
WAKE -->|Trigger| UI
UI -->|Audio| APIGW
APIGW --> ASR
ASR --> INTENT
INTENT --> RAG
RAG --> LLM
LLM -->|Text answer| APIGW
APIGW --> TTS
TTS -->|Audio stream| UI
RAG -.->|Retrieval logs and event tracking| KB
A system architecture for a voice-activated Retrieval-Augmented Generation (RAG) application. It illustrates the interaction between client-side components (voice wake-up, audio capture, UI), backend services (API gateway, Whisper ASR, an intent layer that uses LLM zero-shot recognition with a BERT classifier planned for later, a Milvus-backed RAG engine, the Qwen3 32B LLM, Fish Speech TTS), and the knowledge base with its metadata tables.
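The RAG engine stage (retrieval, reranking and evidence fusion, then prompt templating) can be sketched in a few lines of Python. This is a minimal illustration rather than the system's actual implementation: the `Hit` type, the `rerank` and `build_prompt` helpers, and the template wording are all hypothetical, score-based sorting stands in for whatever reranker is really used, and the in-memory list replaces a real Milvus query.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Hit:
    """One retrieved passage (hypothetical shape, not a Milvus type)."""
    doc_id: str
    text: str
    score: float  # similarity score; higher means more relevant


# Hypothetical template: the diagram names a persona but not the prompt wording.
PROMPT_TEMPLATE = (
    "You are {persona}, a helpful voice assistant.\n"
    "Answer using only the evidence below.\n\n"
    "Evidence:\n{evidence}\n\n"
    "Question: {question}\n"
)


def rerank(hits: List[Hit], top_k: int = 3) -> List[Hit]:
    """Sort by score (descending), keep one hit per doc_id, return at most top_k."""
    seen = set()
    ranked: List[Hit] = []
    for hit in sorted(hits, key=lambda h: h.score, reverse=True):
        if len(ranked) >= top_k:
            break
        if hit.doc_id in seen:
            continue  # a better-scoring passage from this document was already kept
        seen.add(hit.doc_id)
        ranked.append(hit)
    return ranked


def build_prompt(question: str, hits: List[Hit], persona: str = "小美", top_k: int = 3) -> str:
    """Fuse the top-ranked evidence into a single numbered block and fill the template."""
    evidence = "\n".join(
        f"[{i}] {hit.text}" for i, hit in enumerate(rerank(hits, top_k), start=1)
    )
    return PROMPT_TEMPLATE.format(persona=persona, evidence=evidence, question=question)
```

Numbering the evidence lines makes it straightforward for the generation step to cite which passage supports an answer, which is one common reason to fuse evidence before prompting.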
Use this diagram when designing or documenting a conversational AI system, a voice assistant, or any application that integrates speech processing, RAG, and large language models to provide intelligent, context-aware responses.
This architecture can be adapted by swapping out specific services (e.g., different ASR, LLM, or TTS providers), modifying the RAG components (e.g., using a different vector database or re-ranking strategy), or enhancing the intent recognition layer with more sophisticated models or domain-specific classifiers.
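One way to keep the services swappable is to treat each stage as a plain callable behind a small pipeline object. The sketch below assumes that approach; `VoicePipeline`, its field names, and the `fake_*` stubs are hypothetical placeholders, and none of them call the real Whisper, Milvus, Qwen3, or Fish Speech APIs.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class VoicePipeline:
    """Each field is one stage from the diagram, expressed as a callable."""
    transcribe: Callable[[bytes], str]            # ASR: audio -> text
    classify_intent: Callable[[str], str]         # intent layer: text -> label
    retrieve: Callable[[str, str], List[str]]     # RAG: (query, intent) -> evidence
    generate: Callable[[str, List[str]], str]     # LLM: (query, evidence) -> answer
    synthesize: Callable[[str], bytes]            # TTS: answer -> audio

    def handle(self, audio: bytes) -> bytes:
        """Run one request through ASR, intent, RAG, LLM, and TTS in order."""
        text = self.transcribe(audio)
        intent = self.classify_intent(text)
        evidence = self.retrieve(text, intent)
        answer = self.generate(text, evidence)
        return self.synthesize(answer)


# Stub stages for illustration only; real services would replace each one.
def fake_asr(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_intent(text: str) -> str:
    return "faq" if "?" in text else "chitchat"


def fake_retrieve(query: str, intent: str) -> List[str]:
    return ["Store hours: 9-18"] if intent == "faq" else []


def fake_llm(query: str, evidence: List[str]) -> str:
    return f"{query} -> {'; '.join(evidence) or 'no evidence'}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


pipeline = VoicePipeline(fake_asr, fake_intent, fake_retrieve, fake_llm, fake_tts)

# Swapping a stage (for example, a different intent classifier) leaves the rest untouched.
chitchat_only = replace(pipeline, classify_intent=lambda text: "chitchat")
```

Calling `pipeline.handle(b"When do you open?")` runs the stubbed flow end to end, while `chitchat_only` shows that replacing the intent stage changes what the retrieval step receives without touching the other stages.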