This sequence diagram illustrates a Retrieval-Augmented Generation (RAG) workflow for a voice-based conversational AI system, from user voice input to synthesized speech output.
sequenceDiagram
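participant U as User
participant ASR as Whisper ASR
participant C as Content cleaning (add/delete/rewrite)
participant I as Intent recognition (LLM/BERT)
participant M as Milvus retrieval
participant R as Evidence fusion / Prompt
participant Q as Qwen3-32B
participant T as Fish-Speech
U->>ASR: Audio stream
ASR-->>C: Transcript (with timestamps)
C-->>I: De-colloquialized text / minor corrections
I-->>M: Inferred intent / retrieval query
M-->>R: Recalled candidate documents (Top-k)
R-->>Q: Structured prompt (role=小美 + evidence)
Q-->>R: Generated answer (with citations)
R-->>T: Final text
T-->>U: Synthesized speech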
The diagram traces one turn of a voice-enabled conversational AI system built on RAG. The user's audio stream is transcribed by Whisper ASR, which emits text with timestamps. A content cleaning stage then removes spoken-language artifacts such as filler words and applies minor corrections. Intent recognition (LLM- or BERT-based) infers the user's intent and produces the retrieval query for Milvus, a vector database. Milvus returns the Top-k candidate documents, which the evidence fusion stage combines with the assistant persona (role "小美") into a structured prompt for the Qwen3-32B large language model. The model's answer, including citations, goes back to the fusion stage, which passes the final text to Fish-Speech for text-to-speech synthesis. The synthesized audio is then returned to the user.
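The middle of the pipeline (cleaning, Top-k retrieval, and prompt assembly) can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names (`clean_text`, `retrieve_top_k`, `build_prompt`), the filler-word list, and the toy bag-of-words similarity are all hypothetical stand-ins, and no real Whisper, Milvus, or Qwen API is called.

```python
import math
import re
from collections import Counter

# Hypothetical filler words; a real cleaner would be language- and domain-specific.
FILLERS = {"um", "uh", "er", "like"}

def clean_text(transcript: str) -> str:
    """Drop filler words and collapse whitespace (stand-in for the cleaning stage)."""
    words = [w for w in transcript.split() if w.lower().strip(",.") not in FILLERS]
    return " ".join(words)

def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_top_k(query: str, docs: list[str], k: int = 2) -> list[tuple[int, str]]:
    """Return the k most similar documents as (doc_id, text); stands in for Milvus."""
    q = embed(query)
    ranked = sorted(enumerate(docs), key=lambda d: cosine(q, embed(d[1])), reverse=True)
    return ranked[:k]

def build_prompt(role: str, query: str, evidence: list[tuple[int, str]]) -> str:
    """Fuse evidence into a structured prompt with numbered sources for citation."""
    sources = "\n".join(f"[{doc_id}] {text}" for doc_id, text in evidence)
    return (
        f"Role: {role}\n"
        f"Evidence:\n{sources}\n"
        f"Question: {query}\n"
        "Answer using only the evidence and cite sources as [id]."
    )

docs = [
    "Milvus is a vector database for similarity search.",
    "Whisper transcribes speech to text.",
    "Fish-Speech synthesizes speech from text.",
]
query = clean_text("um, what uh is Milvus")
evidence = retrieve_top_k(query, docs, k=2)
prompt = build_prompt("小美", query, evidence)
print(prompt)
```

Numbering each evidence item in the prompt is one simple way to let the model emit citations that the fusion stage can map back to source documents; the diagram itself does not specify how citations are encoded.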
Use this diagram to understand or design conversational AI systems that combine voice input, knowledge retrieval (RAG), and large language models to generate informed responses, particularly where up-to-date or domain-specific information is required.
The workflow can be adapted by swapping individual components: a different ASR engine, an alternative generator LLM (e.g., a GPT model), another vector database (e.g., Pinecone, Weaviate), or a different TTS solution. The content cleaning and intent recognition modules (the latter LLM- or BERT-based in the diagram) can be customized for specific domains or languages. The retrieval strategy (e.g., hybrid search, re-ranking) and the prompt design can also be refined.
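One way to keep components swappable is to have the pipeline depend on small interfaces rather than on concrete engines. The sketch below uses Python's `typing.Protocol` for this. Every class name here (`ASR`, `Generator`, `TTS`, `VoicePipeline`, and the stub implementations) is a hypothetical illustration, not part of any of the tools named above, and the retrieval stages are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Protocol

class ASR(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class Generator(Protocol):
    def generate(self, prompt: str) -> str: ...

class TTS(Protocol):
    def synthesize(self, text: str) -> bytes: ...

@dataclass
class VoicePipeline:
    """Wires the stages together; any object with the right method can be plugged in."""
    asr: ASR
    llm: Generator
    tts: TTS

    def run(self, audio: bytes) -> bytes:
        text = self.asr.transcribe(audio)
        answer = self.llm.generate(f"Question: {text}")
        return self.tts.synthesize(answer)

# Stub components for demonstration only; real ones would wrap actual engines.
class EchoASR:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")

class UpperLLM:
    def generate(self, prompt: str) -> str:
        return prompt.upper()

class TaggedLLM:
    def generate(self, prompt: str) -> str:
        return f"[alt] {prompt}"

class BytesTTS:
    def synthesize(self, text: str) -> bytes:
        return text.encode("utf-8")

pipeline = VoicePipeline(asr=EchoASR(), llm=UpperLLM(), tts=BytesTTS())
out_a = pipeline.run(b"hi")

# Swapping the generator requires no change to the pipeline itself.
pipeline.llm = TaggedLLM()
out_b = pipeline.run(b"hi")
print(out_a, out_b)
```

Because the protocols are structural, the stub classes do not need to inherit from them; a wrapper around a different ASR, LLM, or TTS engine only has to expose the matching method.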