RAG Workflow for Conversational AI

ML & AI · sequence diagram · unknown license

This sequence diagram illustrates a Retrieval-Augmented Generation (RAG) workflow for conversational AI, from user voice input to synthesized speech output.

Source: https://github.com/hhhhhhjs/Qiniu-Project/blob/203717020cacecd5d17c5233e359898c9fb7f088/README/ProjectDraft.md
Curated by hhhhhhjs
RAG LLM ASR TTS Milvus Qwen Conversational AI

Mermaid source

sequenceDiagram
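    participant U as User
    participant ASR as Whisper ASR
    participant C as Content cleaning (add/delete/rewrite)
    participant I as Intent recognition (LLM/BERT)
    participant M as Milvus retrieval
    participant R as Evidence fusion / Prompt
    participant Q as Qwen3-32B
    participant T as Fish-Speech

    U->>ASR: Audio stream
    ASR-->>C: Transcript (with timestamps)
    C-->>I: De-colloquialized text / minor corrections
    I-->>M: Inferred intent / retrieval query
    M-->>R: Recalled candidate documents (Top-k)
    R-->>Q: Structured prompt (role=小美 + evidence)
    Q-->>R: Generated answer (with citations)
    R-->>T: Final text
    T-->>U: Synthesized speech returned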

What this diagram shows

This diagram traces a complete RAG workflow for a voice-enabled conversational AI system. The user's audio stream is transcribed by Whisper ASR, which emits text with timestamps. A content-cleaning step then removes colloquialisms and applies minor corrections. Intent recognition (an LLM or a BERT model) infers the user's intent and produces a retrieval query for Milvus, a vector database. Milvus returns the Top-k candidate documents, which the evidence fusion step merges into a structured prompt that also sets the assistant's role (the persona 小美 in the diagram). Qwen3-32B generates an answer with citations and returns it to the fusion step, which passes the final text to Fish-Speech. Fish-Speech synthesizes speech and returns the audio to the user.
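The flow described above can be sketched as a chain of plain Python callables. This is a minimal sketch, not the project's code: every name here (`Doc`, `Pipeline`, `build_prompt`, and the stage fields `transcribe`, `clean`, `infer_query`, `retrieve`, `generate`, `synthesize`) is hypothetical, and each stage is a stub standing in for Whisper, Milvus, Qwen3-32B, or Fish-Speech rather than a call into those tools.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Doc:
    doc_id: str
    text: str


def build_prompt(role: str, question: str, docs: List[Doc]) -> str:
    """Evidence fusion: number each retrieved document so the model can cite it."""
    evidence = "\n".join(f"[{i}] {d.text}" for i, d in enumerate(docs, start=1))
    return (
        f"Role: {role}\n"
        f"Evidence:\n{evidence}\n"
        f"Question: {question}\n"
        "Answer using the evidence and cite sources as [n]."
    )


@dataclass
class Pipeline:
    # Each field is a swappable stage; the names are illustrative only.
    transcribe: Callable[[bytes], str]         # ASR (Whisper in the diagram)
    clean: Callable[[str], str]                # content cleaning
    infer_query: Callable[[str], str]          # intent recognition -> retrieval query
    retrieve: Callable[[str, int], List[Doc]]  # vector search (Milvus in the diagram)
    generate: Callable[[str], str]             # LLM (Qwen3-32B in the diagram)
    synthesize: Callable[[str], bytes]         # TTS (Fish-Speech in the diagram)

    def run(self, audio: bytes, role: str = "assistant", top_k: int = 3) -> bytes:
        text = self.clean(self.transcribe(audio))
        query = self.infer_query(text)
        docs = self.retrieve(query, top_k)
        answer = self.generate(build_prompt(role, text, docs))
        return self.synthesize(answer)


# Stub stages so the sketch runs without any models or services.
CORPUS = [
    Doc("d1", "Milvus is a vector database."),
    Doc("d2", "Whisper is an ASR model."),
]

demo = Pipeline(
    transcribe=lambda audio: audio.decode("utf-8"),
    clean=lambda text: " ".join(
        w for w in text.split() if w.lower() not in {"um", "uh"}
    ),
    infer_query=lambda text: text.lower(),
    retrieve=lambda query, k: [
        d for d in CORPUS if any(w in d.text.lower() for w in query.split())
    ][:k],
    generate=lambda prompt: "Stub answer [1]",
    synthesize=lambda text: text.encode("utf-8"),
)
```

Calling `demo.run(b"um what is Milvus")` pushes the bytes through all six stubbed stages in the order the diagram shows. Keeping each stage behind a plain callable is one way to make the component swaps discussed elsewhere on this page a one-line change.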

When to use it

Use this diagram to understand or design conversational AI systems that leverage voice input, knowledge retrieval (RAG), and large language models for generating informed responses, particularly in scenarios requiring up-to-date or domain-specific information.

How to adapt it for your project

This workflow can be adapted by swapping individual components: a different ASR engine, an alternative generative LLM (e.g., a GPT model), another vector database (e.g., Pinecone, Weaviate), or a different TTS engine. The content cleaning and intent recognition modules can be customized for specific domains or languages; intent recognition in particular can use either an LLM or a lighter BERT-based classifier, as the diagram notes. The retrieval strategy (e.g., hybrid search, re-ranking) and the prompt design can also be refined.
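As one illustration of refining the retrieval step, hybrid search results are often combined with reciprocal rank fusion (RRF), which scores each document by the sum of 1/(k + rank) over the ranked lists it appears in. RRF is a technique swapped in here for illustration, not something the source diagram specifies. The function name `rrf_fuse`, the document IDs, and the constant `k=60` (a commonly used default) are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(
    rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 3
) -> List[str]:
    """Fuse several ranked lists of document IDs with reciprocal rank fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score (descending); break ties by ID for determinism.
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


# Hypothetical results from a dense (vector) search and a keyword search.
dense_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]
fused = rrf_fuse([dense_hits, keyword_hits])
# fused == ["doc_b", "doc_a", "doc_d"]
```

Because RRF uses only ranks, it avoids having to normalize similarity scores that come from different retrievers. A dedicated re-ranking model could then reorder the fused list before it is turned into the prompt.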

Key concepts

Retrieval-Augmented Generation (RAG)
Automatic Speech Recognition (ASR)
Large Language Models (LLM)
Text-to-Speech (TTS)
Vector Database