Voice Chat Assistant Core Data Flow

System Design · sequence diagram · unknown license

Illustrates the core data flow of a voice chat assistant, covering user input, ASR, intent classification, context pooling, knowledge retrieval, and respon

Source: https://github.com/4xiaxia/demo2/blob/25b01b5091beb79cc0919481130da8e105aafb04/readme/PROJECT_XRAY_DIAGNOSIS.md
Curated by 4xiaxia
Tags: Voice Assistant, Chatbot, Conversational AI, System Design, Microservices, Redis, MongoDB

Mermaid source

sequenceDiagram
    participant U as User
    participant F as Frontend
    participant S as Server API
    participant A as Agent A
    participant Pool as Context Pool<br/>(Redis)
    participant B as Agent B
    participant C as Agent C
    participant D as Agent D
    participant DB as MongoDB

    U->>F: Visit /chat?merchant=dongli&userId=uuid123&mode=voice
    F->>S: POST /api/user-enter
    S->>D: System notification: user entered
    D->>DB: Write: user-enter event

    U->>F: Voice input (hold to talk)
    F->>S: POST /api/process-input (audio)
    S->>A: processInput(uuid123, audio)

    A->>A: ASR transcription + intent classification
    A->>Pool: addTurn(user question)
    A->>Bus: publish(A→B)

    Bus->>B: Notify B of new task
    B->>Pool: getRecentTurns(uuid123, 5 turns)

    alt Cache hit
        B->>Pool: findSimilarAnswer()
        Pool-->>B: Return previous answer
        B->>B: Polish reply + TTS
    else Cache miss
        B->>Bus: publish(B→C)
        Bus->>C: Notify C to retrieve
        C->>JSON: Search knowledge base
        C->>Pool: Look up context (when multiple results)
        C->>Bus: publish(C→B, results)
        B->>B: Generate reply + TTS
    end

    B->>Pool: addTurn(assistant reply)
    B->>Bus: publish(B→USER)
    Bus->>S: responseStore.save(traceId)

    D->>DB: Write: full flow log

    F->>S: GET /api/poll-response?traceId=xxx
    S-->>F: Return reply + audioBase64
    F->>U: Play audio + show text

What this diagram shows

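This sequence diagram traces a single voice request end to end. The user opens the chat page and the frontend calls POST /api/user-enter; the Server API notifies Agent D, which writes a user-enter event to MongoDB. When the user speaks, the frontend posts the audio to /api/process-input and the server hands it to Agent A. Agent A runs Automatic Speech Recognition (ASR) and intent classification, adds the user's question to the Redis-backed Context Pool, and publishes a task for Agent B on the Bus. Agent B fetches the five most recent turns and looks for a similar earlier answer in the pool. On a cache hit it polishes that answer and runs text-to-speech (TTS). On a cache miss it asks Agent C, via the Bus, to search the knowledge base (JSON); Agent C consults the Context Pool when the search returns multiple results, then publishes its results back to Agent B, which generates the reply and runs TTS. In both cases Agent B adds the reply to the Context Pool and publishes it on the Bus, which saves it in the server's response store under a trace ID. Agent D writes the full flow log to MongoDB. The frontend polls GET /api/poll-response with that trace ID, then plays the audio and displays the text.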
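The cache-hit and cache-miss branch in Agent B can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the Context Pool is an in-memory stand-in for Redis (real calls would be asynchronous), the `Retriever` function stands in for Agent C, the "similar answer" rule is a hypothetical normalized-text match, and reply polishing and TTS are left out. Only the method names `addTurn`, `getRecentTurns`, and `findSimilarAnswer` come from the diagram; their signatures here are guesses.

```typescript
type Turn = { role: "user" | "assistant"; text: string };
type Reply = { text: string; source: "cache" | "retrieval" };
// Hypothetical stand-in for Agent C: returns candidate answers for a question.
type Retriever = (question: string, context: Turn[]) => string[];

const normalize = (text: string): string => text.trim().toLowerCase();

// In-memory stand-in for the Redis-backed Context Pool (assumption).
class ContextPool {
  private turns = new Map<string, Turn[]>();

  addTurn(userId: string, turn: Turn): void {
    const history = this.turns.get(userId) ?? [];
    history.push(turn);
    this.turns.set(userId, history);
  }

  getRecentTurns(userId: string, limit: number): Turn[] {
    return (this.turns.get(userId) ?? []).slice(-limit);
  }

  // Hypothetical similarity rule: an earlier user turn with the same
  // normalized text, immediately followed by an assistant turn.
  findSimilarAnswer(userId: string, question: string): string | undefined {
    const history = this.turns.get(userId) ?? [];
    const wanted = normalize(question);
    for (let i = 0; i + 1 < history.length; i++) {
      const q = history[i];
      const a = history[i + 1];
      if (q && a && q.role === "user" && a.role === "assistant" && normalize(q.text) === wanted) {
        return a.text;
      }
    }
    return undefined;
  }
}

// Agent B: reuse a cached answer when one exists, otherwise ask the retriever.
function agentBReply(pool: ContextPool, userId: string, question: string, retrieve: Retriever): Reply {
  const context = pool.getRecentTurns(userId, 5);
  const cached = pool.findSimilarAnswer(userId, question);
  const reply: Reply =
    cached !== undefined
      ? { text: cached, source: "cache" }
      : { text: retrieve(question, context)[0] ?? "Sorry, I could not find an answer.", source: "retrieval" };
  pool.addTurn(userId, { role: "assistant", text: reply.text });
  return reply;
}

// Usage: Agent A records the user turn first, then Agent B answers.
const pool = new ContextPool();
const retrieve: Retriever = () => ["Open 9:00 to 18:00."];

pool.addTurn("uuid123", { role: "user", text: "Opening hours?" });
const first = agentBReply(pool, "uuid123", "Opening hours?", retrieve);

pool.addTurn("uuid123", { role: "user", text: "opening hours?" });
const second = agentBReply(pool, "uuid123", "opening hours?", retrieve);
```

In this sketch the first question goes through retrieval because the pool holds no earlier answer, while the repeated question is served from the pool. Because the lookup only matches a user turn that is already followed by an assistant turn, the question that Agent A has just added never matches itself.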

When to use it

This pattern is useful for designing conversational AI systems, voice assistants, chatbots, or customer service automation platforms that require real-time voice processing, context management, knowledge retrieval, and asynchronous communication between microservices or agents. It's particularly relevant for systems needing to maintain conversation history and leverage cached responses.
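The asynchronous hand-off between agents and the polling frontend can be sketched as a small publish/subscribe bus plus a response store keyed by trace ID. This is a sketch under assumptions: the `Bus` and `ResponseStore` classes, their method signatures, and the one-shot `take` behavior are hypothetical, and handlers are dispatched synchronously in memory to keep the example short. Only the topic label `B→USER` and the `responseStore.save(traceId)` call name appear in the diagram.

```typescript
type Handler = (payload: string, traceId: string) => void;

// Minimal in-memory pub/sub bus (assumption: synchronous dispatch).
class Bus {
  private handlers = new Map<string, Handler[]>();

  subscribe(topic: string, handler: Handler): void {
    const list = this.handlers.get(topic) ?? [];
    list.push(handler);
    this.handlers.set(topic, list);
  }

  publish(topic: string, payload: string, traceId: string): void {
    for (const handler of this.handlers.get(topic) ?? []) {
      handler(payload, traceId);
    }
  }
}

// Holds finished replies until the frontend polls for them.
class ResponseStore {
  private replies = new Map<string, string>();

  save(traceId: string, reply: string): void {
    this.replies.set(traceId, reply);
  }

  // Returns the reply once and removes it; undefined means "not ready yet".
  take(traceId: string): string | undefined {
    const reply = this.replies.get(traceId);
    this.replies.delete(traceId);
    return reply;
  }
}

// Usage: the server subscribes to replies addressed to the user.
const bus = new Bus();
const responseStore = new ResponseStore();
bus.subscribe("B→USER", (payload, traceId) => responseStore.save(traceId, payload));

const beforePublish = responseStore.take("trace-1"); // poll before Agent B finishes
bus.publish("B→USER", "Open 9:00 to 18:00.", "trace-1");
const afterPublish = responseStore.take("trace-1"); // poll after Agent B publishes
const secondPoll = responseStore.take("trace-1"); // reply was already consumed
```

Keying replies by trace ID lets the HTTP request that submitted the audio return immediately, while a later poll picks up the reply once the agents have finished.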

How to adapt it for your project

To adapt this diagram, add agents for specific tasks (e.g., sentiment analysis, transaction processing), replace the JSON knowledge base with other sources (e.g., external APIs or retrieval-augmented generation), change the caching strategy, or back the Bus with a dedicated message queue for more robust asynchronous communication. The Context Pool could be extended with more sophisticated session management, and the ASR and TTS components could be swapped for different providers.
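One way to keep the knowledge base swappable is to have Agent C depend on a small interface rather than on the JSON file directly. The sketch below is illustrative only: the `KnowledgeSource` interface, both classes, and the keyword-matching rule are hypothetical and do not come from the original project.

```typescript
// Hypothetical contract that any knowledge backend would implement.
interface KnowledgeSource {
  search(query: string): string[];
}

type FaqEntry = { keywords: string[]; answer: string };

// Stand-in for the JSON knowledge base: naive keyword matching.
class JsonKnowledgeSource implements KnowledgeSource {
  private entries: FaqEntry[];

  constructor(entries: FaqEntry[]) {
    this.entries = entries;
  }

  search(query: string): string[] {
    const text = query.toLowerCase();
    return this.entries
      .filter((entry) => entry.keywords.some((keyword) => text.includes(keyword.toLowerCase())))
      .map((entry) => entry.answer);
  }
}

// Tries each source in order and returns the first non-empty result.
class FallbackKnowledgeSource implements KnowledgeSource {
  private sources: KnowledgeSource[];

  constructor(sources: KnowledgeSource[]) {
    this.sources = sources;
  }

  search(query: string): string[] {
    for (const source of this.sources) {
      const results = source.search(query);
      if (results.length > 0) {
        return results;
      }
    }
    return [];
  }
}

// Usage: an empty primary source falls through to the FAQ source.
const emptySource = new JsonKnowledgeSource([]);
const faqSource = new JsonKnowledgeSource([
  { keywords: ["hours", "open"], answer: "Open 9:00 to 18:00." },
]);
const knowledge = new FallbackKnowledgeSource([emptySource, faqSource]);

const hits = knowledge.search("What are your opening hours?");
const misses = knowledge.search("Can I get a refund?");
```

An external API or a retrieval pipeline could implement the same interface, and the same approach would apply to ASR and TTS providers.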

Key concepts