Qwen/vLLM Service Inference Path

ML & AI · flowchart diagram · MIT

This diagram illustrates the service-based inference path for a Qwen large language model, leveraging vLLM to expose an OpenAI-compatible HTTP API.

Source: https://github.com/jiaran-king/MicroLM/blob/782ae02f10c14b484a317f22115a066b3b10b91d/Readme/%E9%A1%B9%E7%9B%AE%E5%85%A8%E6%99%AF%E5%9B%BE/00-%E5%85%A8%E6%B5%81%E7%A8%8B%E5%88%86%E6%9E%90%EF%BC%88%E8%AE%AD%E7%BB%83%E3%80%81%E6%8E%A8%E7%90%86%E3%80%81%E8%AF%84%E6%B5%8B%E4%B8%8E%E9%83%A8%E7%BD%B2%EF%BC%89.md
Curated by jiaran-king
Qwen vLLM LLM Inference API Model Serving LoRA

Mermaid source

%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f5f3ff", "primaryBorderColor": "#8b5cf6", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
    M["Qwen base + LoRA adaptor"] --> EX["export_final_model.py<br>merge_and_unload"]
    EX --> HF["merged HF model dir"]
    HF --> SV["serve_vllm.sh<br>启动 OpenAI-compatible API"]
    SV --> REQ["HTTP request<br>messages / prompt / response_format"]
    REQ --> RT["vLLM runtime<br>tokenizer → prefill → decode"]
    RT --> RESP["JSON / text response"]

    classDef deploy fill:#f5f3ff,stroke:#8b5cf6,color:#0f172a;
    classDef api fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
    classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
    class M,EX,HF,SV,RT deploy;
    class REQ api;
    class RESP out;

What this diagram shows

The diagram details the process of deploying a Qwen base model with a LoRA adaptor for service-based inference. It starts with merging the LoRA adaptor into the base model using 'export_final_model.py', resulting in a merged Hugging Face model directory. This model is then served by 'vLLM' via 'serve_vllm.sh', which launches an OpenAI-compatible HTTP API. Incoming HTTP requests containing messages, prompts, and response format specifications are processed by the vLLM runtime, which handles tokenization, prefill, and decode stages, finally returning a JSON or text response.

When to use it

Use this pattern when deploying large language models (LLMs) like Qwen for production inference, requiring high throughput, low latency, and an easily consumable HTTP API. It's ideal for scenarios where a fine-tuned model needs to be exposed as a scalable service, especially when integrating with applications that expect an OpenAI-like API interface.

How to adapt it for your project

This diagram can be adapted by swapping Qwen with other Hugging Face compatible LLMs. The 'export_final_model.py' script can be modified for different merging strategies or model formats. The vLLM serving setup can be scaled horizontally with load balancers, configured for different hardware (e.g., multiple GPUs), or integrated into container orchestration systems like Kubernetes. The API interface can be customized, and additional pre/post-processing steps can be added before or after the vLLM runtime.

Key concepts