Illustrates the deployment and inference pipeline for a Qwen model fine-tuned with LoRA, served via vLLM's OpenAI-compatible API, handling HTTP requests an
%%{init: {"theme": "base", "themeVariables": {"background": "#ffffff", "primaryColor": "#f5f3ff", "primaryBorderColor": "#8b5cf6", "primaryTextColor": "#0f172a", "lineColor": "#64748b"}}}%%
flowchart TB
M["Qwen base + LoRA adaptor"] --> EX["export_final_model.py<br>merge_and_unload"]
EX --> HF["merged HF model dir"]
HF --> SV["serve_vllm.sh<br>启动 OpenAI-compatible API"]
SV --> REQ["HTTP request<br>messages / prompt / response_format"]
REQ --> RT["vLLM runtime<br>tokenizer → prefill → decode"]
RT --> RESP["JSON / text response"]
classDef deploy fill:#f5f3ff,stroke:#8b5cf6,color:#0f172a;
classDef api fill:#eff6ff,stroke:#60a5fa,color:#0f172a;
classDef out fill:#f0fdf4,stroke:#22c55e,color:#0f172a;
class M,EX,HF,SV,RT deploy;
class REQ api;
class RESP out;
This diagram details the full inference path for a Qwen base model with a LoRA adaptor, from merging the model and exporting it to a Hugging Face directory, to serving it via vLLM's OpenAI-compatible API. It covers the HTTP request and response flow, including tokenization, prefill, and decode stages within the vLLM runtime.
Use this diagram to understand how to deploy and serve a fine-tuned Large Language Model (LLM) like Qwen using vLLM for high-throughput, low-latency inference. It is particularly relevant when an OpenAI-compatible API endpoint is desired for integration with other applications or services.
This pattern can be adapted for serving other Hugging Face-compatible LLMs with vLLM. The model merging step might vary depending on the specific fine-tuning method. The `response_format=json_object` feature highlights a key aspect for structured output scenarios, which can be customized or removed based on application needs.