# vLLM Server Management
Deploy production-grade LLM inference servers with vLLM — a high-throughput, open-source LLM serving engine built on PagedAttention and continuous batching.
## When to Use This Skill
Use this skill when:
- Serving open-source LLMs (Llama, Mistral, Qwen, Gemma) at scale
- Building an OpenAI-compatible API endpoint for self-hosted models (see the client sketch after this list)
- Optimizing LLM throughput and latency for production traffic
- Running multi-GPU inference with tensor or pipeline parallelism
- Deploying quantized models to reduce GPU memory requirements
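For the OpenAI-compatible use case, the sketch below shows a minimal client call. It assumes a vLLM server is already listening on `localhost:8000` (for example, one started with `vllm serve` or the official Docker image) and that the model name matches whatever the server was launched with; the model name, port, and prompt are illustrative values, not fixed by this skill.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM endpoint.
# vLLM ignores the api_key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed: must match the served model
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
    temperature=0.2,
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI wire protocol, existing OpenAI SDK code typically needs only the `base_url` (and model name) changed to target the self-hosted server.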
## Prerequisites
- NVIDIA GPU(s) with CUDA 12.1+ (A100/H100 recommended for production)
- Docker or Python 3.9+ with pip
- 40GB+ VRAM for 4-bit-quantized 70B models (roughly 140GB+ across GPUs at FP16); 8GB+ for quantized 7B models (about 16GB at FP16)
- `nvidia-container-toolkit` for Docker GPU passthrough
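A quick pre-flight check such as the sketch below can confirm that CUDA is visible and report per-GPU VRAM before launching a server. It assumes PyTorch (installed alongside vLLM) is importable; the VRAM figures above are rough guides, not hard limits enforced by this check.

```python
import torch

# Fail fast if no CUDA-capable GPU is visible to this environment.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected; vLLM requires an NVIDIA GPU.")

# Report each visible GPU and its total memory so you can pick a model size
# and quantization level that fits.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    vram_gib = props.total_memory / 1024**3
    print(f"GPU {i}: {props.name}, {vram_gib:.1f} GiB VRAM")
```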