vLLM Server Management

Deploy production-grade LLM inference servers with vLLM, a high-throughput open-source serving engine built on PagedAttention and continuous batching.
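
A minimal offline-inference sketch of the engine described above (the checkpoint name is an illustrative assumption; any model vLLM supports will do):

```python
from vllm import LLM, SamplingParams

# The engine loads the model once; PagedAttention manages the KV cache and
# continuous batching schedules concurrent requests without manual padding.
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")  # assumed example checkpoint

params = SamplingParams(temperature=0.7, max_tokens=128)
outputs = llm.generate(["Explain continuous batching in one sentence."], params)
print(outputs[0].outputs[0].text)
```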

When to Use This Skill

Use this skill when:

  • Serving open-source LLMs (Llama, Mistral, Qwen, Gemma) at scale
  • Building an OpenAI-compatible API endpoint for self-hosted models (see the client sketch after this list)
  • Optimizing LLM throughput and latency for production traffic
  • Running multi-GPU inference with tensor or pipeline parallelism
  • Deploying quantized models to reduce GPU memory requirements (this and the multi-GPU item are sketched below)
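
Because vLLM's API server speaks the OpenAI wire protocol, the stock openai client works against it unchanged. A sketch, assuming a server already running locally; the base_url, placeholder api_key, and model name are assumptions for a default deployment:

```python
from openai import OpenAI

# Point the standard OpenAI client at the vLLM server
# (port 8000 is vLLM's default; the key is unused but required by the client).
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2.5-7B-Instruct",  # assumed served model
    messages=[{"role": "user", "content": "Summarize PagedAttention briefly."}],
    max_tokens=128,
)
print(resp.choices[0].message.content)
```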
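
And a sketch combining the multi-GPU and quantization items via vLLM's Python engine arguments (checkpoint and values are illustrative; quantization="awq" requires an AWQ-quantized checkpoint):

```python
from vllm import LLM

# Shard weights across two GPUs (tensor parallelism) and load a 4-bit AWQ
# checkpoint to cut per-GPU memory; gpu_memory_utilization caps how much
# VRAM the engine claims, leaving headroom for other processes.
llm = LLM(
    model="TheBloke/Llama-2-70B-Chat-AWQ",  # assumed AWQ checkpoint
    quantization="awq",
    tensor_parallel_size=2,
    gpu_memory_utilization=0.90,
)
```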

Prerequisites

  • NVIDIA GPU(s) with CUDA 12.1+ (A100/H100 recommended for production)
  • Docker or Python 3.9+ with pip
  • 40GB+ VRAM for quantized 70B models; 8GB+ for quantized 7B models (see the sizing estimate after this list)
  • nvidia-container-toolkit for Docker GPU passthrough
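
The VRAM figures above can be sanity-checked with a back-of-envelope weight estimate (a rough sketch only; it ignores the KV cache and activation overhead that vLLM also needs room for):

```python
def approx_weight_gib(params_billion: float, bits_per_param: int) -> float:
    """Rough weight footprint: params * bits / 8 bytes, expressed in GiB."""
    return params_billion * 1e9 * bits_per_param / 8 / 2**30

print(approx_weight_gib(7, 16))   # ~13.0 GiB: 7B in fp16 needs quantization to fit 8GB
print(approx_weight_gib(70, 4))   # ~32.6 GiB: 70B at 4-bit fits the 40GB figure
```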