# vLLM Server Management
Deploy production-grade LLM inference servers with vLLM — a high-throughput, open-source LLM serving engine built on PagedAttention and continuous batching.
## When to Use This Skill
Use this skill when:
- Serving open-source LLMs (Llama, Mistral, Qwen, Gemma) at scale
- Building an OpenAI-compatible API endpoint for self-hosted models (see the client sketch after this list)
- Optimizing LLM throughput and latency for production traffic
- Running multi-GPU inference with tensor or pipeline parallelism
- Deploying quantized models to reduce GPU memory requirements
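For the OpenAI-compatible use case, the sketch below shows a minimal client call. It assumes a vLLM server is already listening on `localhost:8000` (for example, one started with `vllm serve` or the official Docker image) and that the model name matches whatever the server was launched with; the model name, port, and prompt are illustrative values, not fixed by this skill.

```python
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM endpoint.
# vLLM ignores the api_key unless the server was started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # assumed: must match the served model
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
    temperature=0.2,
)
print(response.choices[0].message.content)
```

Because the endpoint speaks the OpenAI wire protocol, existing OpenAI SDK code typically needs only the `base_url` (and model name) changed to target the self-hosted server.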
## Prerequisites
- NVIDIA GPU(s) with CUDA 12.1+ (A100/H100 recommended for production)
- Docker or Python 3.9+ with pip
- 40GB+ VRAM for 4-bit-quantized 70B models (roughly 140GB+ across GPUs at FP16); 8GB+ for quantized 7B models (about 16GB at FP16)
- `nvidia-container-toolkit` for Docker GPU passthrough
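A quick pre-flight check such as the sketch below can confirm that CUDA is visible and report per-GPU VRAM before launching a server. It assumes PyTorch (installed alongside vLLM) is importable; the VRAM figures above are rough guides, not hard limits enforced by this check.

```python
import torch

# Fail fast if no CUDA-capable GPU is visible to this environment.
if not torch.cuda.is_available():
    raise SystemExit("No CUDA-capable GPU detected; vLLM requires an NVIDIA GPU.")

# Report each visible GPU and its total memory so you can pick a model size
# and quantization level that fits.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    vram_gib = props.total_memory / 1024**3
    print(f"GPU {i}: {props.name}, {vram_gib:.1f} GiB VRAM")
```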