vllm-deploy-docker
vLLM Docker Deployment
A Claude skill describing how to deploy vLLM with Docker, either using the official pre-built images or building the image from source, on NVIDIA GPUs with CUDA. Instructions include an example `docker run` command, a minimal `docker-compose` snippet, recommended flags, and troubleshooting notes. For AMD, Intel, or other accelerators, refer to the vLLM documentation for alternative deployment methods.
What this skill does
- Deploy vLLM with Docker using pre-built images (recommended for most users) or build from source for custom configurations
- Provide example commands for running the OpenAI-compatible server with GPU access and mounted Hugging Face cache
- Point to build-from-source instructions when a custom image or optional dependencies are needed
- Explain common flags: `--ipc=host`, shared cache mounts, and `HF_TOKEN` handling (see the quickstart example below)
Prerequisites
- Docker Engine installed (Docker 20.10+ recommended)
- NVIDIA GPU(s) with a recent driver and the NVIDIA Container Toolkit installed (the vLLM image bundles its own CUDA runtime, so the host CUDA toolkit is not required)
- Optional: `curl` for API tests
- A Hugging Face token (`HF_TOKEN`) if pulling private models or to avoid rate limits
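Before pulling the vLLM image, it is worth verifying that Docker can see the GPU at all. A minimal check, assuming the NVIDIA Container Toolkit is configured (the CUDA base image tag is illustrative; any tag compatible with your driver works):

```bash
# Sanity check: run nvidia-smi inside a throwaway CUDA container.
# If this prints your GPU table, Docker GPU passthrough is working.
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
```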
Quickstart using the pre-built image (recommended)
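A minimal invocation of the official `vllm/vllm-openai` image, following the pattern in the vLLM documentation. The model name is an example; substitute any model your token can access. Arguments after the image name are forwarded to the vLLM server:

```bash
# Run the OpenAI-compatible server from the official pre-built image.
# --gpus all   expose all NVIDIA GPUs to the container
# --ipc=host   share the host IPC namespace; PyTorch needs the extra shared
#              memory, especially for tensor parallelism
# -v ...       mount the Hugging Face cache so downloaded weights persist
# --env ...    pass HF_TOKEN for gated or private models
docker run --gpus all \
    --ipc=host \
    -p 8000:8000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=${HF_TOKEN}" \
    vllm/vllm-openai:latest \
    --model Qwen/Qwen2.5-1.5B-Instruct
```

The server listens on port 8000 and exposes the standard OpenAI-compatible endpoints (`/v1/models`, `/v1/completions`, `/v1/chat/completions`).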
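Once the logs show the server is ready, a quick smoke test with `curl` (the `model` field must match the model passed above):

```bash
# List the models the server exposes
curl http://localhost:8000/v1/models

# Send a minimal chat completion request
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "Qwen/Qwen2.5-1.5B-Instruct",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}]
    }'
```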
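The same configuration as a minimal `docker-compose` snippet, sketched assuming Docker Compose v2 with the `deploy.resources` GPU reservation syntax:

```yaml
# docker-compose.yaml -- equivalent to the docker run example above
services:
  vllm:
    image: vllm/vllm-openai:latest
    command: ["--model", "Qwen/Qwen2.5-1.5B-Instruct"]
    ports:
      - "8000:8000"
    ipc: host
    environment:
      - HF_TOKEN=${HF_TOKEN}
    volumes:
      - ~/.cache/huggingface:/root/.cache/huggingface
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```

Start it with `docker compose up -d` and test with the same `curl` commands as above.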