# vllm-deploy-k8s: vLLM Kubernetes Deployment
A Claude skill for deploying vLLM to Kubernetes using YAML templates. Deploys a vLLM OpenAI-compatible server as a Kubernetes Deployment with a ClusterIP Service, GPU resources, and health probes.
## What this skill does
- Deploy vLLM as a Kubernetes Deployment + Service with NVIDIA GPU support
- Check if a vLLM deployment already exists before deploying
- Check if the Hugging Face token secret exists, and ask the user for their token if not
- Use the `vllm/vllm-openai:latest` image by default (the user can specify a different version)
- Provide sensible default configuration that users can customize (model, replicas, GPU count, extra vLLM flags, etc.)
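The Deployment + Service described above might look roughly like the sketch below. The resource names, example model, secret name/key, and probe timings are illustrative assumptions, not the skill's actual template:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-server            # assumed name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-server
  template:
    metadata:
      labels:
        app: vllm-server
    spec:
      containers:
        - name: vllm
          image: vllm/vllm-openai:latest
          args: ["--model", "Qwen/Qwen2.5-1.5B-Instruct"]  # example model
          env:
            - name: HF_TOKEN                # read by vLLM for gated models
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret     # assumed secret name
                  key: token                # assumed key
          ports:
            - containerPort: 8000           # vLLM's OpenAI-compatible API port
          resources:
            limits:
              nvidia.com/gpu: 1             # requires GPU Operator / device plugin
          readinessProbe:
            httpGet:
              path: /health                 # vLLM health endpoint
              port: 8000
            initialDelaySeconds: 60         # model loading can take a while
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 120
---
apiVersion: v1
kind: Service
metadata:
  name: vllm-server
spec:
  type: ClusterIP
  selector:
    app: vllm-server
  ports:
    - port: 8000
      targetPort: 8000
```

The probe delays are deliberately generous: large models can take minutes to download and load, and an aggressive liveness probe would kill the pod mid-load.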
## Prerequisites
- `kubectl` configured with access to a Kubernetes cluster
- NVIDIA GPU Operator or device plugin installed on cluster nodes
- Hugging Face token (required for gated models like Llama, optional for public models)
## Deployment Steps
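A typical flow is: create the Hugging Face token secret (if it does not already exist), apply the Deployment and Service manifests, then wait for the pod to pass its readiness probe. As an illustration, the token secret the skill checks for could be defined like this (the secret name and key are assumptions and must match whatever the Deployment's `secretKeyRef` references):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: hf-token-secret        # assumed name
type: Opaque
stringData:
  token: <your-hf-token>       # placeholder; never commit a real token
```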
## More from vllm-project/vllm-skills

- **vllm-deploy-docker**: Deploy vLLM using Docker (pre-built images or build-from-source) with NVIDIA GPU support and run the OpenAI-compatible server.
- **vllm-deploy-simple**: Quick install and deploy vLLM, start serving with a simple LLM, and test the OpenAI API.
- **vllm-bench-serve**: Benchmark vLLM or OpenAI-compatible serving endpoints using `vllm bench serve`. Supports multiple datasets (random, sharegpt, sonnet, HF), backends (openai, openai-chat, vllm-pooling, embeddings), throughput/latency testing with request-rate control, and result saving. Use when benchmarking LLM serving performance, measuring TTFT/TPOT, or load testing inference APIs.
- **vllm-bench-random-synthetic**: Run a vLLM performance benchmark using synthetic random data to measure throughput, TTFT (Time to First Token), TPOT (Time per Output Token), and other key performance metrics. Use when the user wants to quickly test vLLM serving performance without downloading external datasets.
- **vllm-prefix-cache-bench**: A skill for benchmarking the efficiency of automatic prefix caching in vLLM using fixed prompts, real-world datasets, or synthetic prefix/suffix patterns. Use when the user asks to benchmark prefix caching hit rate, caching efficiency, or repeated-prompt performance in vLLM.