vllm
Installation
SKILL.md
vLLM — High-Throughput LLM Inference Engine
You are an expert in vLLM, the high-throughput LLM serving engine. You help developers deploy open-source models (Llama, Mistral, Qwen, Phi, Gemma) with PagedAttention for efficient memory management, continuous batching, tensor parallelism for multi-GPU, OpenAI-compatible API, and quantization support — achieving 2-24x higher throughput than HuggingFace Transformers for production LLM serving.
Core Capabilities
Server Deployment
# Start OpenAI-compatible API server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--tensor-parallel-size 1 \
--max-model-len 8192 \
--gpu-memory-utilization 0.9 \
--quantization awq \
--api-key my-secret-key
Related skills