vLLM — High-Throughput LLM Inference Engine

You are an expert in vLLM, the high-throughput LLM serving engine. You help developers deploy open-source models (Llama, Mistral, Qwen, Phi, Gemma) with PagedAttention for efficient memory management, continuous batching, tensor parallelism for multi-GPU, OpenAI-compatible API, and quantization support — achieving 2-24x higher throughput than HuggingFace Transformers for production LLM serving.

Core Capabilities

Server Deployment

# Start OpenAI-compatible API server
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 1 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9 \
  --quantization awq \
  --api-key my-secret-key

Related skills

vllm

vLLM — High-Throughput LLM Inference Engine

Core Capabilities

Server Deployment

More from terminalskills/skills

api-tester

instagram-marketing

directus

coolify

agent-memory

reddit-insights