dspy-vllm


vLLM — High-Throughput Production Serving for DSPy

Guide the user through serving self-hosted models with vLLM for production DSPy deployments: high concurrency, multi-GPU serving, and an OpenAI-compatible API.

Step 1: Understand the setup

Before generating a vLLM configuration, clarify the following (a serving sketch based on these answers follows the list):

  1. What GPU hardware? — Model (A100, H100, RTX 4090), count, and VRAM per GPU. This determines tensor parallelism and quantization needs.
  2. Which model? — Model name and size (7B, 13B, 70B). Determines VRAM requirements and whether quantization is needed.
  3. Workload type? — Production serving (concurrent users), batch processing (offline), or optimization (running MIPROv2/BootstrapFewShot)?
  4. Already using Ollama locally? — If yes, help them add vLLM for production while keeping Ollama for dev.
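
The answers above usually reduce to a server launch plus a DSPy client pointed at it. A minimal sketch, assuming 2 GPUs, meta-llama/Llama-3.1-8B-Instruct as a placeholder model, and vLLM's default port 8000; adjust the flags to the hardware and model actually in use:

```python
# Launch the OpenAI-compatible vLLM server first (shell command shown as a
# comment; the flags below are illustrative, not required):
#
#   vllm serve meta-llama/Llama-3.1-8B-Instruct \
#       --tensor-parallel-size 2 \
#       --gpu-memory-utilization 0.90 \
#       --max-model-len 8192
#
# --tensor-parallel-size shards the model across GPUs (set it to the GPU
# count), --gpu-memory-utilization leaves headroom for activations, and
# --max-model-len caps the context length so the KV cache fits in VRAM.

import dspy

# Point DSPy at the vLLM server through its OpenAI-compatible endpoint.
# The "openai/" prefix routes the call through LiteLLM; the model name must
# match what vLLM is serving. The API key is a dummy value unless the server
# was started with --api-key.
lm = dspy.LM(
    "openai/meta-llama/Llama-3.1-8B-Instruct",
    api_base="http://localhost:8000/v1",
    api_key="EMPTY",
)
dspy.configure(lm=lm)

# Any DSPy program now runs against the self-hosted server.
qa = dspy.Predict("question -> answer")
print(qa(question="What does tensor parallelism do?").answer)
```

The same client also covers optimization runs: parallel evaluation in DSPy simply produces concurrent requests, which is the load pattern vLLM's continuous batching is built for.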

What is vLLM?

vLLM is a high-throughput inference engine for LLMs (74k+ GitHub stars). Key features (a usage sketch follows the list):

  • PagedAttention — roughly 4x better KV-cache memory efficiency than naive contiguous allocation, so more concurrent users fit on the same GPUs
  • Continuous batching — requests are processed as they arrive; no waiting for a batch to fill
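
For offline batch workloads (the batch-processing option in Step 1), vLLM can also be driven directly from Python without running a server. A minimal sketch, again with a placeholder model name and a single GPU assumed:

```python
from vllm import LLM, SamplingParams

# Load the model once; the engine applies PagedAttention and continuous
# batching internally, so a long list of prompts is scheduled efficiently.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder, not prescriptive
    tensor_parallel_size=1,                    # match the GPU count
    gpu_memory_utilization=0.90,
)

params = SamplingParams(temperature=0.0, max_tokens=128)
prompts = [f"Summarize document {i} in one sentence." for i in range(1000)]

# generate() runs all prompts through the continuous-batching scheduler and
# returns one RequestOutput per prompt, in input order.
for output in llm.generate(prompts, params):
    print(output.outputs[0].text)
```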