High-Performance Inference

Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.

vLLM 0.14.0 (Jan 2026): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API, Python 3.12+ recommended.

Overview

Use this skill when:

  • Deploying LLMs with low latency requirements
  • Reducing GPU memory for larger models
  • Maximizing throughput for batch inference
  • Edge/mobile deployment with constrained resources
  • Cost optimization through efficient hardware utilization
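Quantization is the main lever for the "reducing GPU memory" bullet above. As a rough rule of thumb, weight memory is parameters × bits-per-weight ÷ 8; a sketch of that arithmetic (weights only — KV cache, activations, and framework overhead add more on top, so treat the result as a lower bound):

```python
def weight_memory_gb(num_params_billions: float, bits_per_weight: int) -> float:
    """Rough weight-only memory estimate in GB: params * bits / 8.

    Ignores KV cache, activations, and runtime overhead, so the
    real footprint of a serving engine will be higher.
    """
    total_bytes = num_params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9

# A 70B model: ~140 GB of weights at fp16, ~35 GB at int4.
print(weight_memory_gb(70, 16))  # 140.0
print(weight_memory_gb(70, 4))   # 35.0
```

This is why a 70B model at fp16 needs multiple 80 GB GPUs for the weights alone, while 4-bit quantization can bring the weights within reach of a single card.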

Quick Reference

```bash
# Basic vLLM server (the flags below are illustrative; tune for your hardware)
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 8192
```
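A server started with `vllm serve` exposes an OpenAI-compatible API under `/v1`. A minimal stdlib-only client sketch (the localhost URL is an assumption matching the default port; the model name must match the one being served):

```python
import json
import urllib.request


def build_chat_request(prompt: str, model: str, max_tokens: int = 256) -> dict:
    # Payload follows the OpenAI chat-completions schema that vLLM serves.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
    }


def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    # base_url assumes vLLM's default port 8000 on the local machine.
    payload = build_chat_request(prompt, "meta-llama/Meta-Llama-3.1-70B-Instruct")
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Any OpenAI-compatible SDK can be pointed at the same endpoint by overriding its base URL.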