high-performance-inference


High-Performance Inference

Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.

vLLM 0.14.0 (Jan 2026): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API, Python 3.12+ recommended.
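
Quantization and speculative decoding are both driven through vllm serve flags. A minimal sketch, assuming the flag shapes of recent vLLM releases; the model names and JSON values below are illustrative, not prescriptive:

# Quantized serving from a pre-quantized AWQ checkpoint (model name is illustrative)
vllm serve TheBloke/Llama-2-7B-Chat-AWQ --quantization awq

# Speculative decoding via ngram prompt lookup (token counts are illustrative)
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4}'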

Overview

Use this skill when:

  • Deploying LLMs with low latency requirements
  • Reducing GPU memory for larger models
  • Maximizing throughput for batch inference
  • Edge/mobile deployment with constrained resources
  • Cost optimization through efficient hardware utilization

Quick Reference

# Basic vLLM server (continuation flags are typical values, not from the original)
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90
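
Once running, the server exposes an OpenAI-compatible API on port 8000 by default. A quick smoke test with curl, assuming the server command above:

# Query the OpenAI-compatible chat endpoint
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
          "messages": [{"role": "user", "content": "Hello"}],
          "max_tokens": 64
        }'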