high-performance-inference


High-Performance Inference

Optimize LLM inference for production with vLLM 0.14.x, quantization, and speculative decoding.

vLLM 0.14.0 (Jan 2026): PyTorch 2.9.0, CUDA 12.9, AttentionConfig API, Python 3.12+ recommended.
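
Quantization and speculative decoding are both driven through vllm serve flags. A minimal sketch, assuming the flag shapes of recent vLLM releases; the model names and JSON values below are illustrative, not prescriptive:

# Quantized serving from a pre-quantized AWQ checkpoint (model name is illustrative)
vllm serve TheBloke/Llama-2-7B-Chat-AWQ --quantization awq

# Speculative decoding via ngram prompt lookup (token counts are illustrative)
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --speculative-config '{"method": "ngram", "num_speculative_tokens": 5, "prompt_lookup_max": 4}'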

Overview

Use this skill when:

  • Deploying LLMs with low latency requirements
  • Reducing GPU memory for larger models
  • Maximizing throughput for batch inference
  • Edge/mobile deployment with constrained resources
  • Cost optimization through efficient hardware utilization

Quick Reference

# Basic vLLM server (continuation flags are typical values, not from the original)
vllm serve meta-llama/Meta-Llama-3.1-70B-Instruct \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.90
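
Once running, the server exposes an OpenAI-compatible API on port 8000 by default. A quick smoke test with curl, assuming the server command above:

# Query the OpenAI-compatible chat endpoint
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "meta-llama/Meta-Llama-3.1-70B-Instruct",
          "messages": [{"role": "user", "content": "Hello"}],
          "max_tokens": 64
        }'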