LLMOps - Inference & Optimization - Production Skill Hub

Modern Best Practices (January 2026):

Treat inference as a systems problem: SLOs, tail latency, retries, overload, and cache strategy.
Use continuous batching / smart scheduling when serving many concurrent requests (Orca scheduling: https://www.usenix.org/conference/osdi22/presentation/yu).
Use KV-cache aware serving (PagedAttention/vLLM: https://arxiv.org/abs/2309.06180) and efficient attention kernels (FlashAttention: https://arxiv.org/abs/2205.14135).
Use speculative decoding when latency is critical and draft-model quality is acceptable (speculative decoding: https://arxiv.org/abs/2302.01318).
Quantize only after measuring the quality impact on your own eval set, and keep a rollback plan.
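To illustrate why the continuous batching practice above helps, here is a toy simulation, not a serving implementation. It assumes each request needs a known number of decode steps, ignores prefill cost and KV-cache memory limits, and counts one batch iteration as one step. The function names and the example workload are hypothetical.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps when each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until its slowest member is done.
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """Decode steps when freed slots are refilled at every iteration."""
    waiting = deque(lengths)
    running = []  # remaining tokens for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into free slots before each iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One iteration emits one token per running request;
        # requests with no tokens left release their slot.
        running = [r - 1 for r in running if r - 1 > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    workload = [8, 2, 2, 2]  # hypothetical output lengths in tokens
    print(static_batching_steps(workload, batch_size=2))
    print(continuous_batching_steps(workload, batch_size=2))
```

With one long request and three short ones at batch size 2, the static schedule takes 10 steps and the continuous one takes 8, because slots freed by short requests are refilled immediately instead of idling behind the long request.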

This skill provides production-ready operational patterns for optimizing LLM inference performance, cost, and reliability. It centralizes decision rules, optimization strategies, configuration templates, and operational checklists for inference workloads.
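As one example of such a decision rule, the quantization guidance above can be sketched as a promotion gate. This is a minimal sketch under stated assumptions: per-example eval scores are already computed with higher meaning better, and the function name, the `max_drop` tolerance, and its default value are hypothetical, not part of any library.

```python
from statistics import mean


def quantization_gate(baseline_scores, quantized_scores, max_drop=0.01):
    """Return "promote" if the quantized model stays within tolerance, else "rollback".

    baseline_scores / quantized_scores: per-example eval scores on the same
    eval set (higher is better). max_drop is an assumed absolute tolerance
    on the mean score; choose it from your own quality requirements.
    """
    if len(baseline_scores) != len(quantized_scores) or not baseline_scores:
        raise ValueError("both models must be scored on the same non-empty eval set")
    drop = mean(baseline_scores) - mean(quantized_scores)
    return "promote" if drop <= max_drop else "rollback"
```

Keeping the gate as a pure function makes it easy to run in CI before a rollout and to log the measured drop alongside the decision.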

No theory. No narrative. Only what Codex can execute.

When to Use This Skill

Codex should activate this skill whenever the user asks for:
