ai-llm-inference
LLMOps - Inference & Optimization - Production Skill Hub
Modern Best Practices (January 2026):
- Treat inference as a systems problem: SLOs, tail latency, retries, overload, and cache strategy.
- Use continuous batching / smart scheduling when serving many concurrent requests (Orca scheduling: https://www.usenix.org/conference/osdi22/presentation/yu).
- Use KV-cache aware serving (PagedAttention/vLLM: https://arxiv.org/abs/2309.06180) and efficient attention kernels (FlashAttention: https://arxiv.org/abs/2205.14135).
- Use speculative decoding when latency is critical and the draft model's proposals are accepted often enough to offset its extra compute (speculative sampling: https://arxiv.org/abs/2302.01318).
- Quantize only with a measured quality impact and a rollback plan: validate every quantized model on your own eval set before it takes traffic.
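The quantization rule in the last bullet can be turned into an explicit promotion gate. The following is a minimal sketch, not a definitive implementation: `quantization_gate`, `GateResult`, and the `max_drop` threshold are hypothetical names and values introduced here for illustration, and the sketch assumes you already have per-example scores for the baseline and quantized models on the same eval set.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Sequence


@dataclass(frozen=True)
class GateResult:
    baseline_mean: float
    quantized_mean: float
    drop: float      # baseline_mean - quantized_mean (positive means quality got worse)
    promote: bool    # False means keep serving (or roll back to) the baseline


def quantization_gate(
    baseline_scores: Sequence[float],
    quantized_scores: Sequence[float],
    max_drop: float = 0.01,  # hypothetical threshold: allowed absolute drop in mean score
) -> GateResult:
    """Compare per-example eval scores and decide whether to promote the quantized model."""
    if not baseline_scores or len(baseline_scores) != len(quantized_scores):
        raise ValueError("score lists must be non-empty and the same length")
    base = mean(baseline_scores)
    quant = mean(quantized_scores)
    drop = base - quant
    return GateResult(base, quant, drop, promote=drop <= max_drop)


if __name__ == "__main__":
    # Illustrative scores only (1.0 = example passed, 0.0 = example failed).
    result = quantization_gate([1.0, 1.0, 0.0, 1.0], [1.0, 0.0, 0.0, 1.0], max_drop=0.05)
    print(result)
```

Returning the measured drop alongside the decision keeps the "measured quality impact" in the deployment record, and a `promote=False` result is the trigger for the rollback path. A mean-score comparison is the simplest possible check; a real gate would likely also look at per-task breakdowns and sample size.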
This skill provides production-ready operational patterns for optimizing LLM inference performance, cost, and reliability. It centralizes decision rules, optimization strategies, configuration templates, and operational checklists for inference workloads.
No theory. No narrative. Only what Codex can execute.
When to Use This Skill
Codex should activate this skill whenever the user asks for: