TensorRT-LLM
NVIDIA's open-source library for optimizing LLM inference with state-of-the-art performance on NVIDIA GPUs.
When to use TensorRT-LLM
Use TensorRT-LLM when:
- You are deploying on NVIDIA GPUs (A100, H100, GB200)
- You need maximum throughput (24,000+ tokens/sec on Llama 3)
- You require low latency for real-time applications
- You are working with quantized models (FP8, INT4, FP4)
- You are scaling across multiple GPUs or nodes
Use vLLM instead when:
- You need a simpler setup and a Python-first API
- You want PagedAttention without TensorRT compilation
- You are working with AMD GPUs or other non-NVIDIA hardware
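
The two checklists above can be read as a small decision rule. The sketch below is only an illustration of that reading: the `Deployment` class, its field names, and `choose_engine` are hypothetical helpers invented for this example and are not part of TensorRT-LLM or vLLM. It assumes that non-NVIDIA hardware or a preference for a simple, compilation-free setup should point to vLLM, and that everything else on NVIDIA GPUs points to TensorRT-LLM.

```python
from dataclasses import dataclass


@dataclass
class Deployment:
    """Hypothetical description of a deployment (not a real library type)."""
    gpu_vendor: str                    # e.g. "nvidia" or "amd"
    wants_simple_setup: bool = False   # prefers a Python-first API with no engine build step


def choose_engine(d: Deployment) -> str:
    """Return "tensorrt-llm" or "vllm" following the checklists above."""
    # TensorRT-LLM targets NVIDIA GPUs, so any other hardware falls to vLLM.
    if d.gpu_vendor.strip().lower() != "nvidia":
        return "vllm"
    # On NVIDIA hardware, a preference for simpler setup still favors vLLM.
    if d.wants_simple_setup:
        return "vllm"
    # Otherwise the throughput, latency, quantization, and multi-GPU
    # criteria all point to TensorRT-LLM.
    return "tensorrt-llm"


if __name__ == "__main__":
    print(choose_engine(Deployment(gpu_vendor="nvidia")))
    print(choose_engine(Deployment(gpu_vendor="amd")))
```

In practice the choice is rarely this binary. The function only makes the ordering of the criteria explicit: the hardware constraint comes first, then the setup preference.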