GGUF - Quantization Format for llama.cpp
GGUF (GPT-Generated Unified Format) is the standard file format for llama.cpp. It enables efficient inference on CPUs, Apple Silicon, and GPUs, with flexible quantization options.
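As a quick way to check that a file really is GGUF, the fixed-size header can be read directly. The sketch below assumes the little-endian layout used by GGUF version 2 and later: a 4-byte `GGUF` magic, a `uint32` version, then `uint64` tensor and metadata key-value counts. The function name `read_gguf_header` is hypothetical and not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"
HEADER_SIZE = 24  # 4-byte magic + uint32 version + two uint64 counts


def read_gguf_header(path):
    """Return (version, tensor_count, metadata_kv_count) for a GGUF file.

    Assumes a little-endian file in the version 2+ header layout.
    """
    with open(path, "rb") as f:
        raw = f.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE or raw[:4] != GGUF_MAGIC:
        raise ValueError(f"{path} does not look like a GGUF file")
    # "<IQQ": little-endian uint32 followed by two uint64 values, no padding
    version, tensor_count, kv_count = struct.unpack("<IQQ", raw[4:HEADER_SIZE])
    return version, tensor_count, kv_count
```

Calling `read_gguf_header("model.gguf")` on a downloaded model (the file name here is a placeholder) is a cheap sanity check before loading it into llama.cpp or a tool built on it.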
When to use GGUF
Use GGUF when:
- You are deploying on consumer hardware (laptops, desktops)
- You are running on Apple Silicon (M1/M2/M3) with Metal acceleration
- You need CPU inference with no GPU requirement
- You want flexible quantization (Q2_K to Q8_0)
- You are using local AI tools (LM Studio, Ollama, text-generation-webui)
Key advantages:
- Universal hardware: CPU, Apple Silicon, NVIDIA, AMD support
- No Python runtime: Pure C/C++ inference
- Flexible quantization: 2-8 bit with various methods (K-quants)
- Ecosystem support: LM Studio, Ollama, koboldcpp, and more
- imatrix: Importance matrix for better low-bit quality
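To make the 2-8 bit range concrete, the sketch below estimates the effective bits per weight and the resulting file size for a few quantization types. The block layouts (bytes per block, weights per block) are assumptions taken from the ggml block structs as commonly documented, and the helper names are hypothetical. Real GGUF files mix tensor types and carry metadata, so treat the numbers as rough estimates only.

```python
# (bytes per block, weights per block); assumed layouts, not read from llama.cpp
BLOCK_LAYOUTS = {
    "Q8_0": (34, 32),   # 32 int8 weights + one fp16 scale
    "Q4_0": (18, 32),   # 32 4-bit weights (16 bytes) + one fp16 scale
    "Q2_K": (84, 256),  # 256 2-bit weights + per-sub-block scales + two fp16 values
}


def bits_per_weight(quant):
    """Effective storage cost per weight, including per-block scale overhead."""
    block_bytes, block_weights = BLOCK_LAYOUTS[quant]
    return block_bytes * 8 / block_weights


def estimate_size_gib(n_params, quant):
    """Rough tensor-data size in GiB if every weight used the same quant type."""
    return n_params * bits_per_weight(quant) / 8 / 2**30


if __name__ == "__main__":
    n_params = 7_000_000_000  # hypothetical 7B-parameter model
    for quant in BLOCK_LAYOUTS:
        print(f"{quant}: {bits_per_weight(quant):.3f} bpw, "
              f"~{estimate_size_gib(n_params, quant):.1f} GiB")
```

The per-block scale is why Q8_0 costs slightly more than 8 bits per weight and Q4_0 slightly more than 4.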