# pytorch-quantization
## Overview

Quantization converts high-precision floating-point tensors (FP32) into low-precision integers (INT8). This typically shrinks model size by roughly 4x and speeds up inference on supported hardware backends such as FBGEMM (x86) and QNNPACK (ARM).
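For orientation, here is a minimal sketch using dynamic quantization, the lowest-effort entry point; the model and layer sizes are arbitrary placeholders, not a recommended architecture:

```python
import torch
import torch.ao.quantization
import torch.nn as nn

# Toy FP32 model; sizes are arbitrary placeholders.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))

# Dynamic quantization: weights of the listed layer types are stored
# as INT8; activations are quantized on the fly at inference time.
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(quantized(x).shape)  # same interface as the FP32 model
```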
## When to Use

Use quantization when deploying models to edge devices (mobile/IoT), or to cut cloud inference costs by serving on INT8-optimized CPU instances.
## Decision Tree
- Do you have a representative calibration dataset but no time for training?
  - USE: Post-Training Quantization (PTQ); see the PTQ sketch below.
- Is the accuracy drop from PTQ unacceptable?
  - USE: Quantization-Aware Training (QAT); see the QAT sketch below.
- Are you running on an ARM-based mobile device?
  - SET: `torch.backends.quantized.engine = 'qnnpack'`; see the engine-selection snippet below.
- Are you running on an x86 server CPU?
  - SET: `torch.backends.quantized.engine = 'fbgemm'`
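A minimal sketch of eager-mode PTQ, assuming a hypothetical toy model `M` and random tensors standing in for a real calibration set:

```python
import torch
import torch.ao.quantization
import torch.nn as nn

class M(nn.Module):
    """Toy model; QuantStub/DeQuantStub mark the INT8 region."""
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(128, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = M().eval()  # PTQ operates on a trained model in eval mode
model.qconfig = torch.ao.quantization.get_default_qconfig('fbgemm')
prepared = torch.ao.quantization.prepare(model)  # inserts observers

# Calibration pass: run representative data through the model so the
# observers can record activation ranges (random data as a stand-in).
for _ in range(10):
    prepared(torch.randn(8, 128))

quantized = torch.ao.quantization.convert(prepared)  # swap in INT8 ops
```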
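If PTQ loses too much accuracy, QAT fine-tunes the model with fake-quantization in the loop. A sketch under the same toy-model assumption, with a single optimizer step standing in for real fine-tuning:

```python
import torch
import torch.ao.quantization
import torch.nn as nn

class M(nn.Module):
    """Toy model; QuantStub/DeQuantStub mark the INT8 region."""
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(128, 10)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = M().train()  # QAT requires train mode
model.qconfig = torch.ao.quantization.get_default_qat_qconfig('fbgemm')
prepared = torch.ao.quantization.prepare_qat(model)  # adds fake-quant modules

# Fine-tune so weights adapt to quantization error (one toy step here).
opt = torch.optim.SGD(prepared.parameters(), lr=1e-3)
loss = prepared(torch.randn(8, 128)).sum()
loss.backward()
opt.step()

quantized = torch.ao.quantization.convert(prepared.eval())
```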
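Selecting the backend engine is a one-line assignment. A small sketch that checks `supported_engines` first, since assigning an engine the build lacks raises an error:

```python
import torch

# Engines compiled into this build, e.g. ['none', 'fbgemm', ...]
print(torch.backends.quantized.supported_engines)

# Guard the selection so an unsupported engine does not raise.
if 'qnnpack' in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = 'qnnpack'  # ARM / mobile
else:
    torch.backends.quantized.engine = 'fbgemm'   # x86 server
```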