pytorch-quantization


Overview

Quantization converts high-precision floating-point tensors (FP32) into low-precision integer tensors (INT8). This shrinks model size (roughly 4x for weights) and speeds up inference on hardware backends with INT8 kernels, such as FBGEMM (x86) and QNNPACK (ARM).
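The lowest-effort way to see this in action is dynamic quantization, which stores weights as INT8 and quantizes activations on the fly, with no calibration data required. A minimal sketch using a toy model (the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# A toy FP32 model; Linear/LSTM-heavy models benefit most from
# dynamic quantization, since their cost is dominated by weight matmuls.
model_fp32 = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Replace Linear modules with dynamically quantized versions
# (INT8 weights, activations quantized per forward pass).
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
out = model_int8(x)
print(out.shape)  # torch.Size([1, 10])
```

The output stays FP32; only the internals of the converted modules run in INT8, so the model remains a drop-in replacement for the original.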

When to Use

Use quantization when deploying models to edge devices (mobile/IoT) or when seeking to reduce cloud inference costs by using INT8-optimized CPU instances.

Decision Tree

  1. Do you have a representative calibration dataset but no time for training?
    • USE: Post-Training Quantization (PTQ).
  2. Is accuracy drop unacceptable with PTQ?
    • USE: Quantization Aware Training (QAT).
  3. Are you running on an ARM-based mobile device?
    • SET: torch.backends.quantized.engine = 'qnnpack'.
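The PTQ branch of the decision tree can be sketched end to end with eager-mode static quantization: mark the quantized region with stubs, pick a backend engine, calibrate with representative data, then convert. The model, sizes, and random calibration data below are illustrative placeholders:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    # QuantStub/DeQuantStub mark where tensors enter and leave the
    # quantized region of the graph (required for eager-mode static PTQ).
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

# Step 3 of the decision tree: pick the kernel backend for the target
# hardware ('qnnpack' on ARM, 'fbgemm'/'x86' on x86). Fall back if the
# preferred engine is not compiled into this build.
engine = "qnnpack" if "qnnpack" in torch.backends.quantized.supported_engines else "fbgemm"
torch.backends.quantized.engine = engine

model = ToyModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig(engine)
prepared = torch.ao.quantization.prepare(model)

# Calibration: run representative batches so observers record activation
# ranges (random data stands in for a real calibration dataset here).
for _ in range(8):
    prepared(torch.randn(4, 16))

quantized = torch.ao.quantization.convert(prepared)
print(quantized(torch.randn(1, 16)).shape)  # torch.Size([1, 4])
```

If accuracy drops too far with this flow, the same stubs and qconfig carry over to QAT: swap `prepare` for `prepare_qat`, fine-tune briefly, then `convert` as above.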

Workflows
