pytorch-quantization


Overview

Quantization converts high-precision floating-point tensors (FP32) into low-precision integer tensors (INT8). This shrinks model size (roughly 4x for weights) and speeds up inference on hardware backends with INT8 kernels, such as FBGEMM (x86) and QNNPACK (ARM).
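The lowest-effort way to see this in action is dynamic quantization, which stores weights as INT8 and quantizes activations on the fly, with no calibration data required. A minimal sketch using a toy model (the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# A toy FP32 model; Linear/LSTM-heavy models benefit most from
# dynamic quantization, since their cost is dominated by weight matmuls.
model_fp32 = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Replace Linear modules with dynamically quantized versions
# (INT8 weights, activations quantized per forward pass).
model_int8 = torch.ao.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
out = model_int8(x)
print(out.shape)  # torch.Size([1, 10])
```

The output stays FP32; only the internals of the converted modules run in INT8, so the model remains a drop-in replacement for the original.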

When to Use

Use quantization when deploying models to edge devices (mobile/IoT) or when seeking to reduce cloud inference costs by using INT8-optimized CPU instances.

Decision Tree

  1. Do you have a representative calibration dataset but no time for training?
    • USE: Post-Training Quantization (PTQ).
  2. Is accuracy drop unacceptable with PTQ?
    • USE: Quantization Aware Training (QAT).
  3. Are you running on an ARM-based mobile device?
    • SET: torch.backends.quantized.engine = 'qnnpack'.
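The PTQ branch of the decision tree can be sketched end to end with eager-mode static quantization: mark the quantized region with stubs, pick a backend engine, calibrate with representative data, then convert. The model, sizes, and random calibration data below are illustrative placeholders:

```python
import torch
import torch.nn as nn

class ToyModel(nn.Module):
    # QuantStub/DeQuantStub mark where tensors enter and leave the
    # quantized region of the graph (required for eager-mode static PTQ).
    def __init__(self):
        super().__init__()
        self.quant = torch.ao.quantization.QuantStub()
        self.fc = nn.Linear(16, 4)
        self.dequant = torch.ao.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

# Step 3 of the decision tree: pick the kernel backend for the target
# hardware ('qnnpack' on ARM, 'fbgemm'/'x86' on x86). Fall back if the
# preferred engine is not compiled into this build.
engine = "qnnpack" if "qnnpack" in torch.backends.quantized.supported_engines else "fbgemm"
torch.backends.quantized.engine = engine

model = ToyModel().eval()
model.qconfig = torch.ao.quantization.get_default_qconfig(engine)
prepared = torch.ao.quantization.prepare(model)

# Calibration: run representative batches so observers record activation
# ranges (random data stands in for a real calibration dataset here).
for _ in range(8):
    prepared(torch.randn(4, 16))

quantized = torch.ao.quantization.convert(prepared)
print(quantized(torch.randn(1, 16)).shape)  # torch.Size([1, 4])
```

If accuracy drops too far with this flow, the same stubs and qconfig carry over to QAT: swap `prepare` for `prepare_qat`, fine-tune briefly, then `convert` as above.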

Workflows
