Quantization for NN Inference

date: Feb 7, 2025
slug: quantization-nn-inference
status: Published
tags: MLSys
summary:
type: Post

Motivation

In digital systems, real values are typically represented using a finite number of bits, mapping the infinite set $\mathbb{R}$ to a discrete subset of $\mathbb{R}$. This process, known as quantization, is widely used to approximate real-valued data with limited precision.
Formally, quantization can be defined as a mapping $Q: \mathbb{R} \to C$, where $C$ is a finite discrete set. Traditional quantization methods aim to minimize quantization error, either in terms of the reconstruction error $\|x - \tilde{x}\|$, where $\tilde{x}$ is the value recovered from $Q(x)$, or the forward error $\|f(x) - f(\tilde{x})\|$ for a downstream function $f$. However, in the context of neural network inference, the primary goal of quantization is different: rather than minimizing numerical error, it focuses on maintaining task performance (e.g., classification accuracy) while reducing memory and computation costs.
Neural network inference is highly memory- and computation-intensive, making deployment in resource-constrained environments such as edge devices challenging. Quantization, by mapping high-precision parameters to lower-precision representations, enables efficient inference with reduced energy and storage requirements. Unlike traditional quantization techniques, neural network quantization introduces new design choices, as it allows trading off numerical precision for improved efficiency while maintaining model performance. This flexibility has led to a variety of quantization methods tailored to deep learning applications.

Basic Concepts

Uniform Quantization (a.k.a. Linear Quantization)

The most intuitive form of quantization is uniform quantization, which maps a range of real numbers uniformly to a set of integers.
Formally, uniform means that a real value $r$ is mapped to a quantized value $Q(r)$, and both the quantized values (called levels) and the real-valued intervals that map onto them (called steps) are uniformly spaced.
We can write the uniform quantization operator as:
$$Q(r) = \mathrm{Int}\!\left(\frac{r}{S}\right) - Z$$
where $S$ is a scaling factor and $Z$ is a zero point.
Given the clipping range of real values $[\alpha, \beta]$ and a bit width $b$, we can calculate $S = \frac{\beta - \alpha}{2^b - 1}$ and recover the dequantized value $\tilde{r} = S\,(Q(r) + Z)$ by inverting the previous formula.
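To make the operator concrete, here is a minimal NumPy sketch of the quantize/dequantize pair under the assumption of unsigned 8-bit targets; the function names and the toy tensor are illustrative, not a reference implementation.

```python
import numpy as np

def quantize(r, S, Z, bits=8):
    """Map real values r to b-bit integers using scale S and zero point Z."""
    q = np.round(r / S) - Z
    # clamp to the representable integer range (unsigned targets are an assumption here)
    return np.clip(q, 0, 2**bits - 1).astype(np.int32)

def dequantize(q, S, Z):
    """Recover an approximation of r from its quantized representation."""
    return S * (q.astype(np.float32) + Z)

# toy example: 8-bit quantization of a small tensor
r = np.array([-1.0, -0.25, 0.0, 0.5, 1.5], dtype=np.float32)
alpha, beta = float(r.min()), float(r.max())   # clipping range [alpha, beta]
S = (beta - alpha) / (2**8 - 1)                # scaling factor
Z = int(np.round(alpha / S))                   # zero point, chosen so that alpha maps to 0
r_hat = dequantize(quantize(r, S, Z), S, Z)    # r_hat ≈ r up to quantization error
```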
Uniform quantization can be symmetric or asymmetric, depending on whether the clipping range is symmetric around zero. The symmetric method makes implementation more straightforward since it eliminates the zero point ($Z = 0$). However, the asymmetric method is useful when the distribution of real values is highly skewed.
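The contrast between the two choices is easiest to see in how the calibration parameters are computed. Below is a hedged sketch (signed targets for the symmetric case are an assumption):

```python
import numpy as np

def calibrate_asymmetric(x, bits=8):
    """Asymmetric: use the full [min, max] range; a zero point Z is required."""
    alpha, beta = float(x.min()), float(x.max())
    S = (beta - alpha) / (2**bits - 1)
    Z = int(np.round(alpha / S))
    return S, Z

def calibrate_symmetric(x, bits=8):
    """Symmetric: clip to [-max|x|, max|x|]; the zero point is fixed at 0."""
    S = float(np.abs(x).max()) / (2**(bits - 1) - 1)   # signed targets, e.g. [-127, 127]
    return S, 0
```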

Range Calibration Algorithms: Static vs. Dynamic Quantization

The process of determining the clipping range is called calibration. Depending on when calibration happens, quantization methods can be categorized as static or dynamic. For weights, the clipping range is usually calculated statically, since weights do not change during inference.
In static quantization, the range is pre-calculated and stays fixed during inference. This method does not incur any computational overhead at inference time, but it can suffer higher accuracy loss compared to dynamic quantization. One popular and simple method is to run a series of calibration inputs through the network to determine the clipping range for activations. Objectives such as minimizing the MSE or the cross-entropy between the unquantized and quantized values are used during calibration. Another approach is to learn the clipping range during NN training.
In dynamic quantization, the clipping range is calculated on the fly for each input, which typically yields higher accuracy. However, computing input statistics at inference time adds considerable overhead.
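The sketch below contrasts the two approaches for activations: an MSE-based grid search over symmetric clipping values as one possible static calibration objective, versus a per-input range computed at inference time. The grid-search setup and helper names are assumptions for illustration.

```python
import numpy as np

def mse_calibrate(samples, bits=8, n_grid=100):
    """Static calibration: pick the symmetric clipping value that minimizes the MSE
    between original and fake-quantized activations over a set of calibration inputs."""
    x = np.concatenate([s.ravel() for s in samples])
    best_c, best_err = None, np.inf
    for c in np.linspace(np.abs(x).max() / n_grid, np.abs(x).max(), n_grid):
        S = c / (2**(bits - 1) - 1)
        x_q = np.clip(np.round(x / S), -(2**(bits - 1) - 1), 2**(bits - 1) - 1) * S
        err = np.mean((x - x_q) ** 2)
        if err < best_err:
            best_c, best_err = c, err
    return best_c  # reused unchanged for every input at inference time

def dynamic_range(x):
    """Dynamic calibration: recompute the clipping range per input at inference time."""
    return float(np.abs(x).max())
```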

Quantization Granularity

Depending on how calibration parameters are shared across weights and activations, we can categorize quantization methods by their granularity. From coarse- to fine-grained, these are: layer-wise, group-wise, channel-wise, and sub-channel-wise. Finer-grained quantization typically brings higher accuracy but incurs higher cost.
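As a small illustration of granularity, here is a hedged sketch of layer-wise (per-tensor) versus channel-wise scale computation; the assumption that axis 0 is the output-channel axis is mine.

```python
import numpy as np

def per_tensor_scale(w, bits=8):
    """Layer-wise (per-tensor): one scale shared by the whole weight tensor."""
    return np.abs(w).max() / (2**(bits - 1) - 1)

def per_channel_scales(w, bits=8):
    """Channel-wise: one scale per output channel (axis 0 assumed to be the channel axis)."""
    return np.abs(w).reshape(w.shape[0], -1).max(axis=1) / (2**(bits - 1) - 1)

w = np.random.randn(64, 128).astype(np.float32)          # hypothetical weight matrix
print(per_tensor_scale(w), per_channel_scales(w).shape)  # one scalar vs. 64 scales
```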

Non-uniform Quantization

Some research also explores non-uniform quantization, where the quantization levels and steps are not uniformly spaced. For a fixed bit width, non-uniform quantization can achieve better accuracy, since it can better capture the underlying distribution by focusing on important regions or finding an appropriate dynamic range.
Below are some representative non-uniform quantization methods:
  • rule-based: e.g., logarithmic quantization, where the levels are spaced exponentially (powers of two) rather than linearly.
  • vector quantization-based: quantize a real-valued vector as a linear combination of binary vectors chosen to minimize reconstruction error.
  • clustering-based: group values with a clustering algorithm (e.g., k-means) and quantize each value to its cluster centroid.
Although non-uniform methods are more flexible and can achieve better accuracy than uniform methods for the same bit width, they are more difficult to deploy efficiently on hardware accelerators. As a result, uniform methods are currently used more often in real-world scenarios.
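For the rule-based bullet above, a minimal sketch of logarithmic (power-of-two) quantization is shown below; for simplicity, the sign bit is not counted in the bit budget, which is an assumption of this illustration.

```python
import numpy as np

def log2_quantize(x, bits=4):
    """Rule-based non-uniform quantization: snap magnitudes to powers of two,
    so levels are dense near zero and sparse for large values."""
    sign = np.sign(x)
    mag = np.clip(np.abs(x), 1e-8, None)                 # avoid log2(0)
    exp = np.clip(np.round(np.log2(mag)), -(2**(bits - 1)), 2**(bits - 1) - 1)
    return sign * 2.0**exp
```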

Finetuning Methods

To maintain accuracy, some form of fine-tuning or calibration is often required after quantization. The mainstream approaches are Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ).
In QAT, a pre-trained model is quantized and then finetuned on training data to adjust the parameters and recover the accuracy degradation (calibration happens during training).
During the backward pass, gradients are calculated in full precision, and the weights are quantized again to integers after each weight update.
One challenge in QAT is how to approximate the quantizer in the backward pass, since it is not differentiable. A simple yet effective method, known as the straight-through estimator (STE), approximates its gradient with the identity function, which works well in most settings.
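A minimal PyTorch-style sketch of fake quantization with the straight-through estimator follows; the detach trick is one common way to express STE, and the symmetric signed range is an assumption.

```python
import torch

def fake_quant_ste(w, S, bits=8):
    """Fake-quantize weights in the forward pass; the straight-through estimator
    lets gradients pass through the non-differentiable round() as if it were identity."""
    qmin, qmax = -(2**(bits - 1) - 1), 2**(bits - 1) - 1
    w_q = torch.clamp(torch.round(w / S), qmin, qmax) * S
    # forward value: w_q ; backward gradient: flows to w unchanged (identity approximation)
    return w + (w_q - w).detach()
```

During training, the optimizer keeps updating the latent full-precision weights, and the fake-quantized values are what the forward pass actually uses.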
In PTQ, a pre-trained model is calibrated using calibration data (e.g., a small subset of training data) to compute the clipping ranges and the scaling factors. Then, the model is quantized based on the calibration result.
PTQ demands fewer resources since it requires little or no training, but it typically achieves lower accuracy than QAT.

Advanced Concepts

Simulated vs. Integer-only Quantization

There are two main approaches to deploying quantized NN models: simulated (a.k.a. fake) quantization and integer-only quantization.
In simulated quantization, the weights are stored in quantized form but must be dequantized to a floating-point representation before computation, which is then carried out in floating-point arithmetic; the model therefore cannot benefit from low-precision arithmetic during computation.
In integer-only quantization, computation uses integer arithmetic directly, without dequantizing the operands first, and thus benefits from low-precision logic (e.g., low-precision arithmetic is significantly faster on NVIDIA GPUs).
However, when the task is not compute-bound, simulated quantization is not a bad choice, since most of the savings come from reduced memory traffic rather than faster arithmetic.
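To illustrate the integer-only path, here is a hedged NumPy sketch of a linear layer with int8 operands, int32 accumulation, and requantization of the output; real kernels replace the floating-point rescale with a fixed-point multiply, and symmetric quantization of inputs and weights (zero points of 0) is assumed.

```python
import numpy as np

def int_only_linear(x_q, w_q, S_x, S_w, S_y, Z_y):
    """Integer-only linear layer: int8 inputs/weights, int32 accumulation,
    then requantization of the result back to int8."""
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)   # int32 accumulator
    scale = (S_x * S_w) / S_y                           # combined requantization scale
    y_q = np.round(acc * scale) + Z_y                   # fixed-point multiply in real kernels
    return np.clip(y_q, -128, 127).astype(np.int8)
```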

Mixed-precision Quantization

We can get speedups from lower-precision representations, but using a single configuration for all layers can significantly degrade accuracy. Mixed-precision quantization accounts for this by using different quantization settings for different layers according to their “sensitivity”: sensitive layers are kept at higher bit widths, while insensitive layers are quantized more aggressively. In this way we can maintain accuracy while reducing cost.
In this sense, mixed-precision quantization can be viewed as a search problem. One challenge is that the search space grows exponentially with the number of layers; some research has explored NAS-based and RL-based approaches to address this.
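As a toy illustration of the sensitivity-driven idea (not any published search method), the sketch below greedily upgrades the most sensitive layers to a higher bit width under a hypothetical average-bits-per-layer budget; the sensitivity values and budget are assumptions.

```python
import numpy as np

def assign_bits(sensitivities, budget_bits, choices=(4, 8)):
    """Toy mixed-precision policy: give the most sensitive layers the higher bit
    width until the average-bits-per-layer budget is exhausted."""
    bits = {i: min(choices) for i in range(len(sensitivities))}
    for i in np.argsort(sensitivities)[::-1]:   # most sensitive layers first
        trial = dict(bits)
        trial[int(i)] = max(choices)
        if np.mean(list(trial.values())) <= budget_bits:
            bits = trial
    return bits

# hypothetical per-layer sensitivities (e.g., from Hessian traces or accuracy probes)
print(assign_bits(np.array([0.9, 0.1, 0.5, 0.05]), budget_bits=6.0))
```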

Hardware-aware Quantization

The benefit of quantization is often hardware-dependent, since it depends on hardware parameters such as on-chip memory, bandwidth, etc. Some work uses an RL-based approach with a cost look-up table to determine the quantization configuration for a specific hardware target.

Distillation-assisted Quantization

Knowledge distillation is also a popular approach for compressing large models, where a small student model learns from the “soft labels” (logits) of a teacher model. Distillation can also be used to assist quantization; for example, the quantized model can learn from its unquantized counterpart.
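A common way to combine the two objectives is a weighted sum of the task loss and a softened KL term; the sketch below shows this in PyTorch, with the temperature and weighting being illustrative defaults rather than recommended values.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual task loss with a KL term that pulls the quantized student's
    softened predictions toward the full-precision teacher's."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```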

Extreme Quantization

Extreme quantization methods (e.g., binarization, ternarization) quantize real values using very few bits (1 or 2), which drastically reduces computational and storage overhead. The main challenge is maintaining accuracy. There are three branches of work addressing this challenge: minimizing quantization error, improving the loss function, and improving the training method.
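Within the error-minimization branch, one classic recipe (in the style of XNOR-Net) approximates a weight tensor by a scaled sign pattern; the sketch below is a minimal illustration of that idea, not a full training pipeline.

```python
import numpy as np

def binarize(w):
    """Approximate W by alpha * sign(W), where the per-tensor scale
    alpha = mean(|W|) minimizes the L2 reconstruction error."""
    alpha = np.abs(w).mean()
    return alpha * np.sign(w), alpha
```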

Case Study: Deep Compression

Deep compression is an example of coupling multiple compression techniques to achieve significant cost reduction while maintaining accuracy.
Deep Compression uses a three-stage pipeline to compress the model for inference:
  • pruning: NNs are often over-parameterized and heavily regularized, so small weights below a threshold can be pruned.
  • quantization: k-means-clustering-based weight sharing.
  • Huffman coding
By combining three compression techniques that do not interfere with one another, Deep Compression achieves promising results.
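To make the first two stages concrete, here is a hedged NumPy sketch of magnitude pruning followed by k-means weight sharing; the linear centroid initialization follows the spirit of the paper, but thresholds, cluster counts, and iteration counts are illustrative assumptions.

```python
import numpy as np

def prune_small_weights(w, threshold):
    """Stage 1 -- pruning: zero out weights whose magnitude is below the threshold."""
    return np.where(np.abs(w) < threshold, 0.0, w)

def kmeans_quantize(w, n_clusters=16, n_iter=20):
    """Stage 2 -- k-means weight sharing: each surviving weight keeps only the index
    of its cluster centroid (stage 3 would Huffman-code these indices)."""
    flat = w[w != 0].ravel()
    centroids = np.linspace(flat.min(), flat.max(), n_clusters)   # linear initialization
    for _ in range(n_iter):
        idx = np.argmin(np.abs(flat[:, None] - centroids[None, :]), axis=1)
        for k in range(n_clusters):
            if np.any(idx == k):
                centroids[k] = flat[idx == k].mean()
    return centroids, idx
```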

Summary

Quantization is a key technique for reducing the computational and memory cost of neural networks, enabling efficient deployment on resource-constrained hardware. Unlike traditional quantization, neural network quantization prioritizes task performance over numerical precision, requiring techniques like fine-tuning (QAT/PTQ) and distillation-assisted quantization to mitigate accuracy loss.
Key takeaways:
  • Lower-bit quantization (e.g., INT8, INT4) reduces cost but may degrade accuracy.
  • Non-uniform quantization captures data distributions better but is harder to deploy.
  • Mixed-precision and hardware-aware quantization optimize efficiency across different platforms.
  • Combining pruning, quantization, and compression (e.g., Deep Compression) maximizes cost savings while preserving accuracy.

© Lifan Sun 2023 - 2025