
Hardware-Aware Mixed-Precision Neural Networks using In-Train Quantization

Manoj Rohit Vemparala, Nael Fasfous, Lukas Frickenstein, Alexander Frickenstein, Anmol Singh, Driton Salihu, Christian Unger, Naveen Shankar Nagaraja, Walter Stechele

Keywords: Quantization, Inference, Neural Network Compression, Mixed Precision, Hardware-Aware Networks

Abstract: Fixed-point quantization is an effective method to reduce the model size and computational demand of convolutional neural networks by lowering the numerical precision of all layers to a specific bit-width. Recent work shows that assigning layer-specific bit-widths has an advantage over uniform assignment, although it requires complex post-training search techniques and many GPU hours to identify the optimal bit-width strategy. To alleviate this, we propose an in-train quantization method that directly learns the optimal bit-widths for weights and activations during gradient-based training. We incorporate hardware-awareness into the gradient-based optimization to directly improve real hardware execution metrics. We replace the discrete and non-differentiable hardware measurements with a differentiable Gaussian process regressor. This provides accurate hardware predictions as an auxiliary loss to the gradient-descent optimizer, enabling hardware-friendly in-train quantization. Our hardware-aware mixed-precision ResNet56 achieves a 1.3× improvement in execution latency compared to uniform 4-bit quantization with no degradation in accuracy. Finally, we highlight the effectiveness of the in-train quantization method in the context of adversarial training, improving the trade-off between prediction accuracy and robustness.
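
As a rough illustration of the mechanism the abstract describes, the PyTorch sketch below combines a fake-quantizer with a continuous, learnable bit-width (trained through a straight-through estimator for the rounding) with a Gaussian-process surrogate whose smooth posterior mean stands in for discrete hardware latency measurements as an auxiliary loss term. Everything here is an assumption made for illustration: the class names LearnedBitQuant and GPLatencySurrogate, the RBF kernel, the clamping range, and the toy latency numbers are hypothetical and do not reproduce the authors' implementation.

import torch
import torch.nn as nn


class LearnedBitQuant(nn.Module):
    """Fake-quantizer with a continuous, learnable bit-width (hypothetical sketch)."""

    def __init__(self, init_bits: float = 8.0):
        super().__init__()
        self.bits = nn.Parameter(torch.tensor(init_bits))  # relaxed bit-width, updated by gradient descent

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        bits = self.bits.clamp(1.0, 8.0)                    # keep the relaxation in a sane range
        levels = 2.0 ** bits - 1.0                          # number of quantization steps
        scale = x.detach().abs().max().clamp(min=1e-8)
        xn = (x / scale).clamp(-1.0, 1.0)
        v = xn * levels
        v_q = v + (torch.round(v) - v).detach()             # straight-through estimator for the rounding
        return scale * v_q / levels                         # gradients reach both x and self.bits


class GPLatencySurrogate(nn.Module):
    """Differentiable GP posterior mean mu(b) = k(b, B) alpha over measured bit-width configs."""

    def __init__(self, train_bits: torch.Tensor, train_latency: torch.Tensor,
                 lengthscale: float = 2.0, noise: float = 1e-3):
        super().__init__()
        self.lengthscale = lengthscale
        K = self._rbf(train_bits, train_bits, lengthscale) + noise * torch.eye(train_bits.shape[0])
        self.register_buffer("X", train_bits)               # measured bit-width configurations
        self.register_buffer("alpha", torch.linalg.solve(K, train_latency.unsqueeze(1)).squeeze(1))

    @staticmethod
    def _rbf(a: torch.Tensor, b: torch.Tensor, ls: float) -> torch.Tensor:
        return torch.exp(-0.5 * torch.cdist(a, b) ** 2 / ls ** 2)

    def forward(self, bits: torch.Tensor) -> torch.Tensor:
        # The posterior mean is smooth in `bits`, so it can serve as a differentiable latency proxy.
        k = self._rbf(bits.unsqueeze(0), self.X, self.lengthscale)
        return (k @ self.alpha).squeeze()


# Toy usage: three quantized layers, a surrogate fitted on made-up latency measurements,
# and a combined loss of the form "task + lambda * predicted latency".
measured_bits = torch.tensor([[8., 8., 8.], [4., 4., 4.], [2., 4., 8.], [2., 2., 2.]])
measured_lat = torch.tensor([3.0, 1.6, 1.4, 0.9])           # hypothetical latency numbers
surrogate = GPLatencySurrogate(measured_bits, measured_lat)

quantizers = nn.ModuleList([LearnedBitQuant() for _ in range(3)])
weights = [torch.randn(16, 16, requires_grad=True) for _ in range(3)]
task_loss = sum(q(w).pow(2).mean() for q, w in zip(quantizers, weights))  # placeholder task loss
latency_loss = surrogate(torch.stack([q.bits for q in quantizers]))
(task_loss + 0.1 * latency_loss).backward()                  # per-layer bit-widths receive gradients

Because the surrogate's posterior mean is differentiable in the bit-width vector, the gradient of the predicted latency can flow back into the per-layer bit-width parameters alongside the task loss, which is the core idea stated in the abstract.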

The talk and the corresponding paper were published at the BMVC 2021 virtual conference.

