Abstract:
Neural network processors on edge devices must process spatiotemporal data with low latency, which requires a large number of multiply-accumulate operations (MACs). In this paper, we propose a difference-driven neural network framework for efficient processing of video and event streams. Our framework reduces MACs by learning to sparsify the ``temporal differences of synaptic signals'' (TDSS) of the proposed masked convolutional neural networks. Reducing the TDSS cuts MACs in a unified manner: it increases the quantization step size, disconnects synapses, and learns weights that respond sparsely to inputs. A novel quantizer is another key to realizing this unified optimization; its gradient, called ``macro-grad,'' guides the step size so that lowering the TDSS loss also lowers the MAC count. Experiments on a wide range of tasks and data (frames/events) show that the proposed framework reduces MACs by a factor of 32 to 240 relative to dense convolution while maintaining comparable accuracy, several times better than current state-of-the-art methods.
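To make the difference-driven idea concrete, below is a minimal sketch, not the authors' implementation: it quantizes the temporal difference of each input frame with a fixed step size `delta` (whereas the proposed framework learns the step size via macro-grad) and, by the linearity of convolution, updates the output incrementally so that zero quantized differences incur no MACs. The function `delta_conv_step` and all names are hypothetical.

```python
# Minimal sketch (assumed, not the paper's method) of difference-driven
# convolution: quantize the temporal difference of the input and skip MACs
# wherever the quantized difference is zero.
import torch
import torch.nn.functional as F

def delta_conv_step(x_t, x_state, y_prev, weight, delta=0.5):
    """One time step of difference-driven convolution.

    Quantizes x_t - x_state with step size `delta`; by linearity of
    convolution, y_prev + conv(q_diff) equals conv(x_state + q_diff),
    so the output is updated incrementally.
    """
    diff = x_t - x_state
    q_diff = torch.round(diff / delta) * delta      # uniform quantization
    zero_frac = (q_diff == 0).float().mean()        # inputs incurring no MACs
    y_t = y_prev + F.conv2d(q_diff, weight, padding=1)  # incremental update
    x_state = x_state + q_diff                      # state for the next diff
    return y_t, x_state, zero_frac

# Usage: process a short synthetic "video" with small frame-to-frame changes.
torch.manual_seed(0)
T, C, H, W = 4, 3, 16, 16
weight = torch.randn(8, C, 3, 3)
frames = torch.randn(1, C, H, W).repeat(T, 1, 1, 1)
frames += 0.1 * torch.randn_like(frames)

x_state = torch.zeros(1, C, H, W)
y = torch.zeros(1, 8, H, W)
for t in range(T):
    y, x_state, zf = delta_conv_step(frames[t:t+1], x_state, y, weight)
    print(f"t={t}: fraction of zero differences = {zf:.2f}")
```

On sparsity-aware hardware, the zero fraction printed here translates directly into saved MACs, since only nonzero entries of the quantized difference need to be multiplied with weights; a larger `delta` yields more zeros at the cost of a coarser approximation.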