Abstract:
In this paper, we investigate the feasibility of training low-complexity convolutional neural networks directly from waveforms. While the vast majority of proposed approaches perform fixed feature extraction based on time-frequency representations such as spectrograms, we propose to fully exploit the information in waveforms directly while minimizing the model size. To do so, we train one-dimensional Convolutional Neural Networks (1D-CNNs) on raw, subsampled binaural audio waveforms, thus exploiting phase information within and across the two input channels. In addition, our approach relies heavily on data augmentation in the temporal domain. Finally, we apply iterative structured parameter pruning to remove the least important convolutional kernels, and perform weight quantization in half-precision floating point. We apply this approach to the TAU Urban Acoustic Scenes 2020 3Class dataset with two network architectures: a 1D-CNN based on VGG-like blocks, and a ResNet architecture with 1D convolutions, and compare our results with the baseline model from the DCASE 2020 challenge, task 1 subtask B. We report four models that constitute our submission to this challenge. Our results show that we can train, prune, and quantize a small VGG model to make it 20 times smaller than the 500 KB challenge limit while maintaining baseline-level accuracy (87.6%), as well as a larger model achieving 91% accuracy while being 8 times smaller than the challenge limit. ResNets could likewise be trained, pruned, and quantized to fit below the 500 KB limit, achieving up to 91.2% accuracy. We also report the stability of these results with respect to data augmentation and monaural versus binaural inputs.