Abstract:
Sound events are crucial for discerning a specific acoustic scene, which establishes a close relationship between audio tagging and acoustic scene classification (ASC). In this study, we explore the role and application of sound events in the ASC task and propose using the last hidden layer's output of an audio tagging system (the tag representation), rather than the final output layer's predictions (the tag vector), for ASC. We hypothesize that the tag representation contains sound event information that can improve the classification accuracy of acoustic scenes. A dual attention mechanism is investigated to emphasize the frequency-time and channel dimensions of the feature map of an ASC system using the tag representation. Experiments are conducted on the Detection and Classification of Acoustic Scenes and Events (DCASE) 2020 Task 1-A dataset. The proposed system achieves an overall classification accuracy of 69.3%, compared with 65.3% for the baseline.
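To make the described mechanism concrete, the following is a minimal sketch, not the authors' exact architecture, of dual attention conditioned on a tag representation: a channel gate derived from the tag representation followed by a frequency-time gate over the feature map. All dimensions, layer choices, and the module name are illustrative assumptions.

```python
# Hypothetical sketch of tag-conditioned dual attention (assumed shapes and layers).
import torch
import torch.nn as nn

class TagConditionedDualAttention(nn.Module):
    def __init__(self, channels: int, tag_dim: int):
        super().__init__()
        # Channel attention: map the tag representation to a per-channel gate.
        self.channel_gate = nn.Sequential(
            nn.Linear(tag_dim, channels),
            nn.Sigmoid(),
        )
        # Frequency-time attention: 1x1 conv produces a gate per (freq, time) bin.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, feat: torch.Tensor, tag_repr: torch.Tensor) -> torch.Tensor:
        # feat: (batch, channels, freq, time); tag_repr: (batch, tag_dim)
        ch = self.channel_gate(tag_repr).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        feat = feat * ch                   # emphasize channel dimension
        sp = self.spatial_gate(feat)       # (B, 1, F, T)
        return feat * sp                   # emphasize frequency-time dimensions

# Usage with assumed shapes: 64-channel feature map, 527-dim tag representation
# (e.g., the last hidden layer of an audio tagger; both values are assumptions).
feat = torch.randn(4, 64, 128, 100)       # (batch, channels, mel bins, frames)
tag_repr = torch.randn(4, 527)
attn = TagConditionedDualAttention(channels=64, tag_dim=527)
out = attn(feat, tag_repr)                # same shape as feat
```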