Abstract:
Sound event detection in real-world scenarios is a challenging task. Because of the large distribution mismatch between synthetic and real audio data, the performance of a sound event detection model trained on strongly labeled synthetic data degrades dramatically when it is applied in real environments. To tackle this issue and improve the robustness of sound event detection models, we propose a two-stage domain adaptation approach to sound event detection. The backbone convolutional recurrent neural network (CRNN), learned from strongly labeled synthetic data, is updated by weak-label supervised adaptation and frame-level adversarial domain adaptation. As a result, the parameters of the CRNN are adapted to real audio data, and the input-space distribution mismatch between synthetic and real audio data is mitigated in the feature space of the CRNN. Moreover, a context clip-level consistency regularization between the classification outputs of the CNN and the CRNN is introduced to improve the feature representation ability of the convolutional layers in the CRNN. Experiments on the DCASE 2019 task of sound event detection in domestic environments demonstrate the superiority of the proposed domain adaptation approach. Our approach achieves F1 scores of 48.3% on the validation set and 49.4% on the evaluation set, which are state-of-the-art sound event detection results for a CRNN model without data augmentation.