A CRNN-GRU based reinforcement learning approach to audio captioning

02/11/2020

Xuenan Xu, Heinrich Dinkel, Mengyue Wu, Kai Yu

Abstract: Audio captioning aims to generate a natural-language sentence that describes the content of an audio clip. This paper proposes a CRNN encoder combined with a GRU decoder to tackle this multi-modal task. In addition to standard cross-entropy training, reinforcement learning is investigated as a way to generate richer and more accurate captions. Our approach significantly improves on the baseline model across all reported metrics, achieving a relative improvement of at least 34%. Our proposed CRNN-GRU model with reinforcement learning achieves a SPIDEr of 0.190 on the Clotho evaluation set. With data augmentation, performance is further boosted to 0.223. In DCASE challenge Task 6, we ranked fourth on SPIDEr and second on five metrics, including BLEU, ROUGE-L and METEOR, without ensembling or data augmentation, while maintaining a small model size (only 5 million parameters).
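
As a rough illustration of the reinforcement learning step mentioned in the abstract, the sketch below shows a generic REINFORCE-style loss with a baseline for a single sampled caption. The abstract does not specify the exact algorithm, reward, or baseline, so the function names, the choice of baseline, and all numeric values here are assumptions for illustration only, not the authors' implementation. The `spider` helper follows the standard definition of SPIDEr as the mean of CIDEr and SPICE.

```python
import math


def spider(cider: float, spice: float) -> float:
    """SPIDEr: arithmetic mean of the CIDEr and SPICE scores."""
    return 0.5 * (cider + spice)


def policy_gradient_loss(token_log_probs, reward, baseline):
    """REINFORCE-style loss for one sampled caption (illustrative sketch).

    token_log_probs: log-probabilities the decoder assigned to each sampled token.
    reward: sentence-level score of the sampled caption (assumed: a caption metric).
    baseline: score subtracted to reduce variance (assumed: e.g. the score of a
        greedily decoded caption for the same clip).
    """
    advantage = reward - baseline
    # Minimizing this loss raises the likelihood of captions that beat the
    # baseline and lowers the likelihood of captions that fall below it.
    return -advantage * sum(token_log_probs)


# Hypothetical values: a two-token caption sampled with probabilities 0.5 and 0.25.
log_probs = [math.log(0.5), math.log(0.25)]
loss = policy_gradient_loss(log_probs, reward=0.6, baseline=0.4)
```

In a real training loop the log-probabilities would come from the GRU decoder and the loss would be backpropagated through an automatic differentiation framework; the pure-Python version above only shows how the reward signal weights the caption likelihood.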
