Abstract:
Boundary localization is a key component of most temporal action localization frameworks for untrimmed videos. Deep-learning methods have brought remarkable progress to this field thanks to large-scale annotated datasets (e.g., THUMOS14 and ActivityNet). However, labeling accurate action boundaries in such datasets is naturally ambiguous. In this paper, we propose a method to model this uncertainty. Specifically, we construct a Gaussian model that predicts the variance (uncertainty) of each boundary. The captured variance is then used to select more reliable proposals and to refine proposal boundaries by variance voting during post-processing. For most existing one- and two-stage frameworks, more accurate boundaries and more reliable proposals can be obtained without additional computation. For the one-stage decoupled single-shot temporal action detection (Decouple-SSAD) framework, we first apply an adaptive pyramid feature fusion method to fuse features of different scales and optimize its structure. We then introduce the uncertainty-based method and improve the state-of-the-art mAP@0.5 from 37.9% to 41.6% on THUMOS14. Moreover, for the two-stage proposal–proposal interaction through a graph convolutional network (P-GCN) framework, the same uncertainty method also yields significant improvements on both the THUMOS14 and ActivityNet v1.3 datasets.
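To make the variance-voting step concrete, the following minimal NumPy sketch refines the boundaries of kept proposals by a variance- and overlap-weighted average over neighbouring proposals. It is only an illustration under assumptions: the function names, the Gaussian overlap weighting, and hyper-parameters such as `iou_thr` and `sigma_t` are not the paper's exact formulation.

```python
import numpy as np

def temporal_iou(seg, segs):
    """1-D IoU between one segment [start, end] and an array of segments."""
    inter_start = np.maximum(seg[0], segs[:, 0])
    inter_end = np.minimum(seg[1], segs[:, 1])
    inter = np.clip(inter_end - inter_start, 0, None)
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def variance_voting(segments, scores, variances, iou_thr=0.5, sigma_t=0.025):
    """
    Refine proposal boundaries by letting overlapping neighbours vote,
    weighting each vote by its temporal overlap and the inverse of its
    predicted boundary variance (more certain boundaries count more).
      segments:  (N, 2) predicted [start, end] times
      scores:    (N,)   confidence scores
      variances: (N, 2) predicted variance for the start and end boundaries
    """
    order = scores.argsort()[::-1]                  # process high scores first
    segments, scores, variances = segments[order], scores[order], variances[order]
    refined = segments.copy()
    for i in range(len(segments)):
        ious = temporal_iou(segments[i], segments)
        mask = ious > iou_thr                       # neighbours (incl. the proposal itself)
        # overlap affinity / predicted variance, broadcast over the two boundaries
        w = np.exp(-(1.0 - ious[mask]) ** 2 / sigma_t)[:, None] / variances[mask]
        refined[i] = (w * segments[mask]).sum(axis=0) / w.sum(axis=0)
    return refined, scores
```

The refined segments can then replace the original boundaries of the kept proposals during post-processing (e.g., after score-based suppression), without re-running the detector.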