Abstract:
Detecting objects in video is difficult because occlusions and motion blur easily degrade the extracted features. Recent state-of-the-art methods enhance the features of a key frame with those of reference frames using attention modules. However, this feature enhancement operates on features extracted from a fixed backbone, and it is fundamentally hard for a fixed backbone to generate discriminative features for frames of both low and high quality. To mitigate this challenge, we present a meta-learning scheme that learns to adapt the backbone using temporal features. Specifically, we propose to summarise the temporal features into a fixed-size representation, which is then used to adapt the backbone so that it generates discriminative features for both low- and high-quality frames. We demonstrate that the proposed approach can be easily incorporated into the latest temporal aggregation approaches with almost no impact on inference speed. Experiments on the ImageNet VID dataset show consistent gains over state-of-the-art methods.
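To make the adaptation idea concrete, below is a minimal PyTorch sketch of one plausible realisation: reference-frame features are pooled into a fixed-size summary vector, which then produces channel-wise scale and shift parameters (FiLM-style conditioning) applied inside a backbone block. All module names, the pooling-based summariser, and the FiLM-style mechanism are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class TemporalSummariser(nn.Module):
    """Pools per-frame reference features into one fixed-size vector
    (a simple illustrative choice, not the paper's exact summariser)."""
    def __init__(self, feat_dim: int, summary_dim: int):
        super().__init__()
        self.proj = nn.Linear(feat_dim, summary_dim)

    def forward(self, ref_feats: torch.Tensor) -> torch.Tensor:
        # ref_feats: (num_ref_frames, C, H, W). Global-average-pool each
        # frame spatially, then average over frames for a single summary.
        pooled = ref_feats.mean(dim=(2, 3))       # (num_ref_frames, C)
        return self.proj(pooled.mean(dim=0))      # (summary_dim,)

class AdaptiveBlock(nn.Module):
    """A backbone block whose output is modulated by the temporal summary
    via per-channel scale/shift (an assumed FiLM-style mechanism)."""
    def __init__(self, channels: int, summary_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_scale = nn.Linear(summary_dim, channels)
        self.to_shift = nn.Linear(summary_dim, channels)

    def forward(self, x: torch.Tensor, summary: torch.Tensor) -> torch.Tensor:
        h = self.conv(x)                                   # (B, C, H, W)
        scale = self.to_scale(summary).view(1, -1, 1, 1)
        shift = self.to_shift(summary).view(1, -1, 1, 1)
        # Adapt the block's features conditioned on the video-level summary.
        return h * (1 + scale) + shift

# Usage: summarise reference-frame features, then run the key frame
# through the adapted block (shapes are arbitrary placeholders).
summariser = TemporalSummariser(feat_dim=256, summary_dim=128)
block = AdaptiveBlock(channels=256, summary_dim=128)
ref_feats = torch.randn(8, 256, 14, 14)    # 8 reference frames
key_feat = torch.randn(1, 256, 14, 14)     # key-frame feature map
summary = summariser(ref_feats)
adapted = block(key_feat, summary)         # (1, 256, 14, 14)
```

Because the summary is computed once per video and the conditioning adds only two small linear layers per block, a design along these lines would add almost no inference cost, consistent with the claim above.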