Abstract:
We address the challenging problem of learning motion representations using deep models for video recognition. To this end, we make use of attention modules that learn to highlight regions in the video and aggregate features for recognition. Specifically, we propose to leverage output attention maps as a vehicle to transfer the learned representation from a flow network to an RGB network. We systematically study the design of attention modules and develop a novel method for attention distillation. Our method is evaluated on major action benchmarks. We show that our method improves the performance of the baseline RGB network by a significant margin. Moreover, we demonstrate that attention serves as a more robust tool for knowledge distillation in the video domain. We believe our method provides a step towards learning motion-aware representations in deep models and offers valuable insights for knowledge distillation. Our project page is available at https://aptx4869lm.github.io/AttentionDistillation/