Clockwork Variational Autoencoders

Abstract: Deep learning has enabled algorithms to generate realistic images. However, accurately predicting long video sequences requires understanding long-term dependencies and remains an open challenge. While existing video prediction models succeed at generating sharp images, they tend to fail at accurately predicting far into the future. We introduce the Clockwork VAE (CW-VAE), a video prediction model that leverages a hierarchy of latent sequences, where higher levels tick at slower intervals. We demonstrate the benefits of both hierarchical latents and temporal abstraction on four diverse video prediction datasets with sequences of up to 1000 frames, where CW-VAE outperforms top video prediction models. Additionally, we propose a Minecraft benchmark for long-term video prediction. We conduct several experiments to gain insights into CW-VAE and confirm that slower levels learn to represent objects that change more slowly in the video, while faster levels learn to represent faster objects.
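The idea of higher levels ticking at slower intervals can be illustrated with a small scheduling sketch. This is not the authors' implementation: the function names, the number of levels, and the assumption that level l updates every factor**l steps are hypothetical choices made only for illustration, since the abstract does not specify the exact intervals.

```python
def active_levels(t, num_levels=3, factor=2):
    """Return the levels that update at time step t.

    Assumption (not stated in the abstract): level `lvl` ticks once every
    factor**lvl steps, so level 0 is the fastest and updates every step.
    """
    return [lvl for lvl in range(num_levels) if t % (factor ** lvl) == 0]


def tick_schedule(num_steps, num_levels=3, factor=2):
    """List the active levels for each of the first `num_steps` time steps."""
    return [active_levels(t, num_levels, factor) for t in range(num_steps)]


if __name__ == "__main__":
    # Print which levels of the hypothetical hierarchy update at each step.
    for t, levels in enumerate(tick_schedule(8)):
        print(t, levels)
```

Under these assumptions, the lowest level updates at every frame while each higher level holds its state for longer stretches, which is one way slower levels could come to capture slowly changing content.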

06/12/2020
