A Local-to-Global Approach to Multi-Modal Movie Scene Segmentation

14/06/2020

Anyi Rao, Linning Xu, Yu Xiong, Guodong Xu, Qingqiu Huang, Bolei Zhou, Dahua Lin

Keywords: video-analysis, multi-modal, dataset, long-video, structural-representation, high-level-understanding, story/plot-understanding

Abstract: Scene, as the crucial unit of storytelling in movies, contains complex activities of actors and their interactions in a physical environment. Identifying the composition of scenes serves as a critical step towards semantic understanding of movies. This is very challenging compared to the videos studied in conventional vision problems, e.g. action recognition, as scenes in movies usually contain much richer temporal structures and more complex semantic information. Towards this goal, we scale up the scene segmentation task by building a large-scale video dataset MovieScenes, which contains 21K annotated scene segments from 150 movies. We further propose a local-to-global scene segmentation framework, which integrates multi-modal information across three levels, i.e. clip, segment, and movie. This framework is able to distill complex semantics from hierarchical temporal structures over a long movie, providing top-down guidance for scene segmentation. Our experiments show that the proposed network is able to segment a movie into scenes with high accuracy, consistently outperforming previous methods. We also found that pretraining on our MovieScenes can bring significant improvements to the existing approaches.

The talk and the corresponding paper were published at the CVPR 2020 virtual conference.

Similar Papers

30/11/2020

Condensed Movies: Story Based Retrieval with Contextual Embeddings

Max Bain, Arsha Nagrani, Andrew Brown, Andrew Zisserman

9:46

19/04/2021

DOCENT: Learning self-supervised entity representations from large document collections

Yury Zemlyanskiy, Sudeep Gandhe, Ruining He, Bhargav Kanagal, Anirudh Ravula, Juraj Gottweis, Fei Sha, Ilya Eckstein

6:37

01/07/2020

Screenplay Quality Assessment: Can We Predict Who Gets Nominated?

Ming-Chang Chiu, Tiantian Feng, Xiang Ren, Shrikanth Narayanan

8:27

01/07/2020

On Incorporating Structural Information to improve Dialogue Response Generation

Nikita Moghe, Priyesh Vijayan, Balaraman Ravindran, Mitesh M. Khapra

13:00

16/11/2020

Joint Estimation and Analysis of Risk Behavior Ratings in Movie Scripts

Victor Martinez, Krishna Somandepalli, Yalda Tehranian-Uhls, Shrikanth Narayanan

Keywords: creative production, movie representations, multi-task approach, violent content

11:14

16/11/2020

Multi-view Story Characterization from Movie Plot Synopses and Reviews

Sudipta Kar, Gustavo Aguilar, Mirella Lapata, Thamar Solorio

Keywords: characterizing stories, multi-view model, theme, style

8:59

02/02/2021

Movie Summarization via Sparse Graph Construction

Pinelopi Papalampidi, Frank Keller, Mirella Lapata

16:49

03/05/2021

MoVie: Revisiting Modulated Convolutions for Visual Counting and Beyond

Duy-Kien Nguyen, Vedanuj Goswami, Xinlei Chen

Keywords: visual question answering, modulated convolution, common object counting, visual counting, visual reasoning

5:24

05/01/2021

Temporal Context Aggregation for Video Retrieval With Contrastive Learning

Jie Shao, Xin Wen, Bingchen Zhao, Xiangyang Xue

4:50

04/07/2020

ScriptWriter: Narrative-Guided Script Generation

Yutao Zhu, Ruihua Song, Zhicheng Dou, Jian-Yun Nie, Jin Zhou

Keywords: Narrative-Guided Generation, dialogue systems, ScriptWriter, model ScriptWriter

10:13

02/02/2021

CompFeat: Comprehensive Feature Aggregation for Video Instance Segmentation

Yang Fu, Linjie Yang, Ding Liu, Thomas S. Huang, Humphrey Shi

16:24

04/07/2020

Screenplay Summarization Using Latent Narrative Structure

Pinelopi Papalampidi, Frank Keller, Lea Frermann, Mirella Lapata

Keywords: Screenplay Summarization, summarization, general-purpose models, position heuristics

11:19

17/08/2020

Fast and deep facial deformations

Stephen W. Bailey, Dalton Omens, Paul Dilorenzo, James F. O’Brien

Keywords: mesh deformations, facial animation, deep learning, function approximation, character rig

5:03

03/05/2021

A Good Image Generator Is What You Need for High-Resolution Video Synthesis

Yu Tian, Jian Ren, Menglei Chai, Kyle Olszewski, Xi Peng, Dimitris Metaxas, Sergey Tulyakov

Keywords: contrastive learning, cross-domain video generation, high-resolution video generation

10:03

16/11/2020

MovieChats: Chat like Humans in a Closed Domain

Hui Su, Xiaoyu Shen, Zhou Xiao, Zheng Zhang, Ernie Chang, Cheng Zhang, Cheng Niu, Jie Zhou

Keywords: in-depth chat, intent prediction, knowledge retrieval, neural approach

10:05

05/01/2021

Coarse Temporal Attention Network (CTA-Net) for Driver's Activity Recognition

Zachary Wharton, Ardhendu Behera, Yonghuai Liu, Nik Bessis

5:30

30/11/2020

Mask-Ranking Network for Semi-Supervised Video Object Segmentation

Wenjing Li, Xiang Zhang, Yujie Hu, Yingqi Tang

5:36

14/06/2020

Video Super-Resolution With Temporal Group Attention

Takashi Isobe, Songjiang Li, Xu Jia, Shanxin Yuan, Gregory Slabaugh, Chunjing Xu, Ya-Li Li, Shengjin Wang, Qi Tian

Keywords: video processing, video super-resolution

1:00

06/12/2021

Deep Contextual Video Compression

Jiahao Li, Bin Li, Yan Lu

6:33

05/01/2021

High-Quality Frame Interpolation via Tridirectional Inference

Jinsoo Choi, Jaesik Park, In So Kweon

4:08

16/11/2020

Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning

Zhiyuan Fang, Tejas Gokhale, Pratyay Banerjee, Chitta Baral, Yezhou Yang

Keywords: captioning, video understanding, video captioning, generating captions

12:02

04/07/2020

Dense-Caption Matching and Frame-Selection Gating for Temporal Localization in VideoQA

Hyounghun Kim, Zineng Tang, Mohit Bansal

Keywords: Dense-Caption Matching, Temporal VideoQA, answering questions, frame problem

10:56

02/02/2021

SMART Frame Selection for Action Recognition

Shreyank N Gowda, Marcus Rohrbach, Laura Sevilla-Lara

14:10

14/06/2020

Video Instance Segmentation Tracking With a Modified VAE Architecture

Chung-Ching Lin, Ying Hung, Rogerio Feris, Linglin He

Keywords: video instance segmentation, video object tracking, variational autoencoder, vae, gaussian process, multi-task learning

1:01

14/06/2020

Memory Enhanced Global-Local Aggregation for Video Object Detection

Yihong Chen, Yue Cao, Han Hu, Liwei Wang

Keywords: video object detection, video analysis, object detection, memory, global-local aggregation

1:00

05/01/2021

PDAN: Pyramid Dilated Attention Network for Action Detection

Rui Dai, Srijan Das, Luca Minciullo, Lorenzo Garattoni, Gianpiero Francesca, Francois Bremond

5:00

05/01/2021

Data-Efficient Alignment of Multimodal Sequences by Aligning Gradient Updates and Internal Feature Distributions

Jianan Wang, Boyang Li, Xiangyu Fan, Jing Lin, Yanwei Fu

4:49

14/06/2020

Temporal Pyramid Network for Action Recognition

Ceyuan Yang, Yinghao Xu, Jianping Shi, Bo Dai, Bolei Zhou

Keywords: video understanding, action recognition, visual tempo, temporal pyramid

1:01

14/06/2020

DeepCap: Monocular Human Performance Capture Using Weak Supervision

Marc Habermann, Weipeng Xu, Michael Zollhöfer, Gerard Pons-Moll, Christian Theobalt

Keywords: monocular human performance capture, 3d pose estimation, non-rigid surface deformation, human body

4:56

04/07/2020

MART: Memory-Augmented Recurrent Transformer for Coherent Video Paragraph Captioning

Jie Lei, Liwei Wang, Yelong Shen, Dong Yu, Tamara Berg, Mohit Bansal

Keywords: Coherent Captioning, Generating descriptions, captioning tasks, coherent generation

10:51

14/06/2020

Action Genome: Actions As Compositions of Spatio-Temporal Scene Graphs

Jingwei Ji, Ranjay Krishna, Li Fei-Fei, Juan Carlos Niebles

Keywords: action recognition, scene graph, video understanding, relationships, composition, action, activity, video

1:01

02/02/2021

Large Motion Video Super-Resolution with Dual Subnet and Multi-Stage Communicated Upsampling

Hongying Liu, Peng Zhao, Zhubo Ruan, Fanhua Shang, Yuanyuan Liu

19:47

14/06/2020

Classifying, Segmenting, and Tracking Object Instances in Video with Mask Propagation

Gedas Bertasius, Lorenzo Torresani

Keywords: instance segmentation, object detection, object tracking, video analysis

4:59

05/01/2021

Alleviating Over-Segmentation Errors by Detecting Action Boundaries

Yuchi Ishikawa, Seito Kasai, Yoshimitsu Aoki, Hirokatsu Kataoka

4:48

06/12/2020

Convolutional Tensor-Train LSTM for Spatio-Temporal Learning

Jiahao Su, Wonmin Byeon, Jean Kossaifi, Furong Huang, Jan Kautz, Anima Anandkumar

3:29

06/12/2020

Self-Supervised MultiModal Versatile Networks

Jean-Baptiste Alayrac, Adria Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman

3:25

14/06/2020

Listen to Look: Action Recognition by Previewing Audio

Ruohan Gao, Tae-Hyun Oh, Kristen Grauman, Lorenzo Torresani

Keywords: action recognition, audio-visual learning, multi-modal learning, cross-modal learning, video understanding

1:01

14/06/2020

TEA: Temporal Excitation and Aggregation for Action Recognition

Yan Li, Bin Ji, Xintian Shi, Jianguo Zhang, Bin Kang, Limin Wang

Keywords: action recognition, temporal modeling, motion encoding, temporal aggregation

1:01

14/06/2020

Syntax-Aware Action Targeting for Video Captioning

Qi Zheng, Chaoyue Wang, Dacheng Tao

Keywords: video and language, video captioning, action predicting

1:01

22/11/2021

Knowing What, Where and When to Look: Video Action modelling with Attention

Juan-Manuel Perez-Rua, Brais Martinez, Xiatian Zhu, Antoine S Toisoul, Victor A Escorcia, Tao Xiang

Keywords: Action recognition, Fine-grained action, video attention, Spatial attention, Channel attention, Temporal attention, Spatio-temporal attention, Feature refinement

2:46