ResT: An Efficient Transformer for Visual Recognition

06/12/2021

Qinglong Zhang, Yu-Bin Yang

Keywords: machine learning, transformers, vision

Abstract: This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to tackle raw images with a fixed resolution, our ResT has several advantages: (1) A memory-efficient multi-head self-attention is built, which compresses the memory by a simple depth-wise convolution and projects the interaction across the attention-heads dimension while keeping the diversity ability of multi-heads; (2) Positional encoding is constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) Instead of the straightforward tokenization at the beginning of each stage, we design the patch embedding as a stack of overlapping convolution operations with stride on the token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.
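
To give a rough sense of why compressing the keys and values with a strided depth-wise convolution saves memory, the following sketch counts the entries in one head's attention map with and without such a reduction. It is an illustration only, not the authors' implementation: the function names, the 56x56 token map, the stride of 8, and the choice of kernel size equal to stride with no padding are all assumptions made for this example.

```python
from typing import Tuple


def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-length formula for a strided convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def attention_entries(height: int, width: int, stride: int) -> Tuple[int, int]:
    """Return (full, reduced) attention-map sizes for a single head.

    full    : every token attends to every token, so n * n entries.
    reduced : keys/values are first shrunk by a strided convolution,
              so the map has n * n_reduced entries.
    Assumption (illustrative): kernel == stride and no padding.
    """
    n = height * width
    h_red = conv_out_size(height, kernel=stride, stride=stride, padding=0)
    w_red = conv_out_size(width, kernel=stride, stride=stride, padding=0)
    return n * n, n * (h_red * w_red)


# Hypothetical example: a 56x56 token map reduced with stride 8.
full, reduced = attention_entries(56, 56, stride=8)
print(full, reduced, full // reduced)  # 9834496 153664 64
```

Under these assumptions the attention map shrinks by a factor of stride squared, which is where the memory saving in such a scheme would come from.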

The talk and the respective paper are published at the NeurIPS 2021 virtual conference.

Similar Papers

14/06/2020

Blurry Video Frame Interpolation

Wang Shen, Wenbo Bao, Guangtao Zhai, Li Chen, Xiongkuo Min, Zhiyong Gao

Keywords: video frame interpolation, frame-rate up-conversion, video deblurring, pyramid framework, spatial and temporal optimization

5:01

22/11/2021

Feature Fusion Vision Transformer for Fine-Grained Visual Categorization

Jun Wang, Xiaohan Yu, Yongsheng Gao

Keywords: Fine-grained visual categorization, Vision transformer, Self-attention, Feature Fusion

3:02

06/12/2021

SOFT: Softmax-free Transformer with Linear Complexity

Jiachen Lu, Jinghan Yao, Junge Zhang, Xiatian Zhu, Hang Xu, Weiguo Gao, Chunjing XU, Tao Xiang, Li Zhang

Keywords: robustness, transformers, language

8:04

19/08/2021

GSPL: A Succinct Kernel Model for Group-Sparse Projections Learning of Multiview Data

Danyang Wu, Jin Xu, Xia Dong, Meng Liao, Rong Wang, Feiping Nie, Xuelong Li

Keywords: Machine Learning, Learning Sparse Models, Multi-instance; Multi-label; Multi-view learning, Unsupervised Learning

11:48

06/12/2021

Trash or Treasure? An Interactive Dual-Stream Strategy for Single Image Reflection Separation

Qiming Hu, Xiaojie Guo

Keywords: deep learning

12:25

02/02/2021

MAMBA: Multi-level Aggregation via Memory Bank for Video Object Detection

Guanxiong Sun, Yang Hua, Guosheng Hu, Neil Robertson

16:48

22/11/2021

Temporal Meta-Adaptor for Video Object Detection

Chi Wang, Yang Hua, ZHENG LU, Jian Gao, Neil Robertson

Keywords: video object detection, temporal aggregation, meta-learning, ImageNet VID

6:58

02/02/2021

DenserNet: Weakly Supervised Visual Localization Using Multi-Scale Feature Aggregation

Dongfang Liu, Yiming Cui, Liqi Yan, Christos Mousas, Baijian Yang, Yingjie Chen

16:15

06/12/2021

A Multi-Implicit Neural Representation for Fonts

Pradyumna Reddy, Zhifei Zhang, Matthew Fisher, Hailin Jin, Zhaowen Wang, Niloy Mitra

Keywords: deep learning, representation learning

8:42

03/08/2020

Locally Masked Convolution for Autoregressive Models

Ajay Jain, Pieter Abbeel, Deepak Pathak

8:28

14/06/2020

Efficient Dynamic Scene Deblurring Using Spatially Variant Deconvolution Network With Optical Flow Guided Training

Yuan Yuan, Wei Su, Dandan Ma

Keywords: dynamic scene deblurring, deconvolution neural network, bi-directional optical flow, deformable convolution, deep learning, image restoration

0:57

06/12/2021

Improved Transformer for High-Resolution GANs

Long Zhao, Zizhao Zhang, Ting Chen, Dimitris Metaxas, Han Zhang

Keywords: transformers, generative model

12:11

26/04/2020

U-GAT-IT: Unsupervised Generative Attentional Networks with Adaptive Layer-Instance Normalization for Image-to-Image Translation

Junho Kim, Minjae Kim, Hyeonwoo Kang, Kwang Hee Lee

Keywords: Image-to-Image Translation, Generative Attentional Networks, Adaptive Layer-Instance Normalization

5:08

30/11/2020

Image Captioning through Image Transformer

Sen He, Wentong Liao, Hamed R. Tavakoli, Michael Yang, Bodo Rosenhahn, Nicolas Pugeault

9:49

06/12/2021

Gaussian Kernel Mixture Network for Single Image Defocus Deblurring

Yuhui Quan, Zicong Wu, Hui Ji

Keywords: deep learning

13:56

06/12/2020

Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

Mathilde Caron, Ishan Misra, Julien Mairal, Priya Goyal, Piotr Bojanowski, Armand Joulin

3:22

26/08/2020

Deterministic Decoding for Discrete Data in Variational Autoencoders

Daniil Polykovskiy, Dmitry Vetrov

9:00

30/11/2020

Mask-Ranking Network for Semi-Supervised Video Object Segmentation

Wenjing Li, Xiang Zhang, Yujie Hu, Yingqi Tang

5:36

14/06/2020

FeatureFlow: Robust Video Interpolation via Structure-to-Texture Generation

Shurui Gui, Chaoyue Wang, Qihua Chen, Dacheng Tao

Keywords: frame interpolation, slow motion, video processing, generation framework, deep learning, computer vision

1:00

18/11/2020

Semantic-guided shared feature alignment for occluded person re-IDentification

Xuena Ren, Dongming Zhang, Xiuguo Bao

11:03

22/11/2021

Contextual Convolution Blocks

David Marwood, Shumeet Baluja

Keywords: spatially selective features, convolutional layer, cc-block, self-attention, se-block, squeeze and excitation, excitation map

2:45

06/12/2021

Global Filter Networks for Image Classification

Yongming Rao, Wenliang Zhao, Zheng Zhu, Jiwen Lu, Jie Zhou

Keywords: machine learning, robustness, transformers, vision

9:28

30/11/2020

Robust High Dynamic Range (HDR) Imaging with Complex Motion and Parallax

Zhiyuan Pu, Peiyao Guo, M. Salman Asif, Zhan Ma

7:38

14/06/2020

Softmax Splatting for Video Frame Interpolation

Simon Niklaus, Feng Liu

Keywords: video frame interpolation, softmax splatting, differentiable forward warping, feature pyramids for image synthesis

1:00

06/12/2021

Joint Semantic Mining for Weakly Supervised RGB-D Salient Object Detection

Jingjing Li, Wei Ji, Qi Bi, Cheng Yan, Miao Zhang, Yongri Piao, Huchuan Lu, Li cheng

Keywords: vision

9:03

14/06/2020

Breaking the Cycle – Colleagues Are All You Need

Ori Nizan, Ayellet Tal

Keywords: image-to-image translation, unpaired domain, generative adversarial networks, council-gan, multimodal, style transfer

1:01

06/12/2021

Multi-View Representation Learning via Total Correlation Objective

HyeongJoo Hwang, Geon-Hyeong Kim, Seunghoon Hong, Kee-Eung Kim

Keywords: machine learning, generative model, representation learning

15:06

18/07/2021

Image-Level or Object-Level? A Tale of Two Resampling Strategies for Long-Tailed Detection

Nadine Chang, Zhiding Yu, Yu-Xiong Wang, Anima Anandkumar, Sanja Fidler, Jose Alvarez

Keywords: Applications, Computer Vision

5:17

14/06/2020

Spatially Attentive Output Layer for Image Classification

Ildoo Kim, Woonhyuk Baek, Sungwoong Kim

Keywords: network architecture, attention, self-supervision, weakly-supervised object localization, interpretability, classification, cutmix, cnn

1:01

07/09/2020

Making a Case for 3D Convolutions for Object Segmentation in Videos

Sabarinath Mahadevan, Ali Athar, Aljosa Osep, Laura Leal-Taixé, Bastian Leibe, Sebastian Hennen

Keywords: object tracking, video segmentation, video object segmentation, video scene understanding, object segmentation

8:16

18/07/2021

You Only Sample (Almost) Once: Linear Cost Self-Attention Via Bernoulli Sampling

Zhanpeng Zeng, Yunyang Xiong, Sathya Ravi, Shailesh Acharya, Glenn Fung, Vikas Singh

Keywords: Applications, Natural Language Processing

5:16

14/06/2020

Learning Fused Pixel and Feature-Based View Reconstructions for Light Fields

Jinglei Shi, Xiaoran Jiang, Christine Guillemot

Keywords: light field, view synthesis, feature-based reconstruction, pixel-based reconstruction, deep learning, angular super-resolution

4:56

25/07/2020

3D self-attention for unsupervised video quantization

Jingkuan Song, Ruimin Lang, Xiaosu Zhu, Xing Xu, Lianli Gao, Heng Tao Shen

Keywords: quantization, video retrieval, ann search

9:44

06/12/2020

CoMIR: Contrastive Multimodal Image Representation for Registration

Nicolas Pielawski, Elisabeth Wetzer, Johan Öfverstedt, Jiahao Lu, Carolina Wählby, Joakim Lindblad, Natasa Sladoje

2:55

14/06/2020

UCTGAN: Diverse Image Inpainting Based on Unsupervised Cross-Space Translation

Lei Zhao, Qihang Mo, Sihuan Lin, Zhizhong Wang, Zhiwen Zuo, Haibo Chen, Wei Xing, Dongming Lu

Keywords: image inpainting, diverse image inpainting, image completion, unsupervised cross-space translation, diverse image generation, deep-learning based inpainting, deep learning, multiple-solution inpainting

1:01

03/05/2021

Fuzzy Tiling Activations: A Simple Approach to Learning Sparse Representations Online

Yangchen Pan, Kirby Banman, Martha White

Keywords: natural sparsity, Reinforcement learning, fuzzy tiling activation function, sparse representation

6:22

02/02/2021

Explicitly Modeled Attention Maps for Image Classification

Andong Tan, Duc Tam Nguyen, Maximilian Dax, Matthias Nießner, Thomas Brox

16:59

18/07/2021

OmniNet: Omnidirectional Representations from Transformers

Yi Tay, Mostafa Dehghani, Vamsi Aribandi, Jai Gupta, Philip Pham, Zhen Qin, Dara Bahri, Da-Cheng Juan, Don Metzler

Keywords: Deep Learning, Predictive Models, Algorithms, Representation Learning; Neuroscience and Cognitive Science; Neuroscience and Cognitive Science, Problem Solvin, Deep Learning, Architectures

17:00

22/11/2021

Hierarchical Interaction Network for Video Object Segmentation from Referring Expressions

Zhao Yang, Yansong Tang, Luca Bertinetto, Hengshuang Zhao, Philip Torr

Keywords: segmentation, video object segmentation, referring segmentation, referring video object segmentation, video object segmentation from referring expressions, referring image segmentation, referring image comprehension, optical flow, visual grounding

2:57

06/12/2020

RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference

Oindrila Saha, Aditya Kusupati, Harsha Simhadri, Manik Varma, Prateek Jain

3:30