12/07/2020

Momentum Improves Normalized SGD

Ashok Cutkosky, Harsh Mehta

Keywords: Optimization - Non-convex

Abstract: We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. Then, we consider the case of objectives with bounded second derivative and show that in this case a small tweak to the momentum formula allows normalized SGD with momentum to find an $\epsilon$-critical point in $O(1/\epsilon^{3.5})$ iterations, matching the best-known rates without accruing any logarithmic factors or dependence on dimension. We provide an adaptive learning rate schedule that automatically improves convergence rates when the variance in the gradients is small. Finally, we show that our method is effective when employed on popular large-scale tasks such as ResNet-50 and BERT pretraining, matching the performance of the disparate methods previously used to obtain state-of-the-art results on each task.
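The following is a minimal NumPy sketch of the core update the abstract describes: maintain momentum as an exponential moving average of stochastic gradients, then step a fixed distance in its normalized direction. All names and hyperparameter values here (grad_fn, lr, beta, steps) are illustrative assumptions, not the authors' code, and the sketch omits the second-order momentum tweak and the adaptive learning rate schedule mentioned in the abstract.

```python
import numpy as np

def normalized_sgd_momentum(grad_fn, x0, lr=0.01, beta=0.9, steps=1000):
    """Sketch of normalized SGD with momentum.

    grad_fn: callable returning a stochastic gradient estimate at x.
    lr:      step size (the update always has length lr, since the
             momentum direction is normalized before stepping).
    beta:    momentum decay for the exponential moving average.
    """
    x = np.asarray(x0, dtype=float).copy()
    m = np.zeros_like(x)                    # momentum buffer
    for _ in range(steps):
        g = grad_fn(x)                      # stochastic gradient estimate
        m = beta * m + (1.0 - beta) * g     # exponential moving average
        norm = np.linalg.norm(m)
        if norm > 0.0:
            x -= lr * m / norm              # normalized (unit-length) step
    return x

# Toy usage: minimize f(x) = ||x||^2 / 2 with noisy gradients.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + 0.1 * rng.standard_normal(x.shape)
x_final = normalized_sgd_momentum(noisy_grad, x0=np.ones(10),
                                  lr=0.05, beta=0.9, steps=2000)
print(np.linalg.norm(x_final))              # should be close to zero
```

Because the step is normalized, its size is independent of the gradient magnitude; the role of momentum is to average out noise in the step direction, which is the sense in which it substitutes for a large batch.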

Talk and paper published at the ICML 2020 virtual conference.

