A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima

Abstract: Stochastic Gradient Descent (SGD) and its variants are mainstream methods for training deep networks in practice. SGD is known to find a flat minimum that often generalizes well. However, it is mathematically unclear how deep learning can select a flat minimum among so many minima. To answer the question quantitatively, we develop a density diffusion theory to reveal how minima selection quantitatively depends on the minima sharpness and the hyperparameters. To the best of our knowledge, we are the first to theoretically and empirically prove that, benefited from the Hessian-dependent covariance of stochastic gradient noise, SGD favors flat minima exponentially more than sharp minima, while Gradient Descent (GD) with injected white noise favors flat minima only polynomially more than sharp minima. We also reveal that either a small learning rate or large-batch training requires exponentially many iterations to escape from minima in terms of the ratio of the batch size and learning rate. Thus, large-batch training cannot search flat minima efficiently in a realistic computational time.

03/05/2021

Deep Learning, Bayesian Neural Networks, Neural Network Gaussian Process, Infinite-Width Limit, Uncertainty, Gaussian Process

4:34

18/07/2021

A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima

Zeke Xie, Issei Sato, Masashi Sugiyama

Comments

Similar Papers

Understanding Over-parameterization in Generative Adversarial Networks

Yogesh Balaji, Mohammadmahdi Sajedi, Neha Kalibhat and Mucong Ding, Dominik Stöger, Mahdi Soltanolkotabi, Soheil Feizi

Keywords Abstract Paper

min-max optimization, Over-parameterization, GAN

The Heavy-Tail Phenomenon in SGD

Mert Gurbuzbalaban, Umut Simsekli, Lingjiong Zhu

Keywords Abstract Paper

Optimization, Stochastic Optimization

Efficient Neural Network Training via Forward and Backward Propagation Sparsification

Xiao Zhou, Weizhong Zhang, Zonghao Chen and SHIZHE DIAO, Tong Zhang

Keywords Abstract Paper

deep learning, optimization

Distribution Adaptive INT8 Quantization for Training CNNs

Kang Zhao, Sida Huang, Pan Pan and Yinghan Li, Yingya Zhang, Zhenyu Gu, Yinghui Xu

Keywords Abstract Paper

GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning

Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, Rishabh Iyer

Keywords Abstract Paper

A Faster Decentralized Algorithm for Nonconvex Minimax Problems

Wenhan Xian, Feihu Huang, Yanfu Zhang, Heng Huang

Keywords Abstract Paper

optimization, machine learning, adversarial robustness and security

Extrapolation for Large-batch Training in Deep Learning

Tao LIN, Lingjing Kong, Sebastian Stich, Martin Jaggi

Keywords Abstract Paper

Deep Learning - Algorithms

Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise

Umut Simsekli, Lingjiong Zhu, Yee Whye Teh, Mert Gurbuzbalaban

Keywords Abstract Paper

Deep Learning - Theory

Direction Matters: On the Implicit Bias of Stochastic Gradient Descent with Moderate Learning Rate

Jingfeng Wu, Difan Zou, vladimir braverman, Quanquan Gu

Keywords Abstract Paper

regularization, SGD, implicit bias

Improved Analysis of Clipping Algorithms for Non-convex Optimization

Bohang Zhang, Jikai Jin, Cong Fang, Liwei Wang

Keywords Abstract Paper

Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization

Zeke Xie, Li Yuan, Zhanxing Zhu, Masashi Sugiyama

Keywords Abstract Paper

Optimization, Stochastic Optimization

The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks

Bohan Wang, Qi Meng, Wei Chen, Tie-Yan Liu

Keywords Abstract Paper

Theory, Deep learning Theory

Towards Better Generalization of Adaptive Gradient Methods

Yingxue Zhou, Belhal Karimi, Jinxing Yu and Zhiqiang Xu, Ping Li

Keywords Abstract Paper

Gradient Descent on Two-layer Nets: Margin Maximization and Simplicity Bias

Kaifeng Lyu, Zhiyuan Li, Runzhe Wang, Sanjeev Arora

Keywords Abstract Paper

deep learning, optimization, machine learning

The Implicit Bias of Minima Stability: A View from Function Space

Rotem Mulayoff, Tomer Michaeli, Daniel Soudry

Keywords Abstract Paper

deep learning, optimization

Functional Gradient Boosting for Learning Residual-like Networks with Statistical Guarantees

Atsushi Nitanda, Taiji Suzuki

Keywords Abstract Paper

Exploring the Uncertainty Properties of Neural Networks’ Implicit Priors in the Infinite-Width Limit

Ben Adlam, Jaehoon Lee, Lechao Xiao and Jeffrey Pennington, Jasper Snoek

Keywords Abstract Paper

Deep Learning, Bayesian Neural Networks, Neural Network Gaussian Process, Infinite-Width Limit, Uncertainty, Gaussian Process

Self Normalizing Flows

T. Anderson Keller, Jorn Peters, Priyank Jaini and Emiel Hoogeboom, Patrick Forré, Max Welling

Keywords Abstract Paper

Deep Learning, Generative Models

Effect of Activation Functions on the Training of Overparametrized Neural Nets

Abhishek Panigrahi, Abhishek Shetty, Navin Goyal

Keywords Abstract Paper

activation functions, deep learning theory, neural networks

Implicit Bias of Gradient Descent based Adversarial Training on Separable Data

Yan Li, Ethan X.Fang, Huan Xu, Tuo Zhao

Keywords Abstract Paper

implicit bias, adversarial training, robustness, gradient descent

The Neural Tangent Kernel in High Dimensions: Triple Descent and a Multi-Scale Theory of Generalization

Yogesh Balaji, Mohammadmahdi Sajedi, Neha Kalibhat and
Mucong Ding, Dominik Stöger, Mahdi Soltanolkotabi, Soheil Feizi

Keywords Paper

Keywords Paper

Xiao Zhou, Weizhong Zhang, Zonghao Chen and
SHIZHE DIAO, Tong Zhang

Keywords Paper

Kang Zhao, Sida Huang, Pan Pan and
Yinghan Li, Yingya Zhang, Zhenyu Gu, Yinghui Xu

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Yingxue Zhou, Belhal Karimi, Jinxing Yu and
Zhiqiang Xu, Ping Li

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Ben Adlam, Jaehoon Lee, Lechao Xiao and
Jeffrey Pennington, Jasper Snoek

Keywords Paper

T. Anderson Keller, Jorn Peters, Priyank Jaini and
Emiel Hoogeboom, Patrick Forré, Max Welling

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Jonathan Schwarz, Siddhant M Jayakumar, Razvan Pascanu and
Peter E Latham, Yee Teh

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Keywords Paper

Jack Parker-Holder, Luke Metz, Cinjon Resnick and
Hengyuan Hu, Adam Lerer, Alistair Letcher, Alexander Peysakhovich, Aldo Pacchiano, Jakob Foerster

Keywords Paper

Keywords Paper

Keywords Paper

Kirill Struminsky, Artyom Gadetsky, Denis Rakitin and
Danil Karpushkin, Dmitry Vetrov

Keywords Paper

Keywords Paper

Keywords Paper