Catastrophic Fisher Explosion: Early Phase Fisher Matrix Impacts Generalization

18/07/2021

Stanislaw Jastrzebski, Devansh Arpit, Oliver Astrand, Giancarlo Kerg, Huan Wang, Caiming Xiong, Richard Socher, Kyunghyun Cho, Krzysztof J Geras

Keywords: Deep Learning, Optimization for Deep Networks

Abstract: The early phase of training a deep neural network has a dramatic effect on the local curvature of the loss function. For instance, using a small learning rate does not guarantee stable optimization because the optimization trajectory has a tendency to steer towards regions of the loss surface with increasing local curvature. We ask whether this tendency is connected to the widely observed phenomenon that the choice of the learning rate strongly influences generalization. We first show that stochastic gradient descent (SGD) implicitly penalizes the trace of the Fisher Information Matrix (FIM), a measure of the local curvature, from the start of training. We argue it is an implicit regularizer in SGD by showing that explicitly penalizing the trace of the FIM can significantly improve generalization. We highlight that poor final generalization coincides with the trace of the FIM attaining a large value early in training, to which we refer as catastrophic Fisher explosion. Finally, to gain insight into the regularization effect of penalizing the trace of the FIM, we show that it limits memorization by reducing the learning speed of examples with noisy labels more than that of the examples with clean labels.
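
To make the quantity named in the abstract concrete, here is a minimal sketch (not the authors' implementation) of the trace of the FIM for a toy linear softmax classifier, together with a loss that penalizes that trace explicitly. It assumes the standard definition Tr(F) = E_x E_{y ~ p(y|x)} ||grad log p(y|x)||^2, which for this toy model reduces to the mean over x of (1 - ||p(x)||^2) * ||x||^2. The function names, the penalty weight `alpha`, and the random data are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fim_trace_exact(W, X):
    # Tr(F) for logits z = W x, with labels drawn from the model's own
    # predictive distribution p(y|x); the expectation over y is computed exactly.
    P = softmax(X @ W.T)                      # shape (n_examples, n_classes)
    n, k = P.shape
    total = 0.0
    for i in range(n):
        for label in range(k):
            one_hot = np.zeros(k)
            one_hot[label] = 1.0
            grad = np.outer(one_hot - P[i], X[i])   # d log p(label|x_i) / dW
            total += P[i, label] * np.sum(grad ** 2)
    return total / n

def fim_trace_closed_form(W, X):
    # Same quantity, using E_y ||e_y - p||^2 = 1 - ||p||^2 for this toy model.
    P = softmax(X @ W.T)
    return np.mean((1.0 - np.sum(P ** 2, axis=1)) * np.sum(X ** 2, axis=1))

def penalized_loss(W, X, y, alpha):
    # Cross-entropy plus an explicit penalty on the trace of the FIM.
    # alpha is a hypothetical penalty weight, not a value from the paper.
    P = softmax(X @ W.T)
    nll = -np.mean(np.log(P[np.arange(len(y)), y]))
    return nll + alpha * fim_trace_closed_form(W, X)

# Hypothetical toy data: 5 examples, 4 features, 3 classes.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
X = rng.normal(size=(5, 4))
y = rng.integers(0, 3, size=5)

print("Tr(F), exact expectation:", fim_trace_exact(W, X))
print("Tr(F), closed form:      ", fim_trace_closed_form(W, X))
print("penalized loss:          ", penalized_loss(W, X, y, alpha=0.1))
```

For a deep network no closed form like this is available, so the trace would have to be estimated, for example from gradient norms with labels sampled from the model; that step is outside this sketch.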

The talk and the corresponding paper were published at the ICML 2021 virtual conference.

Similar Papers

26/04/2020

The Break-Even Point on Optimization Trajectories of Deep Neural Networks

Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor, Kyunghyun Cho, Krzysztof Geras

Keywords: generalization, sgd, learning rate, batch size, hessian, curvature, trajectory, optimization

4:42

12/07/2020

Extrapolation for Large-batch Training in Deep Learning

Tao Lin, Lingjing Kong, Sebastian Stich, Martin Jaggi

Keywords: Deep Learning - Algorithms

13:21

02/02/2021

Distribution Adaptive INT8 Quantization for Training CNNs

Kang Zhao, Sida Huang, Pan Pan, Yinghan Li, Yingya Zhang, Zhenyu Gu, Yinghui Xu

16:42

06/12/2021

What training reveals about neural network complexity

Andreas Loukas, Marinos Poiitis, Stefanie Jegelka

Keywords: deep learning

8:29

03/05/2021

Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods

Taiji Suzuki, Akiyama Shunta

Keywords: local Rademacher complexity, minimax optimal rate, excess risk, linear estimator, kernel method, fast learning rate

10:13

02/02/2021

Deep Frequency Principle Towards Understanding Why Deeper Learning Is Faster

Zhiqin John Xu, Hanxu Zhou

19:40

12/07/2020

Double Trouble in Double Descent: Bias and Variance(s) in the Lazy Regime

Stéphane d'Ascoli, Maria Refinetti, Giulio Biroli, Florent Krzakala

Keywords: Deep Learning - Theory

15:11

02/02/2021

Tackling Instance-Dependent Label Noise via a Universal Probabilistic Model

Qizhou Wang, Bo Han, Tongliang Liu, Gang Niu, Jian Yang, Chen Gong

14:56

03/05/2021

Bypassing the Ambient Dimension: Private SGD with Gradient Subspace Identification

Yingxue Zhou, Steven Wu, Arindam Banerjee

5:02

06/12/2021

A Theoretical Analysis of Fine-tuning with Linear Teachers

Gal Shachaf, Alon Brutzkus, Amir Globerson

Keywords: theory, deep learning, transfer learning

14:01

03/05/2021

Revisiting Locally Supervised Learning: an Alternative to End-to-end Training

Yulin Wang, Zanlin Ni, Shiji Song, Le Yang, Gao Huang

Keywords: Deep learning, Locally supervised training

5:03

06/12/2020

Towards Better Generalization of Adaptive Gradient Methods

Yingxue Zhou, Belhal Karimi, Jinxing Yu, Zhiqiang Xu, Ping Li

3:21

12/07/2020

Towards Understanding the Dynamics of the First-Order Adversaries

Zhun Deng, Hangfeng He, Jiaoyang Huang, Weijie Su

Keywords: Adversarial Examples

11:05

13/04/2021

On the generalization properties of adversarial training

Yue Xing, Qifan Song, Guang Cheng

3:05

02/02/2021

Fast and Scalable Adversarial Training of Kernel SVM via Doubly Stochastic Gradients

Huimin Wu, Zhengmian Hu, Bin Gu

14:04

03/05/2021

A Diffusion Theory For Deep Learning Dynamics: Stochastic Gradient Descent Exponentially Favors Flat Minima

Zeke Xie, Issei Sato, Masashi Sugiyama

Keywords: flat minima, SGD, deep learning dynamics, stochastic optimization, diffusion

4:37

18/07/2021

The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks

Bohan Wang, Qi Meng, Wei Chen, Tie-Yan Liu

Keywords: Theory, Deep learning Theory

16:53

26/04/2020

Can gradient clipping mitigate label noise?

Aditya Krishna Menon, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar

4:56

06/12/2021

AC-GC: Lossy Activation Compression with Guaranteed Convergence

R David Evans, Tor Aamodt

Keywords: deep learning, optimization, graph learning

14:39

12/07/2020

Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise

Umut Simsekli, Lingjiong Zhu, Yee Whye Teh, Mert Gurbuzbalaban

Keywords: Deep Learning - Theory

15:37

18/07/2021

Lipschitz normalization for self-attention layers with application to graph neural networks

George Dasoulas, Kevin Scaman, Aladin Virmaux

Keywords: Deep Learning

4:53

20/07/2020

A type of generalization error induced by initialization in deep neural networks

Yaoyu Zhang, Zhi-Qin John Xu, Tao Luo, Zheng Ma

17:33

06/12/2021

Scaling Ensemble Distribution Distillation to Many Classes with Proxy Targets

Max Ryabinin, Andrey Malinin, Mark Gales

Keywords: machine learning

12:36

06/12/2020

Efficient Marginalization of Discrete and Structured Latent Variables via Sparsity

Gonçalo Correia, Vlad Niculae, Wilker Aziz, André Martins

3:38

19/08/2021

Two-stage Training for Learning from Label Proportions

Jiabin Liu, Bo Wang, Xin Shen, Zhiquan Qi, Yingjie Tian

Keywords: Machine Learning, Classification, Deep Learning, Weakly Supervised Learning

13:23

06/12/2021

Network-to-Network Regularization: Enforcing Occam's Razor to Improve Generalization

Rohan Ghosh, Mehul Motani

Keywords: theory, deep learning, machine learning

14:07

13/04/2021

Learning with gradient descent and weakly convex losses

Dominic Richards, Mike Rabbat

3:20

03/05/2021

Meta-learning with negative learning rates

Alberto Bernacchia

Keywords: Meta-learning

5:19

19/08/2021

Towards Understanding the Spectral Bias of Deep Learning

Yuan Cao, Zhiying Fang, Yue Wu, Ding-Xuan Zhou, Quanquan Gu

Keywords: Machine Learning, Deep Learning, Kernel Methods

14:42

05/04/2021

An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems

Ahmed M. Abdelmoniem, Ahmed Elzanaty, Mohamed-Slim Alouini, Marco Canini

22:37

05/04/2021

An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems

Ahmed M. Abdelmoniem, Ahmed Elzanaty, Mohamed-Slim Alouini, Marco Canini

4:13

06/12/2020

Auxiliary Task Reweighting for Minimum-data Learning

Baifeng Shi, Judy Hoffman, Kate Saenko, Trevor Darrell, Huijuan Xu

3:28

06/12/2021

Understanding and Improving Early Stopping for Learning with Noisy Labels

Yingbin Bai, Erkun Yang, Bo Han, Yanhua Yang, Jiatong Li, Yinian Mao, Gang Niu, Tongliang Liu

Keywords: deep learning, optimization, machine learning, vision

8:19

06/12/2021

Constrained Optimization to Train Neural Networks on Critical and Under-Represented Classes

Sara Sangalli, Ertunc Erdil, Andreas Hötker, Olivio Donati, Ender Konukoglu

Keywords: deep learning, optimization, machine learning

14:04

23/08/2020

Rethinking pruning for accelerating deep inference at the edge

Dawei Gao, Xiaoxi He, Zimu Zhou, Yongxin Tong, Ke Xu, Lothar Thiele

Keywords: automatic speech recognition, deep learning, named entity recognition, network pruning, sequence labelling

13:43

26/04/2020

Revisiting Self-Training for Neural Sequence Generation

Junxian He, Jiatao Gu, Jiajun Shen, Marc'Aurelio Ranzato

Keywords: self-training, semi-supervised learning, neural sequence generation

5:07

02/02/2021

Step-Ahead Error Feedback for Distributed Training with Compressed Gradient

An Xu, Zhouyuan Huo, Heng Huang

18:26

03/05/2021

Exploring the Uncertainty Properties of Neural Networks’ Implicit Priors in the Infinite-Width Limit

Ben Adlam, Jaehoon Lee, Lechao Xiao, Jeffrey Pennington, Jasper Snoek

Keywords: Deep Learning, Bayesian Neural Networks, Neural Network Gaussian Process, Infinite-Width Limit, Uncertainty, Gaussian Process

4:34

05/01/2021

Group Softmax Loss With Discriminative Feature Grouping

Takumi Kobayashi

4:49

18/07/2021

A Distribution-dependent Analysis of Meta Learning

Mikhail Konobeev, Ilja Kuzborskij, Csaba Szepesvari

Keywords: Theory, Statistical Learning Theory

5:06