05/04/2021

Understanding and Improving Failure Tolerant Training for Deep Learning Recommendation with Partial Recovery

Kiwan Maeng, Shivam Bharuka, Isabel Gao, Mark Jeffrey, Vikram Saraph, Bor-Yiing Su, Caroline Trippel, Jiyan Yang, Mike Rabbat, Brandon Lucia, Carole-Jean Wu

Abstract: This paper proposes and optimizes a partial recovery training system, CPR, for recommendation models. CPR relaxes the consistency requirement, allowing non-failed nodes to proceed without loading checkpoints when a node fails during training, which reduces failure-related overhead. To the best of our knowledge, this paper is the first to perform a data-driven, in-depth analysis of applying partial recovery to recommendation models, identifying a trade-off between accuracy and performance. Motivated by this analysis, we present CPR, a partial recovery training system that reduces training time while maintaining the desired level of model accuracy by (1) estimating the benefit of partial recovery, (2) selecting an appropriate checkpoint saving interval, and (3) prioritizing the saving of updates to more frequently accessed parameters. Two variants of CPR, CPR-MFU and CPR-SSU, reduce the checkpoint-related overhead from 8.2–8.5% to 0.53–0.68% relative to full recovery, on a setup emulating the failure pattern and overhead of a production-scale cluster. While substantially reducing overhead, CPR achieves model quality on par with the more expensive full recovery scheme when training a state-of-the-art recommendation model on Criteo's Terabyte CTR dataset. Our results also suggest that CPR can speed up training on a real production-scale cluster without notably degrading accuracy.
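For intuition, below is a minimal Python sketch, not the authors' implementation, of two of the ideas in the abstract: a Young/Daly-style estimate standing in for CPR's interval selection, and MFU-style prioritized saving of frequently accessed embedding rows. All names here (EmbeddingCheckpointer, record_access, top_k) are illustrative assumptions.

import numpy as np

def checkpoint_interval(mtbf_sec, save_cost_sec):
    # Classic Young/Daly first-order approximation of the interval that
    # minimizes expected checkpoint overhead; used here only as a stand-in
    # for CPR's own benefit/interval estimator.
    return (2.0 * mtbf_sec * save_cost_sec) ** 0.5

class EmbeddingCheckpointer:
    """Tracks embedding-row popularity and checkpoints only the hottest rows."""

    def __init__(self, num_rows, dim, top_k):
        self.table = np.zeros((num_rows, dim), dtype=np.float32)
        self.access_counts = np.zeros(num_rows, dtype=np.int64)
        self.top_k = top_k

    def record_access(self, row_ids):
        # Called on every embedding lookup during training.
        np.add.at(self.access_counts, row_ids, 1)

    def save_partial(self):
        # Persist only the top-k most frequently used rows, bounding the
        # checkpoint size and write bandwidth (the MFU prioritization idea).
        hot = np.argsort(self.access_counts)[-self.top_k:]
        snapshot = {int(r): self.table[r].copy() for r in hot}
        self.access_counts[:] = 0  # restart popularity counting
        return snapshot

    def restore_partial(self, snapshot):
        # After a failure, only the failed node restores its shard; rows not
        # in the snapshot keep stale values, the consistency relaxation that
        # partial recovery trades for speed.
        for row, vec in snapshot.items():
            self.table[row] = vec

if __name__ == "__main__":
    ckpt = EmbeddingCheckpointer(num_rows=1000, dim=16, top_k=100)
    ckpt.record_access(np.random.randint(0, 1000, size=5000))
    snapshot = ckpt.save_partial()
    print(f"saved {len(snapshot)} rows; "
          f"suggested interval: {checkpoint_interval(3600, 30):.0f}s")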

The video of this talk cannot be embedded. You can watch it here:
https://slideslive.com/38952730
