13/07/2020

Spotnik: Designing Distributed Machine Learning for Transient Cloud Resources

Marcel Wagenländer, Luo Mai, Guo Li, Peter Pietzuch

Abstract: To achieve higher utilisation, cloud providers offer VMs with GPUs as lower-cost transient cloud resources. Transient VMs can be revoked at short notice and vary in their availability. This poses challenges to distributed machine learning (ML) jobs, which perform long-running stateful computation in which many workers maintain and synchronise model replicas. With transient VMs, existing systems either require a fixed number of reserved VMs or degrade performance when recovering from revoked transient VMs. We believe that future distributed ML systems must be designed from the ground up for transient cloud resources. This paper describes SPOTNIK, a system for training ML models that features a more adaptive design to accommodate transient VMs: (i) SPOTNIK uses an adaptive implementation of the all-reduce collective communication operation. As workers on transient VMs are revoked, SPOTNIK updates its membership and uses the all-reduce ring to recover; and (ii) SPOTNIK supports the adaptation of the synchronisation strategy between workers. This allows a training job to switch between different strategies in response to the revocation of transient VMs. Our experiments show that, after VM revocation, SPOTNIK recovers training within 300 ms for ResNet/ImageNet.
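The ring all-reduce that SPOTNIK adapts can be illustrated with a minimal single-process simulation of the standard algorithm: each of the n workers splits its tensor into n chunks, a reduce-scatter phase circulates partial sums around the ring, and an all-gather phase circulates the finished chunks. This is a sketch for illustration only; the function name `ring_allreduce` and the list-of-chunks representation are assumptions, not SPOTNIK's API, and the real system runs the ring across worker VMs and repairs its membership when VMs are revoked.

```python
def ring_allreduce(inputs):
    """Simulate a sum all-reduce over a ring of n workers.

    inputs[w][c] is chunk c held by worker w; each worker holds n chunks.
    Returns each worker's final copy: all workers end with the global sum.
    """
    n = len(inputs)
    data = [list(chunks) for chunks in inputs]

    # Phase 1: reduce-scatter. In each of n-1 steps, worker w forwards one
    # running partial sum to its ring successor. Afterwards, worker w holds
    # the complete sum of chunk (w + 1) % n.
    for step in range(n - 1):
        for w in range(n):
            c = (w - step) % n        # chunk worker w sends this step
            dst = (w + 1) % n
            data[dst][c] += data[w][c]

    # Phase 2: all-gather. The fully reduced chunks travel once more around
    # the ring, overwriting stale copies, until every worker has every sum.
    for step in range(n - 1):
        for w in range(n):
            c = (w - step + 1) % n    # fully reduced chunk to forward
            dst = (w + 1) % n
            data[dst][c] = data[w][c]
    return data

print(ring_allreduce([[1, 1, 1], [2, 2, 2], [3, 3, 3]]))
# [[6, 6, 6], [6, 6, 6], [6, 6, 6]]
```

Because every worker only ever exchanges data with its ring neighbours, membership changes after a revocation only require re-linking the ring, which is what makes this collective a natural fit for transient VMs.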

The talk and the respective paper are published at the HotCloud 2020 virtual conference.
