Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads

Abstract: Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs have been increasingly deployed to train deep learning models. These accelerators exhibit heterogeneous performance behavior across model architectures. Existing schedulers for clusters of accelerators, which are used to arbitrate these expensive training resources across many users, have shown how to optimize for various multi-job, multi-user objectives, like fairness and makespan. Unfortunately, existing schedulers largely do not consider performance heterogeneity. In this paper, we propose Gavel, a heterogeneity-aware scheduler that systematically generalizes a wide range of existing scheduling policies. Gavel expresses these policies as optimization problems and then systematically transforms these problems into heterogeneity-aware versions using an abstraction we call effective throughput. Gavel then uses a round-based scheduling mechanism to ensure jobs receive their ideal allocation given the target scheduling policy. Gavel's heterogeneity-aware policies allow a heterogeneous cluster to sustain higher input load, and improve end objectives such as makespan and average job completion time by 1.4x and 3.5x compared to heterogeneity-agnostic policies.

05/04/2021

David Durst, Matthew Feldman, Dillon Huff and
David Akeley, Ross Daly, Gilbert Louis Bernstein, Marco Patrignani, Kayvon Fatahalian, Pat Hanrahan

Baohe Zhang, Raghu Rajan, Luis Pineda and
Nathan Lambert, André Biedenkapp, Kurtland Chua, Frank Hutter, Roberto Calandra

Anush Sankaran, Olivier Mastropietro, Ehsan Saboori and
Yasser Idris, Davis Sawyer, MohammadHossein AskariHemmat, Ghouthi Boukli Hacene

Jie Zhao, Bojie Li, Wang Nie and
Zhen Geng, Renwei Zhang, Xiong Gao, Bin Cheng, Chen Wu, Yun Cheng, Zheng Li, Peng Di, Kun Zhang, Xuefeng Jin

Lingxiao Ma, Zhiqiang Xie, Zhi Yang and
Jilong Xue, Youshan Miao, Wei Cui, Wenxiang Hu, Fan Yang, Lintao Zhang, Lidong Zhou

Ryan Senanayake, Changwan Hong, Ziheng Wang and
Amalee Wilson, Stephen Chou, Shoaib Kamil, Saman Amarasinghe, Fredrik Kjolstad

span, greedy scheduling, interactive applications, performance bounds, priority inversion, responsive parallelism, adaptive scheduling, work, scheduling

19:59

14/06/2020

riemannian optimization, symmetric positive definite (spd) manifolds, optimization-based meta-learning, automatical spd optimizer design, learning to optimize, gradiend-based spd optimization, optimization problems with spd constraints

0:50