Scouts: Improving the diagnosis process through domain-customized incident routing

11/08/2020

Jiaqi Gao, Nofel Yaseen, Robert MacDavid, Felipe Vieira Frujeri, Vincent Liu, Ricardo Bianchini, Ramaswamy Aditya, Xiaohang Wang, Henry Lee, David Maltz, Minlan Yu, Behnaz Arzani

Keywords: Data center networks, Machine learning, Diagnosis

Abstract: Incident routing is critical for maintaining service level objectives in the cloud: the time-to-diagnosis can increase by 10x due to mis-routings. Properly routing incidents is challenging because of the complexity of today’s data center (DC) applications and their dependencies. For instance, an application running on a VM might rely on a functioning host-server, remote-storage service, and virtual and physical network components. It is hard for any one team, rule-based system, or even machine learning solution to fully learn the complexity and solve the incident routing problem. We propose a different approach using per-team Scouts. Each team’s Scout acts as its gate-keeper: it routes relevant incidents to the team and routes away unrelated ones. We solve the problem through a collection of these Scouts. Our PhyNet Scout alone, currently deployed in production, reduces the time-to-mitigation of 65

The talk and the corresponding paper were published at the SIGCOMM 2020 virtual conference.

Similar Papers

11/08/2020

Flow event telemetry on programmable data plane

Yu Zhou, Chen Sun, Hongqiang Harry Liu, Rui Miao, Shi Bai, Bo Li, Zhilong Zheng, Lingjun Zhu, Zhen Shen, Yongqing Xi, Pengcheng Zhang, Dennis Cai, Ming Zhang, Mingwei Xu

Keywords: monitoring, programmable data plane, Flow event telemetry

18:24

23/08/2020

DeepTriage: Automated transfer assistance for incidents in cloud services

Phuong Pham, Vivek Jain, Lukas Dauterman, Justin Ormont, Navendu Jain

Keywords: incident management, deep learning, transfer assistant, incident transfer, incident triage

10:44

11/08/2020

VTrace: Automatic diagnostic system for persistent packet loss in cloud-scale overlay network

Chongrong Fang, Haoyu Liu, Mao Miao, Jie Ye, Lei Wang, Wansheng Zhang, Daxiang Kang, Biao Lyv, Peng Cheng, Jiming Chen

Keywords: Cloud-scale overlay network, Network diagnosis

18:23

05/04/2021

FirePlace: Placing Firecracker Virtual Machines with Hindsight Imitation

Bharathan Balaji, Christopher Kakovitch, Balakrishnan Narayanaswamy

23:27

05/04/2021

FirePlace: Placing Firecracker Virtual Machines with Hindsight Imitation

Bharathan Balaji, Christopher Kakovitch, Balakrishnan Narayanaswamy

4:45

04/11/2020

Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions

Sebastien Levy, Randolph Yao, Youjiang Wu, Yingnong Dang, Peng Huang, Zheng Mu, Pu Zhao, Tarun Ramani, Naga Govindaraju, Xukun Li, Qingwei Lin, Gil Lapid Shafriri, Murali Chintalapati

18:04

11/08/2020

BeauCoup: Answering many network traffic queries, one memory update at a time

Xiaoqi Chen, Shir Landau-Feibish, Mark Braverman, Jennifer Rexford

Keywords: Distinct Counting, Streaming Algorithm, Network Measurement, Data Plane, Sketching, Programmable Switch

15:07

04/11/2020

Aragog: Scalable Runtime Verification of Shardable Networked Systems

Nofel Yaseen, Behnaz Arzani, Ryan Beckett, Selim Ciraci, Vincent Liu

18:49

13/07/2020

Stratus: Clouds with Microarchitectural Resource Management

Kaveh Razavi, Animesh Trivedi

14:31

06/12/2021

Bandit Quickest Changepoint Detection

Aditya Gopalan, Braghadeesh Lakshminarayanan, Venkatesh Saligrama

Keywords: reinforcement learning and planning, bandits

15:01

03/05/2021

Characterizing signal propagation to close the performance gap in unnormalized ResNets

Andrew Brock, Soham De, Samuel Smith

Keywords: neural networks, ConvNets, deep learning, CNNs, EfficientNets, ResNets, signal propagation, normalizers, ImageNet

5:01

14/06/2020

Weakly Supervised Semantic Point Cloud Segmentation: Towards 10× Fewer Labels

Xun Xu, Gim Hee Lee

Keywords: 3d point cloud, semantic segmentation, weakly supervised learning, semi-supervised learning

1:01

14/06/2020

EcoNAS: Finding Proxies for Economical Neural Architecture Search

Dongzhan Zhou, Xinchi Zhou, Wenwei Zhang, Chen Change Loy, Shuai Yi, Xuesen Zhang, Wanli Ouyang

Keywords: neural architecture search, evaluation proxy, acceleration, evolutionary algorithm, image recognition

1:01

14/09/2020

Learning I/O Access patterns to Improve Prefetching in SSDs

Chandranil Chakraborttii, Heiner Litz

Keywords: prefetching, neural network, flash

14:20

13/07/2020

Model-Switching: Dealing with Fluctuating Workloads in Machine-Learning-as-a-Service Systems

Jeff Zhang, Sameh Elnikety, Shuayb Zarar, Atul Gupta, Siddharth Garg

17:05

26/04/2020

Gap-Aware Mitigation of Gradient Staleness

Saar Barkai, Ido Hakimi, Assaf Schuster

Keywords: distributed, asynchronous, large scale, gradient staleness, staleness penalization, sgd, deep learning, neural networks, optimization

4:36

03/05/2021

MetaNorm: Learning to Normalize Few-Shot Batches Across Domains

Yingjun Du, Xiantong Zhen, Ling Shao, Cees G Snoek

Keywords: batch normalization, Meta-learning, few-shot domain generalization

5:48

15/06/2020

POSH: A Data-Aware Shell

Deepti Raghavan, Sadjad Fouladi, Philip Levis, Matei Zaharia

21:36

26/04/2020

Computation Reallocation for Object Detection

Feng Liang, Chen Lin, Ronghao Guo, Ming Sun, Wei Wu, Junjie Yan, Wanli Ouyang

Keywords: Neural Architecture Search, Object Detection

5:29

14/06/2020

Enhancing Cross-Task Black-Box Transferability of Adversarial Examples With Dispersion Reduction

Yantao Lu, Yunhan Jia, Jianyu Wang, Bai Li, Weiheng Chai, Lawrence Carin, Senem Velipasalar

Keywords: adversarial example, black-box attack, cross tasks, transferability, deep neural network

1:01

05/01/2021

Weakly-Supervised Object Representation Learning for Few-Shot Semantic Segmentation

Xiaowen Ying, Xin Li, Mooi Choo Chuah

5:00

12/07/2020

Non-Stationary Bandits with Intermediate Observations

Claire Vernade, András György, Timothy Mann

Keywords: Online Learning, Active Learning, and Bandits

14:40

14/06/2020

Conditional Channel Gated Networks for Task-Aware Continual Learning

Davide Abati, Jakub Tomczak, Tijmen Blankevoort, Simone Calderara, Rita Cucchiara, Babak Ehteshami Bejnordi

Keywords: continual learning, channel gating, conditional computation, incremental learning, lifelong learning, hard attention

5:01

19/10/2020

LogBug: Generating adversarial system logs in real time

Jingyu Sun, Bingyu Liu, Yuan Hong

Keywords: system log analysis, real time, black-box attack

6:46

06/12/2021

Streaming Belief Propagation for Community Detection

Yuchen Wu, Jakab Tardos, Mohammadhossein Bateni, André Linhares, Filipe Miguel Goncalves de Almeida, Andrea Montanari, Ashkan Norouzi-Fard

14:01

04/11/2020

Toward a Generic Fault Tolerance Technique for Partial Network Partitioning

Mohammed Alfatafta, Basil Alkhatib, Ahmed Alquraan, Samer Al-Kiswany

18:11

02/02/2021

A User-Adaptive Layer Selection Framework for Very Deep Sequential Recommender Models

Lei Chen, Fajie Yuan, Jiaxi Yang, Xiang Ao, Chengming Li, Min Yang

18:18

11/08/2020

Microscope: Queue-based performance diagnosis for network functions

Junzhi Gong, Yuliang Li, Bilal Anwer, Aman Shaikh, Minlan Yu

Keywords: NFV, diagnosis, performance

18:59

26/04/2020

Prediction Poisoning: Towards Defenses Against DNN Model Stealing Attacks

Tribhuvanesh Orekondy, Bernt Schiele, Mario Fritz

Keywords: model functionality stealing, adversarial machine learning

4:53

11/08/2020

Zero downtime release: Disruption-free load balancing of a multi-billion user website

Usama Naseer, Luca Niccolini, Udip Pant, Alan Frindell, Ranjeeth Dasineni, Theophilus A. Benson

Keywords: Reliable networks, Update releases, Load-balancing

17:50

13/04/2021

Continual learning using a bayesian nonparametric dictionary of weight factors

Nikhil Mehta, Kevin Liang, Vinay Kumar Verma, Lawrence Carin

2:58

02/02/2021

Knowledge-aware Leap-LSTM: Integrating Prior Knowledge into Leap-LSTM towards Faster Long Text Classification

Jinhua Du, Yan Huang, Karo Moilanen

19:11

12/08/2020

(Mostly) Exitless VM Protection from Untrusted Hypervisor through Disaggregated Nested Virtualization

Zeyu Mi, Dingji Li, Haibo Chen, Binyu Zang, Haibing Guan

12:07

06/12/2020

Limits on Testing Structural Changes in Ising Models

Aditya Gangrade, Bobak Nazer, Venkatesh Saligrama

3:27

04/11/2020

Virtual Consensus in Delos

Mahesh Balakrishnan, Jason Flinn, Chen Shen, Mihir Dharamshi, Ahmed Jafri, Xiao Shi, Santosh Ghosh, Hazem Hassan, Aaryaman Sagar, Rhed Shi, Jingming Liu, Filip Gruszczynski, Xianan Zhang, Huy Hoang, Ahmed Yossef, Francois Richard, Yee Jiun Song

17:49

14/06/2020

HyperSTAR: Task-Aware Hyperparameters for Deep Networks

Gaurav Mittal, Chang Liu, Nikolaos Karianakis, Victor Fragoso, Mei Chen, Yun Fu

Keywords: auto ml, hyperparameter optimization, meta learning, task aware, hyperband, hyperparameters, warm start, image classification, resnet, shufflenet

4:58

11/08/2020

OmniMon: Re-architecting network telemetry with resource efficiency and full accuracy

Qun Huang, Haifeng Sun, Patrick P. C. Lee, Wei Bai, Feng Zhu, Yungang Bao

Keywords: Distributed systems, Network measurement

20:10

19/01/2020

Abstract Interpretation of Distributed Network Control Planes

Ryan Beckett, Aarti Gupta, Ratul Mahajan, David Walker

Keywords: Abstract Interpretation, Distributed Routing Protocols, Network Control Plane, Network Verification, Router Configuration Verification, Network Simulation, Network Reliability

20:41

04/11/2020

Unearthing inter-job dependencies for better cluster scheduling

Andrew Chung, Subru Krishnan, Konstantinos Karanasos, Carlo Curino, Gregory R. Ganger

18:45

11/08/2020

MasQ: RDMA for virtual private cloud

Zhiqiang He, Dongyang Wang, Binzhang Fu, Kun Tan, Bei Hua, Zhi-Li Zhang, Kai Zheng

Keywords: Network virtualization, RDMA, Datacenter network

19:58