Abstract:
Persistent packet loss in the cloud-scale overlay network severely compromises tenant experiences. Cloud providers are keen to automatically and quickly determine the root cause of such problems. However, existing work is either designed for the physical network or insufficient to present the concrete reason of packet loss. In this paper, we propose to record and analyze the on-site forwarding condition of packets during packet-level tracing. The cloud-scale overlay network presents great challenges to achieve this goal with its high network complexity, multi-tenant nature, and diversity of root causes. To address these challenges, we present VTrace, an automatic diagnostic system for persistent packet loss over the cloud-scale overlay network. Utilizing the "fast path-slow path" structure of virtual forwarding devices (VFDs), e.g., vSwitches, VTrace installs several "coloring, matching and logging" rules in VFDs to selectively track the packets of interest and inspect them in depth. The detailed forwarding situation at each hop is logged and then assembled to perform analysis with an efficient path reconstruction scheme. Experiments are conducted to demonstrate VTrace’s low overhead and quick responsiveness. We share experiences of how VTrace efficiently resolves persistent packet loss issues after deploying it in Alibaba Cloud for over 20 months.