MegaTE: Extending WAN Traffic Engineering to Millions of Endpoints in Virtualized Cloud

Abstract

In today’s virtualized cloud, containers and virtual machines (VMs) are the prevailing methods for deploying applications with different tenant requirements. However, these requirements are at odds with the resource allocation capabilities of conventional networking stacks in wide-area networks (WANs). In particular, existing WAN traffic engineering (TE) systems, which focus on optimizing link utilization and minimizing congestion at the granularity of aggregated traffic flows, are not designed to cater to each individual flow. In this paper, we advocate a radical new approach that extends TE systems to millions of virtual instance endpoints. We propose and implement a first-of-its-kind system, called MegaTE, to satisfy the needs of each fine-grained traffic flow at the virtual instance level. At the core of MegaTE is a paradigm shift from top-down centralized control to bottom-up asynchronous queries in the TE control loop, combined with eBPF-based segment routing on the data plane and TE optimization contraction on the control plane. We evaluate MegaTE using flow-level simulations with production traffic traces. Our results show that MegaTE supports 20× more endpoints than prior work at a similar algorithm run time. MegaTE has been adopted by large-scale public cloud providers. Notably, Tencent has deployed MegaTE in its cloud WAN since December 2022. Our production analysis shows that MegaTE reduces the packet latency of real-time applications by up to 51%.

Publication
To appear in ACM SIGCOMM’24