ACM SIGCOMM 2021, virtually (online)

ACM SIGCOMM 2021 TUTORIAL: Traffic Engineering in Cloud WANs

Tutorial Program

The tutorial has an associated Slack channel for discussions. Click on the link below to visit it. If you're asked to sign in, use the workspace name "sigcomm.slack.com" to sign up or sign in.

Go to Tutorial Slack channel
  • Monday, August 23rd 13:40-17:00 (UTC-4, New York), 19:40-23:00 (UTC+2, Paris)

  • 1:40 pm - 3:00 pm Traffic Engineering: Research and Practice

  • 1:40 pm - 2:00 pm

    Introduction to cloud traffic engineering

    Introduction to cloud TE and recent advances in TE research (by Rachee Singh and Nikolaj Bjørner).

  • 2:00 pm - 3:00 pm

    Panel discussion on TE deployments at Google, Facebook and Microsoft.

    Industry practitioners (Chi-Yao Hong from Google, Brandon Schlinker from Facebook, and Umesh Krishnaswamy from Microsoft) discuss key challenges and developments in deployed TE systems.

  • 3:00 pm - 3:30 pm Coffee/tea Break

  • 3:30 pm - 5:00 pm Learning-based TE and Hands-on TE algorithm development

  • 3:30 pm - 4:00 pm

    Learning-based approach to traffic engineering

    Introduction to learning-based traffic engineering approaches by Michael Schapira.

  • 4:00 pm - 5:00 pm

    Hands-on TE algorithm development

    We will provide code to work through the process of implementing important TE algorithms in Python and solving them using realistic network data.

Call For Participation

For over a decade, cloud providers have leveraged traffic engineering (TE) systems and algorithms to improve the efficiency of their wide-area networks (WANs). Google's B4 [1] and Microsoft's SWAN [2] are two important intra-WAN TE systems that have improved the utilization of expensive inter-datacenter links through software-defined, centralized traffic engineering. Since their inception, these systems have continued to evolve in both architecture and functionality to accommodate the changing characteristics of cloud traffic. Both academics and industry professionals have extended the original TE goals by developing TE systems that incorporate resilience to link failures [3], robustness to rapid changes in demand [4], low cost [5,6], and other desirable network characteristics [7].

In this tutorial, we will enable the attendees to get up to speed with the large body of work in cloud traffic engineering. This tutorial will have a hands-on session to develop several TE algorithms. We will provide both starter code in Jupyter notebooks and simulated traffic matrices in the tutorial. Through an invited panel discussion with industry professionals who are actively involved in building and maintaining large-scale TE systems, this tutorial will inform the community of the pressing problems in the space of cloud traffic engineering.

Outline

This tutorial consists of two lectures and two hands-on labs. The first lecture will introduce cloud TE and recent advances in TE systems. The second lecture will cover learning-based approaches to traffic engineering.

Aside from the two lectures, the tutorial will have a live panel discussion with industry practitioners from Google, Facebook, and Microsoft on the challenges faced by large-scale TE deployments today. Attendees are welcome to ask the panelists questions. Finally, the tutorial will end with a hands-on session where we will implement TE algorithms from popular research works in this space.

A rough outline follows:

  • Session I: In this part of the tutorial, we plan to cover the foundations of cloud TE through the introductory lecture. This will be followed by an in-depth discussion of the challenges faced by practitioners who deploy and maintain large TE systems.

  • Session II: In the second part of the tutorial, we will discuss recent learning-based approaches to TE and hold a hands-on session on how to implement and solve state-of-the-art TE algorithms.

Audience Expectations and Prerequisites

The hands-on portion of this tutorial will use Python 3 and CVXPY, a convex optimization library that can be installed with pip. Several days before the tutorial, we will share a GitHub repository with the Jupyter notebooks containing starter code. The hands-on portion assumes that attendees have access to a Jupyter notebook environment, which is widely used and easy to install.

Organizers

  • Rachee Singh

    Microsoft

    • Bio:

      Rachee Singh is a researcher in the office of the CTO at Azure for Operators. Her work improves the cost and performance of wide-area networks across the layers of the networking stack. She holds a Ph.D. from the University of Massachusetts, Amherst and was a 2018 Systems and Networking Google PhD fellow. She was selected as a rising star in networking and communications by N2Women. In a previous life, she implemented routing protocols for Ethernet switches at Arista Networks.


  • Nikolaj Bjørner

    Microsoft Research

    • Bio:

      Nikolaj Bjørner is a researcher at Microsoft Research, Redmond, working in the area of automated theorem proving and Software Engineering. His main line of work is around the SMT solver Z3 with applications in Network Verification. His work around Z3 has been the basis for awards at ACM SIGPLAN, ETAPS, TACAS and the CADE Herbrand Award. He received his Ph.D. degree in computer science from Stanford University.


  • Michael Schapira

    Hebrew University of Jerusalem

    • Bio:

      Michael Schapira is a Professor of Computer Science at the Hebrew University of Jerusalem. Prior to joining Hebrew U, he was a postdoctoral researcher at UC Berkeley, Yale, and Princeton and a visiting scientist at Google NYC. His current research interests lie at the intersection of computer networking and machine learning. He has been awarded the Wolf Foundation's Krill Prize, faculty research awards from Microsoft, Google, and Facebook, the ERC Starting Grant, IETF/IRTF Applied Networking Research Prizes, and the IEEE Communications Society William R. Bennett Prize. He holds a B.Sc. in Math and CS, a B.A. in Humanities, and a PhD in CS from the Hebrew U.


References

[1] Sushant Jain, Alok Kumar, Subhasree Mandal, Joon Ong, Leon Poutievski, Arjun Singh, Subbaiah Venkata, Jim Wanderer, Junlan Zhou, Min Zhu, Jon Zolla, Urs Hölzle, Stephen Stuart, and Amin Vahdat. “B4: Experience with a Globally-deployed Software Defined Wan”. In ACM SIGCOMM (2013).

[2] Chi-Yao Hong, Srikanth Kandula, Ratul Mahajan, Ming Zhang, Vijay Gill, Mohan Nanduri, and Roger Wattenhofer. “Achieving High Utilization with Software-driven WAN”. In ACM SIGCOMM (2013).

[3] Hongqiang Harry Liu, Srikanth Kandula, Ratul Mahajan, Ming Zhang, and David Gelernter. “Traffic engineering with forward fault correction”. In ACM SIGCOMM (2014).

[4] David Applegate and Edith Cohen. “Making Intra-Domain Routing Robust to Changing and Uncertain Traffic Demands: Understanding Fundamental Tradeoffs”. In ACM SIGCOMM (2003).

[5] Rachee Singh, Nikolaj Bjørner, Sharon Shoham, Yawei Yin, John Arnold, and Jamie Gaudette. “Cost-effective capacity provisioning in wide area networks with Shoofly”. In ACM SIGCOMM (2021).

[6] Rachee Singh, Sharad Agarwal, Matt Calder, and Paramvir Bahl. “Cost-effective cloud edge traffic engineering with Cascara”. In USENIX NSDI (2021).

[7] Rachee Singh, Manya Ghobadi, Klaus-Tycho Foerster, Mark Filer, and Phillipa Gill. “RADWAN: Rate Adaptive Wide Area Network”. In ACM SIGCOMM (2018).