ACM SIGCOMM 2021, virtually (online)

ACM SIGCOMM 2021 TUTORIAL: Network-Accelerated Distributed Deep Learning

Tutorial Program (subject to changes)

The tutorial has an associated Slack channel for discussions. Click on the link below to visit it. If you're asked to sign in, use the workspace name "" to sign up or sign in.

Go to tutorial Slack channel
  • Friday, August 27th 13:00-17:00 (UTC-4, New York), 19:00-23:00 (UTC+2, Paris)

  • 1:15 pm - 3:00 pm Session I

  • 1:15 pm - 3:00 pm (1h 45m)

    Distributed training overview; When the network is the bottleneck; In-network aggregation: SwitchML.

    Hands-on: Running a distributed DL job; distributed training with SwitchML; the SwitchML API.

  • 3:00 pm - 3:30 pm Coffee/tea Break

  • 3:30 pm - 5:00 pm Session II

  • 3:30 pm - 5:00 pm (1h 30m)

    Sparse collective communication: OmniReduce; Gradient compression: GRACE.

    Hands-on: Distributed training with OmniReduce; Implement and run a gradient compression algorithm with GRACE.

Call For Participation

Distributed Deep Learning (DL) is an important workload that is becoming communication bound as the scale and size of DL models increase. This tutorial will present a range of techniques that are effective at mitigating the network communication bottleneck and accelerating distributed training.


Training Deep Neural Network (DNN) models in parallel on a distributed machine cluster is an emerging, important workload that is increasingly communication bound. To be clear, it remains computationally intensive. But the last seven years have brought a 62× improvement in compute performance, thanks to GPUs and other hardware accelerators. Cloud network deployments have found this pace hard to match, so communication now accounts for a growing share of training time relative to computation. Meanwhile, in pursuit of ever higher accuracy, data scientists continue to enlarge model sizes and complexity as far as the underlying compute and memory capabilities allow.

The MLSys community has recently been tackling the communication challenges in distributed DNN training with approaches ranging from efficient parameter servers [1,2] and scalable collective communication [3,4] to in-network aggregation [5] and gradient compression techniques [6,7]. The overarching goal of these works has been to alleviate communication bottlenecks by reducing the time that workers spend exchanging their local gradients over the network.

In this tutorial, we will present some of the state-of-the-art approaches, primarily focusing on our own work in the area. We hope the tutorial will familiarize attendees with this timely area and stimulate discussions and new ideas.

A rough outline follows:

  • Session I: In the first session, we will provide an introduction to scaling distributed machine learning from a networking-centric perspective. After reviewing basic concepts in distributed DNN training, we will step through different solutions for accelerating network communication. We will start with in-network aggregation, describing SwitchML as an example of co-design of programmable switch-based processing and end-host protocols.

  • Session II: In the second session, we will look into the properties of the traffic and exploit the sparsity of gradient values. We will describe OmniReduce, which evolves the concept of in-network aggregation and focuses on efficient collective operations for sparse data. We will close with lossy gradient compression techniques and the GRACE framework for implementing them.
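To make the ideas in the outline above concrete, here is a minimal, framework-free sketch in Python. It is only an illustration under stated assumptions: the function names are hypothetical and none of them belongs to the SwitchML, OmniReduce, or GRACE APIs. The dense function shows the element-wise sum that gradient aggregation computes across workers. Top-k sparsification is used here as one common example of lossy gradient compression, and the sparse function shows why sparsity helps: only the retained entries need to be exchanged and summed.

```python
from typing import Dict, List


def aggregate_dense(worker_grads: List[List[float]]) -> List[float]:
    """Element-wise sum of equally sized gradient vectors, one per worker."""
    return [sum(values) for values in zip(*worker_grads)]


def top_k_sparsify(grad: List[float], k: int) -> Dict[int, float]:
    """Lossy compression: keep only the k largest-magnitude entries.

    Returns an {index: value} map; all other entries are dropped.
    """
    ranked = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    return {i: grad[i] for i in ranked[:k]}


def aggregate_sparse(sparse_grads: List[Dict[int, float]], size: int) -> List[float]:
    """Sum sparse gradients into a dense vector, touching only retained entries."""
    total = [0.0] * size
    for sparse in sparse_grads:
        for index, value in sparse.items():
            total[index] += value
    return total


if __name__ == "__main__":
    # Hypothetical gradients from two workers (values chosen for illustration).
    grads = [[0.5, -0.125, 0.0, 2.0], [0.25, 0.0625, 0.0, -1.0]]
    print(aggregate_dense(grads))
    compressed = [top_k_sparsify(g, k=2) for g in grads]
    print(aggregate_sparse(compressed, size=4))
```

In this toy example each worker sends two values instead of four, at the cost of dropping the small entries at index 1 from the aggregated result. The systems covered in the tutorial address the same trade-offs at scale, with very different mechanisms.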

Audience Expectations and Prerequisites

Anyone with a basic understanding of networking and machine learning can participate in this tutorial. To benefit from the hands-on sessions, participants should have access to a computer with Docker installed. More specific instructions TBA.


  • Marco Canini


    • Bio:

      Marco does not know what the next big thing will be. But he's sure that our next-gen computing and networking infrastructure must be a viable platform for it and avoid stifling innovation. Marco's research spans a number of areas in computer systems, including distributed systems, large-scale/cloud computing and computer networking with emphasis on programmable networks. His current focus is on designing better systems support for AI/ML and providing practical implementations deployable in the real world.
      Marco is an associate professor in Computer Science at KAUST. Marco obtained his Ph.D. in computer science and engineering from the University of Genoa in 2009 after spending the last year as a visiting student at the University of Cambridge. He was a postdoctoral researcher at EPFL and a senior research scientist at Deutsche Telekom Innovation Labs & TU Berlin. Before joining KAUST, he was an assistant professor at UCLouvain. He also held positions at Intel, Microsoft and Google.

  • Jiawei Fei

    NUDT and KAUST

    • Bio:

      Jiawei works on accelerating distributed machine learning systems by maximizing effective network bandwidth. His interests are mainly in data center networks and programmable switches. Jiawei is a Ph.D. student in Computer Science at NUDT and is currently a visiting student at KAUST.

  • Chen-Yu Ho


    • Bio:

      Chen-Yu works on developing efficient distributed machine learning systems, focusing on alleviating the network bandwidth bottleneck. His effort spans gradient compression techniques and system designs that exploit characteristics of machine learning workloads. Before joining KAUST as a Ph.D. student, Chen-Yu worked on digitalizing handwriting and ancient Chinese calligraphy arts at Academia Sinica, Taiwan.

  • Jacob Nelson

    Microsoft Research

    • Bio:

      Jacob's research explores how emerging datacenter hardware can be used to build faster and more efficient distributed systems. He joined the Systems Research Group at Microsoft Research's Redmond Lab in 2016 after completing his Ph.D. at the Allen School of Computer Science and Engineering at the University of Washington.

  • Amedeo Sapio


    • Bio:

      Amedeo is leading the effort to support in-network computation in programmable switches within Intel. His main research interests include high-speed packet processing, dataplane programming and innovative network services. Before joining Intel, he was a software engineer with the Cisco Data Center Switching Group, a postdoctoral researcher at KAUST and a visiting researcher at Narus. Amedeo obtained his Ph.D. in computer engineering from Politecnico di Torino.


[1] L. Luo, J. Nelson, L. Ceze, A. Phanishayee, and A. Krishnamurthy. PHub: Rack-Scale Parameter Server for Distributed Deep Neural Network Training. In SoCC, 2018.

[2] Y. Jiang, Y. Zhu, C. Lan, B. Yi, Y. Cui, and C. Guo. A Unified Architecture for Accelerating Distributed DNN Training in Heterogeneous GPU/CPU Clusters. In OSDI, 2020.

[3] G. Wang, S. Venkataraman, A. Phanishayee, J. Thelin, N. Devanur, and I. Stoica. Blink: Fast and Generic Collectives for Distributed ML. In MLSys, 2020.

[4] J. Fei, C.-Y. Ho, A. N. Sahu, M. Canini, and A. Sapio. Efficient Sparse Collective Communication and its application to Accelerate Distributed Deep Learning. In SIGCOMM, 2021.

[5] A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D. R. K. Ports, and P. Richtarik. Scaling Distributed Machine Learning with In-Network Aggregation. In NSDI, 2021.

[6] H. Xu, C.-Y. Ho, A. M. Abdelmoniem, A. Dutta, E. H. Bergou, K. Karatsenidis, M. Canini, and P. Kalnis. GRACE: A Compressed Communication Framework for Distributed Machine Learning. In ICDCS, 2021.

[7] A. M. Abdelmoniem, A. Elzanaty, M.-S. Alouini, and M. Canini. An Efficient Statistical-based Gradient Compression Technique for Distributed Training Systems. In MLSys, 2021.