Half-day Tutorial: Software-Driven Network Fabrics for AI Systems
The tutorial will take place at Room 17 Abril (DMUC). Registration and lunch will take place at São Francisco Convent. A bus will depart from the Convent to DMUC at 13h45 and will return at 18h00 so you can enjoy the welcome reception.
- Yilong Geng (Clockwork Systems)
- Balaji Prabhakar (Stanford University & Clockwork Systems)
- Adi Gangidi (Meta)
- Rohit Puri (Meta)
| Time | Topic |
|---|---|
| 14:00 — 14:20 | Overview of tutorial |
| 14:20 — 15:00 | Software-Driven Fabrics |
| 15:00 — 15:15 | Demos of Huygens and On-Ramp in public clouds |
| 15:15 — 15:45 | Introduction to GPU clusters: Traffic characteristics, failure modes and control mechanisms |
| 15:45 — 16:15 | Afternoon coffee break |
| 16:15 — 17:00 | Software-driven AI clusters |
| 17:00 — 17:30 | SD Fabrics using Hardware Support |
| 17:30 — 18:00 | Case studies |
The explosive increase in AI training and inference over the past 5 years, and the attendant demand for high-performance GPU interconnection fabrics, has led to a fundamental reconsideration of all the major components of such fabrics: from the network interface cards (NICs) to the switching layer, from the interconnect technology (Ethernet vs RoCE vs InfiniBand) to the transport protocols (TCP vs multipath RoCEv2 vs RDMA), every aspect of AI networks is undergoing an intense rethinking and redesign. The first phase of this transformation saw interconnection networks designed/assembled for large-scale training of foundational AI models; inference was primarily performed on one or a small number of GPUs. While largely successful, this approach hinged on homogeneity—of GPU type and capability, of network equipment, and even of workloads (training only or inference only). A second phase, now underway, is witnessing a more heterogeneous environment, as traffic from different families of GPUs is mixed onto one network, training and inference are conducted simultaneously, and, indeed, as storage traffic is moved from “frontend” Ethernet networks to “backend” RoCE or InfiniBand networks, bringing its own high-bandwidth traffic into the mix. Furthermore, AI clusters are being virtualized, and “fractional” GPUs are being provisioned to users. Thus, the provisioning and operation of AI clusters is transitioning to a manner very similar to that of conventional cloud computing clusters.
The area of AI networks is at an inflection point—it is an excellent time to take stock of the state of the art in AI fabric design and to stimulate an exploration of future research directions. The first phase of AI network design took a “network-centric” approach to fabric architecture, in the sense that data was moved directly (DMAed) from one GPU to another over the network, with the NICs and switches doing all the work. In this setting, fabric contention—arising from poorly load-balanced fabric paths—is handled by proactive load balancing in the fabric via flowlet switching or packet spraying, while congestion—arising when a link or a receiving NIC is oversubscribed—is handled by hardware congestion control algorithms in the NICs. This approach, with its emphasis on the network’s hardware, is ill-suited to the heterogeneity inherent in the second phase: speed mismatches between GPUs and their associated NICs, and the mixture of training, inference and storage traffic on the backend networks, will thwart its effectiveness. Looking ahead, we envision that transport mechanisms for AI fabrics will include a significant software component—either in the collective communication libraries (CCLs) or in the smart NICs. This trend is the main motivation for the tutorial.
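To make the network-centric mechanisms above concrete, here is a minimal Python sketch (an illustration, not any vendor's data path) contrasting ECMP flow hashing, where every packet of a flow follows one hashed path, with per-packet spraying, where consecutive packets of the same flow take different equal-cost paths. The 5-tuple, uplink count and round-robin policy are all illustrative assumptions.

```python
import hashlib
import itertools

NUM_UPLINKS = 8  # assumed number of equal-cost uplinks

def ecmp_path(five_tuple: tuple) -> int:
    """ECMP: hash the flow 5-tuple once; every packet of the flow follows
    the same path, so a single elephant flow can overload one uplink."""
    digest = hashlib.sha256(repr(five_tuple).encode()).digest()
    return digest[0] % NUM_UPLINKS

_spray = itertools.count()

def sprayed_path(_five_tuple: tuple) -> int:
    """Packet spraying: choose a path per packet (round-robin here),
    spreading one flow over all uplinks at the cost of reordering."""
    return next(_spray) % NUM_UPLINKS

flow = ("10.0.0.1", "10.0.0.2", 49152, 4791, "UDP")  # RoCEv2 rides UDP/4791
print([ecmp_path(flow) for _ in range(4)])     # one path, four times
print([sprayed_path(flow) for _ in range(4)])  # four different paths
```

This is the trade-off at stake: with flow hashing, two large flows can collide on one uplink while others sit idle; spraying balances load per packet but reorders, which is why it requires NIC- or transport-level reordering support.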
We will begin with an overview of interconnection fabrics—switching topologies, congestion control algorithms, and load-balancing algorithms such as ECMP (equal-cost multipathing), DLB (dynamic load balancing) and packet spraying. We will introduce AI training and inference workloads, the peculiarities which distinguish them from conventional cloud computing workloads, and their performance requirements, and delineate the key issues which arise in these contexts. We will then discuss the network-centric approaches employed today and highlight the problems that arise in heterogeneous settings. Software-driven approaches will be introduced, and the central role of clock synchronization in measuring network contention/congestion from the edge (i.e., with minimal to no network support) will be explained. Edge (host- or NIC-based) load balancing and congestion control algorithms will be discussed both in contrast to and in conjunction with network-based approaches. We will conclude the theoretical presentation with case studies from on-premises and public cloud fabrics. Finally, we will present live demos of such systems and invite student groups to explore the different choices afforded by the software-driven approach, contrasting it with hybrid or network-centric approaches by selecting different modes and parameter settings.
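As a small illustration of the edge-based measurement idea, the following sketch assumes sender and receiver clocks are already tightly synchronized (e.g., by a Huygens-like service): with synchronized clocks, the receive timestamp minus the send timestamp is the true one-way delay, and delay above the observed floor can be attributed to queuing. The class name, sample values and pause threshold are illustrative assumptions, not taken from any shipped system.

```python
class EdgeCongestionEstimator:
    """Estimate queuing delay from the edge, with no switch support,
    given synchronized send/receive timestamps (an assumption)."""

    def __init__(self) -> None:
        self.base_owd = float("inf")  # minimum one-way delay seen so far

    def on_packet(self, tx_ts: float, rx_ts: float) -> float:
        owd = rx_ts - tx_ts                      # true OWD, clocks synced
        self.base_owd = min(self.base_owd, owd)  # track propagation floor
        return owd - self.base_owd               # estimated queuing delay

est = EdgeCongestionEstimator()
# (tx, rx) timestamp pairs in seconds; the third packet saw a queue build up
for tx, rx in [(0.0, 10e-6), (1e-3, 1.012e-3), (2e-3, 2.035e-3)]:
    q = est.on_packet(tx, rx)
    if q > 20e-6:  # illustrative threshold for reacting at the sender
        print(f"queuing delay {q * 1e6:.0f} us: back off at the edge")
```

A sender-side controller in the style of On-Ramp could pause transmission whenever this estimate exceeds a threshold, which is one way the "significant software component" above can act without in-network support.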
We envision students, researchers, practitioners and developers all benefitting from the tutorial. Students will learn the basics and, along with researchers, get to explore new research directions. Practitioners and developers will be exposed to new software tools for monitoring and controlling AI fabrics.
Yilong Geng is Co-Founder and CTO of Clockwork Systems. In his PhD thesis, he developed Huygens, a software-based high-precision clock synchronization system, and used it to develop SIMON, a network-tomography-based method for monitoring data center network performance. At Clockwork, he is developing these ideas into a comprehensive software-driven solution for monitoring and controlling CPU clouds as well as AI GPU clusters. His work addresses critical challenges in distributed systems, enhancing both efficiency and reliability for compute-intensive applications. He enjoys working on systems problems with an algorithmic flavor. Yilong earned his Ph.D. in Electrical Engineering from Stanford University and his B.S. from Tsinghua University.
Balaji Prabhakar is VMware Founders Professor of Computer Science at Stanford University. His research interests are in Network Algorithms, Stochastic Network Theory and Societal Networks. He has been a Terman Fellow, a Sloan Foundation Fellow, and is a Fellow of the IEEE and the ACM. He has received the NSF CAREER award, the Erlang Prize from the INFORMS Applied Probability Society, and the Rollo Davidson Prize from the University of Cambridge. He is the inaugural recipient of the IEEE Innovation in Societal Infrastructure Award which recognizes “significant technological achievements and contributions to the establishment, development and proliferation of innovative societal infrastructure systems.” He has received the IEEE Koji Kobayashi Award and the ACM Sigmetrics Award for his work on Computer Networks and Cloud Computing. He has served on the Advisory Board of the Future Urban Mobility Initiative of the World Economic Forum and is a co-recipient of several best paper awards. He is a co-founder of the startups Urban Engines (acquired by Google in 2016) and Clockwork Systems.
Adi Gangidi is a systems engineer at Meta, leading RDMA network design and deployments for AI workloads. His work on network and communication infrastructure has enabled the training of large recommendation systems and the LLaMA series of foundational models at Meta. His areas of focus span data center computing, from compute to networking, and he has worked on both early designs and operational "scale-out" systems. He is excited about working on purpose-built large-scale systems for specialized workloads.
Rohit Puri is a software engineer at Meta Inc., specializing in networking with a focus on switching architecture for data center networks. He is passionate about system performance and solving complex problems. With over 15 years of experience, Rohit has worked on successful products and initiatives, including Cisco ACI Fabric and RoCE deployments in Meta's data centers to support distributed GPU-based AI training workloads.
[1] Geng et al., "Exploiting a natural network effect for scalable, fine-grained clock synchronization", NSDI 2018.
[2] Geng et al., "SIMON: A simple and scalable method for sensing, inference and measurement in data center networks", NSDI 2019.
[3] Liu et al., "Breaking the transience-equilibrium nexus: A new approach to data center packet transport", NSDI 2021.
[4] Gangidi et al., "RDMA over Ethernet for Distributed Training at Meta Scale", SIGCOMM 2024.