Full-day Tutorial: Ethernet Networking for AI workloads
The tutorial will take place in Room Mondego.
- Mark Handley (OpenAI & University College London)
- Brad Karp (Google & University College London)
- Costin Raiciu (Broadcom & University Politehnica of Bucharest)
| Time | Session |
|---|---|
| 09:00 — 10:30 | Background on AI pre-training |
| 10:30 — 11:00 | Morning coffee break |
| 11:00 — 12:45 | Network topologies, existing transports and performance pathologies. The need for multipath transport: flowlet routing, adaptive switch routing, multipath transport, spraying. |
| 12:45 — 14:00 | Lunch break |
| 14:00 — 15:45 | Transport protocol building blocks for AI - spraying, trimming and congestion signalling (CSIG). Ultra Ethernet Transport overview. |
| 15:45 — 16:15 | Afternoon coffee break |
| 16:15 — 16:30 | Open problems |
| 16:30 — 18:00 | Hands-on simulation session with htsim. |
Given the many differences between AI and existing workloads, it should be no surprise that there are significant industry standardization efforts to address AI workloads, with the goal of deploying solutions in the near term. The main forum for such standardization has been the Ultra Ethernet Consortium (UEC) [11], an industry consortium of dozens of companies representing networking hardware vendors, GPU vendors, cloud service providers, and other stakeholders.
The UEC specification [12] was released in June 2025 and includes a transport protocol (called UET), novel link-layer flow control, and other lower-layer designs. Key differences between UET and deployed data center transports are the adoption of multipath transmission and a departure from the status quo of hardware transports favoring lossless operation. Additionally, switches will optionally support packet trimming and offer fine-grained congestion information in the form of congestion signalling (CSIG) [10].
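To give a flavour of what moving away from lossless operation enables, the sketch below illustrates the general packet-trimming idea (in the spirit of NDP [8], not the exact UET state machine or wire format): when an egress queue overflows, the switch discards the payload but forwards the small header on a control queue, and the receiver NACKs it immediately. The queue limit, structure layout, and handler names are assumptions made purely for exposition.

```cpp
// Simplified sketch of packet trimming as a fast, explicit overload signal.
// Illustrative only: not the UET wire format or switch pipeline.
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <queue>

struct Packet {
    uint64_t psn;              // packet sequence number
    uint32_t payload_bytes;    // 0 after trimming
    bool     trimmed = false;
};

// Switch egress: when the data queue is full, keep the header, drop the
// payload, and forward the tiny trimmed header with high priority.
struct Egress {
    std::queue<Packet> data_q;
    std::queue<Packet> ctrl_q;              // trimmed headers, ACKs, NACKs
    size_t             data_q_limit = 128;  // packets (assumed limit)

    void enqueue(Packet p) {
        if (data_q.size() < data_q_limit) {
            data_q.push(p);
        } else {
            p.trimmed = true;               // payload discarded
            p.payload_bytes = 0;
            ctrl_q.push(p);                 // header still reaches the receiver
        }
    }
};

// Receiver: a trimmed header says exactly which packet was cut, so it can be
// NACKed right away and retransmitted in about one RTT, with no timeout.
struct Receiver {
    template <class NackFn, class DeliverFn>
    void on_packet(const Packet& p, NackFn nack, DeliverFn deliver) {
        if (p.trimmed) nack(p.psn);
        else           deliver(p);
    }
};

int main() {
    Egress sw;
    Receiver rx;
    for (uint64_t psn = 0; psn < 200; ++psn)
        sw.enqueue(Packet{psn, 4096});
    // Packets beyond the 128-packet queue limit arrive trimmed and get NACKed.
    while (!sw.ctrl_q.empty()) {
        rx.on_packet(sw.ctrl_q.front(),
                     [](uint64_t psn) { std::printf("NACK %lu\n", (unsigned long)psn); },
                     [](const Packet&) {});
        sw.ctrl_q.pop();
    }
    return 0;
}
```

The benefit is that overload produces an explicit, low-latency loss signal within roughly one round trip, instead of pausing upstream senders (as PFC does) or waiting for retransmission timeouts.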
Our goal in this tutorial is to simultaneously educate the SIGCOMM audience about the characteristics of AI workloads that motivate such a transport design; discuss the transport design decisions behind UET 1.0; and highlight open areas of research for the networking community in this space. Finally, a hands-on session with a simulator implementation of the UEC transport will help bootstrap work in this space for those interested.
The recent advent of chat assistants (e.g., ChatGPT, Gemini) has sparked a technological drive to advance machine intelligence toward human-level sophistication. The technology at the center of this drive is the large language model (LLM, e.g., the transformer architecture [1]), for which scaling laws [4,5] predict that bigger models perform better. These efforts have led to huge investments in datacenter infrastructure optimized for machine learning, with the goal of supporting ever-larger models. Last year, multiple companies announced infrastructure plans worth hundreds of billions of dollars that will result in datacenters housing millions of GPUs [6].
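As a concrete form of the scaling laws cited above, [4] fits the training loss of a model with N parameters trained on D tokens using a parametric expression of roughly the following shape; the constants are fitted empirically, and it is reproduced here only to show why growing both N and D keeps paying off:

```latex
% Parametric loss fit of the form used in compute-optimal scaling studies [4].
% N = model parameters, D = training tokens, E = irreducible loss,
% A, B, \alpha, \beta = empirically fitted positive constants.
L(N, D) \;=\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```

Minimising this loss under a fixed compute budget (roughly C ≈ 6ND FLOPs) argues for growing parameters and training data together, which is what drives the ever-larger clusters described below.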
The tech industry has previously built data centers hosting a similar number of compute cores. So do these planned builds warrant particular attention from the research community? We argue that the answer is a strong yes, because GPU-based data centers built for ML are optimized for a significantly different workload than existing CPU-based cloud data centers:
- Tightly synchronous compute, where job size = cluster size: While the core count is of the same order of magnitude in CPU and GPU clusters, it is common for an AI training job to use the entire GPU cluster (e.g., hundreds of thousands of cores), whereas full-cluster-scale jobs are less common in CPU data centers. As a consequence, in ML training workloads, tail latency of data center-wide distributed computation is the main limiting factor for performance.
- Two network types instead of one: In GPU clusters, GPUs are connected to a local, higher-speed/smaller-scale scale-up network (e.g., an NVSwitch domain) and to the traditional scale-out data center network. In contrast, past cloud networks have relied only on a single data center-wide network.
- Hardware transport: While earlier CPU-based cloud clusters relied on software-based transport stacks (e.g., TCP/IP), more recent ones use hardware-based transports such as RDMA, as do emerging GPU-based clusters. ML training workloads demand transfer rates of 400-800 Gbps and beyond, vs. the 100-200 Gbps in earlier-generation cloud networks.
- Line-rate flows: It is common in ML training for a single flow to fully consume the NIC’s capacity. CPU-driven flows are often limited by software stack speeds to (typically well below) 100 Gbps. This change affects transport protocol design.
- Low per-host flow-level parallelism: In GPU-based networks, few flows at each host are active simultaneously, whereas in traditional cloud networks, tens of flows are typically active per host. Fewer flows make traffic load balancing across the available paths in the network harder.
- Synchronized flow starts: AI training workloads consist of synchronous computation interspersed with synchronized collective communication operations (e.g., all-reduce), where many nodes start sending at precisely the same time and training progress depends strongly on worst-case tail flow completion time; a back-of-the-envelope illustration of this effect follows this list.
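To make the tail-latency point concrete, here is a back-of-the-envelope model of the communication in a single ring all-reduce step; the cluster size, gradient volume, and link speed below are illustrative assumptions, not measurements from any real deployment.

```cpp
// Back-of-the-envelope model of a ring all-reduce, illustrating why the
// slowest flow gates every training step. Parameter values are assumed.
#include <cstdio>

int main() {
    const double ranks      = 100000;   // GPUs participating (assumed)
    const double grad_bytes = 100e9;    // bytes reduced per step (assumed)
    const double link_gbps  = 800.0;    // per-GPU network bandwidth (assumed)
    const double link_Bps   = link_gbps * 1e9 / 8.0;

    // In a ring all-reduce each rank transmits 2*(N-1)/N * S bytes in total,
    // split into 2*(N-1) sequential chunk exchanges of S/N bytes each.
    double bytes_per_rank = 2.0 * (ranks - 1.0) / ranks * grad_bytes;
    double ideal_seconds  = bytes_per_rank / link_Bps;

    // Every rank must finish every chunk before the next can start, so one
    // straggling flow (congested path, slow retransmit) delays all N ranks.
    std::printf("bytes sent per rank:  %.3e\n", bytes_per_rank);
    std::printf("ideal step comm time: %.3f s (lower bound, zero latency)\n",
                ideal_seconds);
    return 0;
}
```

Because the 2(N-1) chunk exchanges are serialized and every rank takes part in each one, the step cannot finish before the slowest flow of the slowest exchange does, so the transport's worst-case behaviour, not its average, sets training throughput.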
This full-day tutorial will include lectures and a hands-on session. It will begin with an overview of the transformer architecture [1] and mixture-of-experts models [2,3], and discuss what training them entails. It will then cover the collective communication primitives used by AI applications for training, the resulting traffic patterns, and the new types of networks being built for AI.
In the afternoon we will discuss existing transports (in particular RoCEv2) and their problems and provide a deep dive into UET, a sprayed transport recently standardized by the UEC that can use novel switch features including packet trimming. We will also describe Congestion Signaling (CSIG), a new, low-overhead technique through which end hosts can learn about the fine-grained congestion state of a bottleneck switch, and how congestion control can benefit from CSIG information.
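As a preview of the deep dive, the sketch below caricatures two of these ideas: per-packet spraying over many equal-cost paths, and using a fine-grained, receiver-reflected congestion signal to set the sending rate. The path count, the choice of "minimum available bandwidth" as the signal, and the rate-update rule are assumptions for illustration, not the UET or CSIG specifications.

```cpp
// Minimal sketch of per-packet spraying plus reaction to a reflected,
// fine-grained congestion signal. Illustrative assumptions throughout.
#include <algorithm>
#include <cstdint>
#include <cstdio>

struct SprayedSender {
    uint16_t num_paths = 64;      // entropy values / paths (assumed)
    uint16_t next_path = 0;
    double   rate_gbps = 800.0;   // current sending rate
    double   line_rate = 800.0;

    // Per-packet load balancing: successive packets take different paths,
    // so one congested path cannot stall the whole flow.
    uint16_t pick_path() {
        uint16_t p = next_path;
        next_path = static_cast<uint16_t>((next_path + 1) % num_paths);
        return p;
    }

    // The receiver reflects a signal gathered along the path, e.g. the
    // minimum available bandwidth seen at any hop (CSIG-style).
    void on_reflected_signal(double min_available_gbps) {
        // Move the rate toward what the bottleneck can accept, rather than
        // waiting for loss or a one-bit ECN mark.
        rate_gbps = std::clamp(min_available_gbps, 1.0, line_rate);
    }
};

int main() {
    SprayedSender s;
    // Successive packets are assigned to different paths (0, 1, 2, ...).
    for (int i = 0; i < 4; ++i)
        std::printf("path %u\n", (unsigned)s.pick_path());
    // A reflected signal saying only 300 Gbps is available at the bottleneck
    // immediately caps the sending rate.
    s.on_reflected_signal(300.0);
    std::printf("rate %.0f Gbps\n", s.rate_gbps);
    return 0;
}
```

Since successive packets of one flow traverse different paths, the receiver must also tolerate substantial reordering; how UET handles this is one of the topics of the afternoon session.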
Finally, we will highlight open problems that deserve research attention and conclude with a hands-on simulator session where participants can gain first-hand experience with UET.
- Basic understanding of networking including TCP/IP and cloud networks (Clos / Fat Tree topologies).
- A laptop with a clang or GCC compiler to participate in the hands-on tasks.
Costin Raiciu is an Architect in Broadcom’s Core Switching Group and a Professor at University Politehnica of Bucharest.
Mark Handley is Tech Lead for Networking at OpenAI and a Professor at University College London.
Brad Karp is a Visiting Faculty Researcher at Google working on networking for AI workloads and a Professor at University College London.
Costin and Mark have worked on many of the topics relevant to UET including multipath transport (MPTCP [7]), packet spraying, and trimming (NDP [8], EQDS [9]). They were heavily involved in the UET standardization work as part of their industry involvement. Mark Handley was the editor of the congestion control specification for UET, while Costin Raiciu was the editor for the trimming specification. Brad Karp leads Google’s standardization effort at UEC for congestion signaling (CSIG) [10].
[1] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30 (2017).
[2] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23, 120 (2022), 1–39.
[3] Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017).
[4] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, E Buchatskaya, T Cai, E Rutherford, DdL Casas, LA Hendricks, J Welbl, A Clark, et al. An empirical analysis of compute-optimal large language model training. NeurIPS, 35 (2022).
[5] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361 (2020).
[6] Alexandru M. Gherghescu, Vlad-Andrei Bădoiu, Alexandru Agache, Mihai-Valentin Dumitru, Iuliu Vasilescu, Radu Mantu, and Costin Raiciu. I've Got 99 Problems But FLOPS Ain't One. In Proceedings of the 23rd ACM Workshop on Hot Topics in Networks (HotNets '24). ACM, 195–204, 2024.
[7] Costin Raiciu, Sebastien Barre, Christopher Pluntke, Adam Greenhalgh, Damon Wischik, and Mark Handley. Improving datacenter performance and robustness with multipath TCP. In Proceedings of the ACM SIGCOMM 2011 conference (SIGCOMM '11), 266–277. ACM, 2011.
[8] Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W. Moore, Gianni Antichi, and Marcin Wójcik. Re-architecting datacenter networks and stacks for low latency and high performance. In Proceedings of the ACM SIGCOMM 2017 Conference (SIGCOMM '17). ACM, 29–42, 2017.
[9] Vladimir Olteanu, Haggai Eran, Dragos Dumitrescu, Adrian Popa, Cristi Baciu, Mark Silberstein, Georgios Nikolaidis, Mark Handley, and Costin Raiciu. An edge-queued datagram service for all datacenter traffic. 19th USENIX Symposium on Networked Systems Design and Implementation (NSDI 22).
[10] Abhiram Ravi, Nandita Dukkipati, Naoshad Mehta, Jai Kumar. Congestion Signalling (CSIG). IETF working document: draft-ravi-ippm-csig-01 https://www.ietf.org/archive/id/draft-ravi-ippm-csig-01.html
[11] Ultra Ethernet Consortium. Overview of and Motivation for the Forthcoming Ultra Ethernet Consortium Specification. https://ultraethernet.org/wp-content/uploads/sites/20/2023/10/23.07.12-UEC-1.0-Overview-FINAL-WITH-LOGO.pdf
[12] Ultra Ethernet Consortium. Ultra Ethernet Specification v1.0. https://ultraethernet.org/wp-content/uploads/sites/20/2025/06/UE-Specification-6.11.25.pdf