Arpit Gupta & Vijay Sivaraman
Kai Chen (HKUST)
The ever-growing AI and ML workloads present unprecedented challenges, as well as opportunities, for designing AI-centric networking for AI clusters. In this presentation, I will first talk about the special characteristics of communication in distributed AI/ML training, and then discuss how to exploit these characteristics in designing next-generation network architectures and protocols for AI and ML workloads.
Kai Chen is a Professor in the CSE Department at HKUST, the Director of the HKUST iSING Lab, and a Principal Investigator for the Hong Kong RGC Theme-based Research Scheme. His main research focuses on data center networking, AI-centric networking, and machine learning systems. His work has been published in top venues such as SIGCOMM and NSDI and adopted in real-world industry systems. Recently, he was named a 2023 ACM Distinguished Member for contributions to the design and implementation of data center networks.
Andrew Chu, Xi Jiang, Shinan Liu, Arjun Bhagoji, Francesco Bronzino, Paul Schmitt, Nick Feamster
12 mins talk + 3 mins Q&A
Lukasz Tulczyjew, Ihor Biruk, Murat Bilgic, Charles Abondo, Nathanael Weill
12 mins talk + 3 mins Q&A
Seongyeon Yoon, Heewon Kim, Hyeonjae Jeong, Chanbin Bae, Haeun Kim, Sangheon Pack
12 mins talk + 3 mins Q&A
Tongzhou Gu, Jiawei Fei, Marco Canini
12 mins talk + 3 mins Q&A
Peirui Cao, Wenxue Cheng, Shizhen Zhao, Yongqiang Xiong
12 mins talk + 3 mins Q&A
Hanchen Li, Yuhan Liu, Yihua Cheng, Siddhant Ray, Kuntai Du, Junchen Jiang
12 mins talk + 3 mins Q&A
Kuo Guo, Jia Chen, Qi Xu, Fei Song, Xu Huang, Shang Liu, Dongsheng Qian, Jun Zhu, Ruyun Zhang, Keping Long
12 mins talk + 3 mins Q&A
Srikanth Sundaresan (Meta)
AI training networks and workloads differ in many ways from classical DC networks; however, many of the challenges we face are familiar ones. In this talk, I give a brief overview of some interesting properties of Meta's AI network clusters, how they ultimately lead back to an age-old problem statement – how do we transfer data quickly and efficiently between endpoints – and how we can adapt lessons learned over the previous decades, and what they mean for AI network and protocol design.
Srikanth Sundaresan is a Software Engineer on Meta's DC Networking team, where he develops observability and troubleshooting techniques to help understand how networks work at scale, so that they can be coaxed to perform just a bit better. Prior to Meta, he had stints at Princeton University and the International Computer Science Institute at Berkeley, and he completed his PhD at Georgia Tech. His research interests span network measurements and protocol design, and, of late, network buffers. He has won several awards for his work, including two ANRP awards and the College of Computing Dissertation Award at Georgia Tech.
Baojia Li, Xiaoliang Wang, Jingzhu Wang, Yifan Liu, Yuanyuan Gong, Hao Lu, Weizhen Dang, Weifeng Zhang, Xiaojie Huang, Mingzhuo Chen, Jie Chen, Chunzhi He, Yadong Liu, Xiaoyuan Hu, Chen Liu, Xuefeng Ji, Yinben Xia, Xiang Li, Zekun He, Yachen Wang, Xianneng Zou
12 mins talk + 3 mins Q&A
Haoyu Song
12 mins talk + 3 mins Q&A
Achref Rebai, Mubarak Adetunji Ojewale, Anees Ullah, Marco Canini, Suhaib A. Fahmy
12 mins talk + 3 mins Q&A
Banruo Liu, Mubarak Adetunji Ojewale, Yuhan Ding, Marco Canini
7 mins talk + 3 mins Q&A
Xinghai Wei, Tingting Yuan, Jie Yuan, Xiaoming Fu
7 mins talk + 3 mins Q&A
Peng Xun, Tao Li, Yulei Yuan, Hui Yang, Cunlu Li
7 mins talk + 3 mins Q&A
Generative AI is transforming many aspects of modern society, with content ranging from text and images to videos. The Large Language Models (LLMs) and other Artificial Intelligence (AI)/Machine Learning (ML) models that enable these generative AI capabilities are placing an unprecedented amount of pressure on modern data centers, with anecdotal evidence suggesting that the largest models can take months to train. To support these models, modern distributed training clusters contain tens of thousands of GPUs/TPUs, and many expect this scale to increase significantly.
More fundamentally, training these large models introduces network communication patterns that require sophisticated and novel topology, routing, and synchronization. As the adoption and use of such models grow, the data generated and the data required for training and inference will place an emphasis on the design of novel network primitives. The scale, workload, and performance requirements force us to reconsider every layer of the network stack and to scrutinize solutions from a holistic perspective. The recent industry initiative, the Ultra Ethernet Consortium (UEC), is actively working on Ethernet-based network optimizations for AI and HPC workloads. The Open Compute Project (OCP) is geared more toward infrastructure support for AI computing. Standards organizations (e.g., the IETF) are also seeking opportunities in networking for AI computing. We believe the networking research community should take a bolder position and bring cutting-edge innovations to this front as well.
The workshop aims to bring together researchers and experts from academia and industry to share the latest research, trends, and challenges in cloud and data center networks for AI computing. We expect it to enrich our understanding of AI workloads, communication patterns, and their impact on networks, and to help the community identify future research directions. We encourage lively debate on issues such as convergence vs. disaggregation, front-end vs. back-end, smart edges vs. programmable cores, and the need for new interconnects, topologies, transports, and routing algorithms and protocols.
Topics of interest include, but are not limited to:
We invite researchers and practitioners to submit original research papers, including position papers on disruptive ideas and early-stage work with the potential to grow into full papers. We accept two types of submissions:
Please submit your paper via https://naic24.hotcrp.com/.