SIGCOMM Workshop on Networks for AI Computing (NAIC)

Program

8:30AM

Registration and Breakfast

9:00AM - 9:15AM

Welcome and Opening Remarks

Arpit Gupta & Vijay Sivaraman

9:15AM - 10:00AM

Keynote I: Towards AI-Centric Networking: Challenges and Opportunities

Kai Chen (HKUST)

Abstract:

The ever-growing AI and ML workloads present unprecedented challenges, as well as opportunities, for designing AI-centric networking for AI clusters. In this presentation, I will first discuss the special characteristics of communication in distributed AI/ML training, and then discuss how to exploit these characteristics in designing next-generation network architectures and protocols for AI and ML workloads.

Bio:

Kai Chen is a Professor in the CSE Department at HKUST, the Director of the HKUST iSING Lab, and a Principal Investigator for the Hong Kong RGC Theme-based Research Scheme. His main research focuses on data center networking, AI-centric networking, and machine learning systems. His work has been published in top venues such as SIGCOMM and NSDI and adopted by industry. He was recently named a 2023 ACM Distinguished Member for contributions to the design and implementation of data center networks.

10:00AM - 10:45AM

Session I

  • Feasibility of State Space Models for Network Traffic Generation

    Andrew Chu, Xi Jiang, Shinan Liu, Arjun Bhagoji, Francesco Bronzino, Paul Schmitt, Nick Feamster

    12 mins talk + 3 mins Q&A

  • PCAPVision: PCAP-Based High-Velocity and Large-Volume Network Failure Detection

    Lukasz Tulczyjew, Ihor Biruk, Murat Bilgic, Charles Abondo, Nathanael Weill

    12 mins talk + 3 mins Q&A

  • Multi-task Aware Resource Efficient Traffic Classification via In-Network Inference

    Seongyeon Yoon, Heewon Kim, Hyeonjae Jeong, Chanbin Bae, Haeun Kim, Sangheon Pack

    12 mins talk + 3 mins Q&A

10:45AM - 11:00AM

Coffee break

11:00AM - 12:00PM

Session II

  • OmNICCL: Zero-cost Sparse AllReduce with Direct Cache Access and SmartNICs

    Tongzhou Gu, Jiawei Fei, Marco Canini

    12 mins talk + 3 mins Q&A

  • Network Load Balancing with Parallel Flowlets for AI Training Clusters

    Peirui Cao, Wenxue Cheng, Shizhen Zhao, Yongqiang Xiong

    12 mins talk + 3 mins Q&A

  • Eloquent: A More Robust Transmission Scheme for LLM Token Streaming

    Hanchen Li, Yuhan Liu, Yihua Cheng, Siddhant Ray, Kuntai Du, Junchen Jiang

    12 mins talk + 3 mins Q&A

  • CollaSFC: An Intelligent Collaborative Approach for In-network SFC Failure Detection in Data Center for AI Computing

    Kuo Guo, Jia Chen, Qi Xu, Fei Song, Xu Huang, Shang Liu, Dongsheng Qian, Jun Zhu, Ruyun Zhang, Keping Long

    12 mins talk + 3 mins Q&A

12:00PM - 1:30PM

Lunch

1:30PM - 2:15PM

Keynote II: A Peek into a Large-Scale Production AI Training Network

Srikanth Sundaresan (Meta)

Abstract:

AI training networks and workloads differ in many ways from classical DC networks; however, many of the challenges we face are familiar ones. In this talk, I give a brief overview of some interesting properties of Meta's AI network clusters, how they ultimately lead back to an age-old problem statement (how do we transfer data quickly and efficiently between endpoints?), and how we can adapt lessons learned over the previous decades to AI network and protocol design.

Bio:

Srikanth Sundaresan is a Software Engineer in Meta's DC Networking team, where he develops observability and troubleshooting techniques to help understand how networks work at scale, so that they can be coaxed to perform just a bit better. Prior to Meta, he had stints at Princeton University and the International Computer Science Institute at Berkeley, and he completed his PhD at Georgia Tech. His research interests span network measurements and protocol design, and, of late, network buffers. He has won several awards for his work, including two ANRP awards, and the College of Computing Dissertation Award at Georgia Tech.

2:15PM - 2:45PM

Coffee break

2:45PM - 4:00PM

Session III

  • TCCL: Co-optimizing Collective Communication and Traffic Routing for GPU-centric Clusters

    Baojia Li, Xiaoliang Wang, Jingzhu Wang, Yifan Liu, Yuanyuan Gong, Hao Lu, Weizhen Dang, Weifeng Zhang, Xiaojie Huang, Mingzhuo Chen, Jie Chen, Chunzhi He, Yadong Liu, Xiaoyuan Hu, Chen Liu, Xuefeng Ji, Yinben Xia, Xiang Li, Zekun He, Yachen Wang, Xianneng Zou

    12 mins talk + 3 mins Q&A

  • In-Network AllReduce Optimization with Virtual Aggregation Trees

    Haoyu Song

    12 mins talk + 3 mins Q&A

  • SqueezeNIC: Low-Latency In-NIC Compression for Distributed Deep Learning

    Achref Rebai, Mubarak Adetunji Ojewale, Anees Ullah, Marco Canini, Suhaib A. Fahmy

    12 mins talk + 3 mins Q&A

  • Proof-of-Concept of a Flexible and High-Fidelity Approach to Distributed DNN Training Emulation

    Banruo Liu, Mubarak Adetunji Ojewale, Yuhan Ding, Marco Canini

    7 mins talk + 3 mins Q&A

  • Learning to Communicate Strategically for Efficient Collective Intelligence

    Xinghai Wei, Tingting Yuan, Jie Yuan, Xiaoming Fu

    7 mins talk + 3 mins Q&A

  • Xraytest: An X-ray Test System for Finding Faults of RDMA-NIC Design and Implementation

    Peng Xun, Tao Li, Yulei Yuan, Hui Yang, Cunlu Li

    7 mins talk + 3 mins Q&A

4:00PM - 4:30PM

Coffee Break

6:30PM - 9:00PM

Reception

Call for Papers

Generative AI is transforming many aspects of modern society, with content ranging from text and images to video. The Large Language Models (LLMs) and other Artificial Intelligence (AI)/Machine Learning (ML) models that enable these generative AI capabilities are placing unprecedented pressure on modern data centers, with anecdotal evidence suggesting that the largest models can take months to train. To support these models, modern distributed training clusters contain tens of thousands of GPUs/TPUs, and many expect this scale to increase significantly.

More fundamentally, training these large models introduces network communication patterns that require sophisticated and novel topology, routing, and synchronization. As the adoption and use of such models grow, the data generated, and the data required for training and inference, will place an emphasis on the design of novel network primitives. The scale, workload, and performance requirements force us to reconsider every layer of the network stack and scrutinize solutions from a holistic perspective. The recent industry initiative, the Ultra Ethernet Consortium (UEC), is actively working on Ethernet-based network optimizations for AI and HPC workloads. The Open Compute Project (OCP) is geared more toward infrastructure support for AI computing. Standards organizations (e.g., the IETF) are also seeking opportunities in networking for AI computing. We believe the networking research community should take a bolder position and bring cutting-edge innovations to this front as well.

The workshop aims to bring together researchers and experts from academia and industry to share the latest research, trends, and challenges in cloud and data center networks for AI computing. We expect it to enrich our understanding of AI workloads, communication patterns, and their impacts on networks, and to help the community identify future research directions. We encourage lively debate on issues such as convergence vs. disaggregation, front-end vs. back-end, and smart edges vs. programmable cores, as well as the need for new interconnects, topologies, transports, and routing algorithms and protocols.

Topics of Interest

Topics of interest include, but are not limited to:

Submission Instructions

We invite researchers and practitioners to submit original research papers, including position papers on disruptive ideas and early-stage work with the potential to grow into full papers. We accept two types of submissions:

Please submit your paper via https://naic24.hotcrp.com/.

Important Dates

Organizers

Workshop Chairs
Arpit Gupta UCSB
Xin Jin Peking University
Vijay Sivaraman University of New South Wales
Program Committee
Jonathan Chao NYU
Kai Chen HKUST
Nandita Dukkipati Google
Chen-Yu Ho KAUST
Kevin Hsieh Microsoft
Junchen Jiang U of Chicago
John Kim KAIST
Murali Kodialam Nokia Bell Labs
Dirk Kutscher HKUST (Guangzhou)
Alan Zaoxing Liu U of Maryland
Chen Tian NJU
Stefan Schmid TU Berlin
Muhammad Shahbaz Purdue
Haoyu Song Futurewei
Tushar Swamy Stanford
Jeff Tantsura Nvidia
Mukarram Tariq Google
Marina Thottan AWS
Wenfei Wu Peking University
Qiao Xiang Xiamen University
Yang Xu Fudan University
Ying Zhang Meta
Zhi-Li Zhang U of Minnesota
Local Arrangement Chair
Minzhao Lyu University of New South Wales
Steering Committee
Theophilus A. Benson CMU
Torsten Hoefler ETH Zurich
TV Lakshman Nokia Bell Labs
Haoyu Song Futurewei
Ying Zhang Meta
Zhi-Li Zhang U of Minnesota