Arpit Gupta & Vijay Sivaraman
Kai Chen (HKUST)
The ever-growing AI and ML workloads present unprecedented challenges, as well as opportunities, for designing AI-centric networking for AI clusters. In this presentation, I will first talk about the special characteristics of communication in distributed AI/ML training, and then discuss how to exploit these characteristics in designing next-generation network architectures and protocols for AI and ML workloads.
Kai Chen is a Professor in the CSE Department at HKUST, the Director of the HKUST iSING Lab, and a Principal Investigator for the Hong Kong RGC Theme-based Research Scheme. His main research focuses on data center networking, AI-centric networking, and machine learning systems. His work has been published in top venues such as SIGCOMM and NSDI and adopted in real-world industry systems. Recently, he was named a 2023 ACM Distinguished Member for contributions to the design and implementation of data center networks.
Andrew Chu, Xi Jiang, Shinan Liu, Arjun Bhagoji, Francesco Bronzino, Paul Schmitt, Nick Feamster
12 mins talk + 3 mins Q&A
Lukasz Tulczyjew, Ihor Biruk, Murat Bilgic, Charles Abondo, Nathanael Weill
12 mins talk + 3 mins Q&A
Seongyeon Yoon, Heewon Kim, Hyeonjae Jeong, Chanbin Bae, Haeun Kim, Sangheon Pack
12 mins talk + 3 mins Q&A
Tongzhou Gu, Jiawei Fei, Marco Canini
12 mins talk + 3 mins Q&A
Peirui Cao, Wenxue Cheng, Shizhen Zhao, Yongqiang Xiong
12 mins talk + 3 mins Q&A
Hanchen Li, Yuhan Liu, Yihua Cheng, Siddhant Ray, Kuntai Du, Junchen Jiang
12 mins talk + 3 mins Q&A
Kuo Guo, Jia Chen, Qi Xu, Fei Song, Xu Huang, Shang Liu, Dongsheng Qian, Jun Zhu, Ruyun Zhang, Keping Long
12 mins talk + 3 mins Q&A
Srikanth Sundaresan (Meta)
AI training networks and workloads differ in many ways from classical DC networks; however, many of the challenges we face are familiar ones. In this talk, I give a brief overview of some interesting properties of Meta's AI network clusters, how they ultimately lead back to an age-old problem statement – how do we transfer data quickly and efficiently between endpoints – and how we can adapt lessons learned over the previous decades, and what they mean for AI network and protocol design.
Srikanth Sundaresan is a Software Engineer on Meta's DC Networking team, where he develops observability and troubleshooting techniques to help understand how networks work at scale, so that they can be coaxed to perform just a bit better. Prior to Meta, he had stints at Princeton University and the International Computer Science Institute at Berkeley, and he completed his PhD at Georgia Tech. His research interests span network measurements and protocol design, and, of late, network buffers. He has won several awards for his work, including two ANRP awards and the College of Computing Dissertation Award at Georgia Tech.
Baojia Li, Xiaoliang Wang, Jingzhu Wang, Yifan Liu, Yuanyuan Gong, Hao Lu, Weizhen Dang, Weifeng Zhang, Xiaojie Huang, Mingzhuo Chen, Jie Chen, Chunzhi He, Yadong Liu, Xiaoyuan Hu, Chen Liu, Xuefeng Ji, Yinben Xia, Xiang Li, Zekun He, Yachen Wang, Xianneng Zou
12 mins talk + 3 mins Q&A
Haoyu Song
12 mins talk + 3 mins Q&A
Achref Rebai, Mubarak Adetunji Ojewale, Anees Ullah, Marco Canini, Suhaib A. Fahmy
12 mins talk + 3 mins Q&A
Banruo Liu, Mubarak Adetunji Ojewale, Yuhan Ding, Marco Canini
7 mins talk + 3 mins Q&A
Xinghai Wei, Tingting Yuan, Jie Yuan, Xiaoming Fu
7 mins talk + 3 mins Q&A
Peng Xun, Tao Li, Yulei Yuan, Hui Yang, Cunlu Li
7 mins talk + 3 mins Q&A
Generative AI is transforming many aspects of modern society, with content ranging from text and images to videos. The Large Language Models (LLMs) and other Artificial Intelligence (AI)/Machine Learning (ML) models that enable these generative AI capabilities are placing an unprecedented amount of pressure on modern data centers, with anecdotal evidence suggesting that the largest models can take months to train. To support these models, modern distributed training clusters contain tens of thousands of GPUs/TPUs, and many expect this scale to increase significantly.
More fundamentally, training these large models introduces network communication patterns that require sophisticated and novel topology, routing, and synchronization. As the adoption and use of such models grow, the data generated and the data required for training and inference will place an emphasis on the design of novel network primitives. The scale, workload, and performance requirements force us to reconsider every layer of the network stack and to scrutinize solutions from a holistic perspective. The recent industry initiative, the Ultra Ethernet Consortium (UEC), is actively working on Ethernet-based network optimizations for AI and HPC workloads. The Open Compute Project (OCP) is geared more toward infrastructure support for AI computing. Standards organizations (e.g., the IETF) are also seeking opportunities in networking for AI computing. We believe the networking research community should take a bolder position and bring cutting-edge innovations to this front as well.
The workshop aims to bring together researchers and experts from academia and industry to share the latest research, trends, and challenges in cloud and data center networks for AI computing. We expect it to enrich our understanding of AI workloads, communication patterns, and their impact on networks, and to help the community identify future research directions. We encourage lively debate on issues such as convergence vs. disaggregation, front-end vs. back-end, smart edges vs. programmable cores, and the need for new interconnects, topologies, transports, and routing algorithms and protocols.
Topics of interest include, but are not limited to:
We invite researchers and practitioners to submit original research papers, including position papers on disruptive ideas and early-stage work with the potential to grow into full papers. We accept two types of submissions:
Please submit your paper via https://naic24.hotcrp.com/.