2nd Workshop on Networks for AI Computing (NAIC)

Monday, September 8th | Full-day Workshop

Location
The workshop will take place at Room Ines de Castro.

Program
08:00 — 08:45 | Registration
08:45 — 09:00 | Welcoming Remarks
09:00 — 09:55 | Keynote 1: Improving the Performance and Resiliency for Large-Scale Distributed Training | Minlan Yu (Harvard University)
Abstract: The increasing scale of distributed training presents two major challenges. First, synchronization across the network adds significant overhead to training time. Second, failures become more frequent, and a single faulty machine can halt the entire training task for hours, which calls for fast detection and recovery. To tackle these challenges, it is critical to integrate systems and machine learning more closely. Networks should be optimized toward ML-level performance and accuracy rather than relying on network-level metrics. Similarly, fast fault localization, diagnosis, and recovery require coordinating upper-level ML logic with the underlying system data and actions. In this talk, we will show a few examples of such integration from our recent work.
Biography: Minlan Yu is a Gordon McKay Professor at the Harvard School of Engineering and Applied Sciences. She is the assistant director of the SRC/DARPA JUMP 2.0 ACE Center for Evolvable Computing and the co-director of the Harvard Power and AI initiative. She received her B.A. in computer science and mathematics from Peking University and her M.A. and Ph.D. in computer science from Princeton University. She has received the ACM-W Rising Star Award, the NSF CAREER Award, and the ACM SIGCOMM Doctoral Dissertation Award. She has served as PC co-chair for SIGCOMM, NSDI, HotNets, and several other conferences and workshops.
09:55 — 10:30 | Session 1: Taming AI Networking | Chair: Chen Tian
09:55 - 10:10
Why choose when you can have both: Programmable data planes meet programmable optics
Chris Misa (University of Oregon), Matthew Nance-Hall (Scala Computing, Inc.), Reza Rejaie (University of Oregon), Walter Willinger (NIKSUN, Inc.), Ramakrishnan Durairajan (University of Oregon and Link Oregon)
10:10 - 10:20
Sparse Collectives: Exploiting Data Sparsity to Improve Communication Efficiency (Short)
Dhananjaya Wijerathne, Haris Javaid, Guanwen Zhong, Dan Wu, Xing Yuan Kom, Mario Baldi (AMD)
10:20 - 10:30
Codesign of Tensors Encoding And Transcoding: A Building Block For Decentralized AI (Short)
Revant Teotia, Muhammad Haseeb (New York University)
10:30 — 11:00 | Morning Coffee Break
11:00 — 11:40 | Session 2: SimCity for AI Systems | Chair: Qingkai Meng
11:00 - 11:15
NSX: Large-Scale Network Simulation on an AI Server
Sajy Khashab, Hariharan Sezhiyan, Rani Abboud, Alex Normatov, Stefan Kaestle, Eliav Bar-Ilan, Mohammad Nassar, Omer Shabtai, Wei Bai, Matty Kadosh, Jiarong Xing (Rice University), Mark Silberstein (NVIDIA and Technion), T. S. Eugene Ng (Rice University), Ang Chen (NVIDIA and University of Michigan)
11:15 - 11:30
MLSynth: Towards Synthetic ML Traces
Adel Sefiane (NVIDIA and Imperial College London), Alireza Farshin (NVIDIA), Marios Kogias (Imperial College London)
11:30 - 11:40
Simulating LLM training workloads for heterogeneous compute and network infrastructure (Short)
Sumit Kumar, Arjun Temura, Naman Sharma, Ramanjeet Singh (Indraprastha Institute of Information Technology Delhi), Meet Dadhania, Praveen Tammana (Indian Institute of Technology Hyderabad), Satananda Burla, Abed Mohammad Kamaluddin (Marvell Technology Inc.), Rinku Shah (Indraprastha Institute of Information Technology Delhi)
11:40 — 12:45 | Session 3: Optimizing the AI Journey | Chair: Stefano Salsano
11:40 - 11:50
RTT- or Bandwidth-Bound? Demystifying the KV Cache Transfer in Large Language Model Serving (Short)
Shengnan Yue (China Mobile), Mowei Wang (Huawei Technologies), Yu Yan (China Mobile), Weiqiang Cheng (China Mobile), Zihan Jiang (Huawei Technologies), Zhenhui Zhang (China Mobile and Nanjing University)
11:50 - 12:05
Best Paper Award
Chronos: Prescheduled Circuit Switching for LLM Training
Sundararajan Renganathan, Nick McKeown (Stanford University)
12:05 - 12:15
Intent Fuel Station: A RAG-Enhanced Agent Hub for Realizing Networking Intents (Short)
Yunxiao Ma, Peilong Zhang, Hua Li, Yufeng Zhao, Yintan Ai, Hanlin Liu (Inner Mongolia University)
12:15 - 12:30
LAPS: Joint Load Balancing and Congestion Control on Unequal-cost Multi-path Data Center Networks
Ying Wan (Southeast University), Haoyu Song (Futurewei Technologies), Yu Jia, Yunhui Yang (China Mobile), Tao Huang (Purple Mountain Laboratories), Zhikang Chen (Tsinghua University)
12:30 - 12:45
Sparkle: Optimizing the Serverless AIGC Deployment over Crowdsourced Edge Environments
Kaile Zhu, Shihao Shen (Tianjin University), Tuo Zhang (University of Southern California), Xiaofei Wang (Tianjin University), Xiaoliang Wang (Nanjing University), Xin Jiang, Wenyu Wang (Paiou Cloud Computing), Hai Jin (Huazhong University of Science and Technology)
12:45 — 14:00 | Lunch Break
14:05 — 14:55 | Keynote 2: Cross-Layer Innovations in Network Design for AI at Meta | Ying Zhang (Meta)
Abstract: As AI workloads continue to scale at Meta, the network infrastructure must evolve to meet the demanding communication patterns and performance requirements of large-scale distributed training and inference. This talk presents a cross-layer perspective on network innovations tailored for AI systems at Meta, highlighting key research advances that span from datacenter topology design, to congestion control, to multi-tenant collective communication and GPU communication over multiple NICs. Together, these works demonstrate the power of a cross-layer approach—integrating network architecture, system software, and hardware capabilities—to advance the state of AI networking at Meta.
Biography: Ying Zhang is a Senior Engineering Manager at Meta, where she leads an engineering team supporting the world's largest backbone network. She works on large-scale network management problems, and her research interests include software-defined networking, network function virtualization, network monitoring, Internet routing, and network security. She holds 100+ granted US/international patents and has 150+ peer-reviewed publications with over 8000 citations.
15:00 — 15:45 | Session 4: LLMs Without Borders | Chair: Haoyu Song
15:00 - 15:15
A Cloud-Edge Collaborative Inference System for Data-secure LLM Serving
Wenjie Chu, Chunhui Du, Yunfeng Shao (Huawei Technologies)
15:15 - 15:30
LLMs on Edge: Network Traffic Characteristics of Distributed Inference under the Loupe
Philippe Buschmann, Arne Bröring (Siemens AG), Georg Carle (Technical University of Munich), Andreas Blenk (Siemens AG)
15:30 - 15:45
LIFT: Automating Symbolic Execution Optimization with Large Language Models for AI Networks
Ruoxi Wang (Northeastern University), Kun Li, Minghui Xu, Yue Zhang (Shandong University), Kaidi Xu (Drexel University), Chunchi Liu (Huawei Technologies), Xiuzhen Cheng (Shandong University), Yinhao Xiao (Guangdong University of Finance and Economics)
15:45 — 16:15 | Afternoon Coffee Break
16:15 — 17:15 | Session 5: Smart Moves in AI Networking | Chair: Marios Kogias
16:15 - 16:30
Reconfigurability within Collective Communication Algorithms
Rukshani Athapathu, George Porter (University of California San Diego)
16:30 - 16:45
T3P: Topology-Tailored Tensor Parallelism
Saar Ben-Yochana, Chen Avin, Gabriel Scalosub (Ben Gurion University of the Negev)
16:45 - 17:00
Quantifying the Impact of Job Placement and Routing on Network Efficiency in AI Clusters
Dante Van Poucke, Didier Colle, Mario Pickavet, Wouter Tavernier (Ghent University - Imec)
17:00 - 17:15
AMSO-INT: Reinforcement Learning-Driven Dynamic Adaptive In-Band Telemetry
Yuqi Li, Tianyu Chen (Jiangsu University), Vladimir Ciric (University of Nis), Changda Wang (Jiangsu University)
17:15 — 18:00 | Panel Discussion: Will AI Eat the Network… or Just Snack on It? | Chair: Haoyu Song (Futurewei) | Panelists:
- Kai Chen (HKUST)
- Dongsu Han (KAIST)
- Ramana Kompella (Cisco)
- Nick McKeown (Stanford)
- Mark Silberstein (Technion)
- Keyi Zhu (Huawei)
Topic: Network challenges and opportunities in the age of agentic AI and the Internet of Agents

Call for Papers
Generative AI is transforming many aspects of modern society, with content ranging from text and images to videos. The Large Language Models (LLMs) and other Artificial Intelligence (AI)/Machine Learning (ML) models that enable these generative AI capabilities are placing unprecedented pressure on modern data centers, with anecdotal evidence suggesting that the largest models can take months to train. To support these models, modern distributed training clusters contain tens of thousands of GPUs/TPUs, and the scale is widely expected to increase significantly.
More fundamentally, training these large models introduces network communication patterns that demand sophisticated and novel topologies, routing, and synchronization schemes. As the adoption and use of such models grows, the data generated, and the data required for training and inference, will place new emphasis on the design of novel network primitives. The scale, workload, and performance requirements force us to reconsider every layer of the network stack and scrutinize solutions from a holistic perspective. The recent industry initiative, the Ultra Ethernet Consortium (UEC), is actively working on Ethernet-based network optimizations for AI and HPC workloads. The Open Compute Project (OCP) is geared more toward infrastructure support for AI computing. Standards organizations (e.g., the IETF) are also seeking opportunities in networking for AI computing. We believe the networking research community should take a bolder position and bring cutting-edge innovations on this front as well.
The workshop aims to bring together researchers and experts from academia and industry to share the latest research, trends, and challenges in cloud and data center networks for AI computing. We expect it to enrich our understanding of AI workloads, communication patterns, and their impacts on networks, and to help the community identify future research directions. We encourage lively debate on issues such as convergence vs. disaggregation, front-end vs. back-end, smart edges vs. programmable core, and the need for new interconnects, topologies, transports, and routing algorithms and protocols.

Topics of Interest
Topics of interest include, but are not limited to:
- Technologies for RDMA and Ethernet efficiency, performance, security, and extensibility
- Load balancing for distributed learning
- Lossless and loss-tolerant network design
- Host and network integration and coordination
- New transport protocols and congestion control for AI training
- Programmable congestion control
- New network architecture and topologies for AI and HPC
- Non-minimal adaptive routing
- Offloading in SmartNIC/DPU, host hardware, switch
- Scale-out and scale-up network convergence
- Programmable networks for AI workload
- In-network computing techniques and protocols for distributed training and MPI support
- Application-aware networking for AI training and inference
- Collective communication optimization
- Networking for cross-DC learning
- Network optimization for inference
- Convergence of computing, storage, and networking
- Automated and intelligent AI DCN OAM
- LLM for DCN OAM
- Fault prediction, detection, and root cause analysis
- New measurement and telemetry metrics and methods
- Green data center for energy efficiency
- Traffic characterization for AI workload
- Network simulation and benchmarking for AI workload

Submission Instructions
We invite researchers and practitioners to submit original research papers, including position papers on disruptive ideas and early-stage work with the potential to grow into full papers.
Reviewing will be double-blind. Authors must make a good-faith effort to anonymize their submissions. Papers must not include author names or affiliations and must avoid implicitly disclosing the authors' identity (e.g., via self-citation or funding acknowledgments).
We accept two types of submissions:
- Regular research papers of up to 6 pages, excluding references and appendices. Submissions must be original, unpublished work, and not under consideration at another conference or journal. Authors of accepted submissions are expected to present their work at the workshop. Accepted submissions will be included in the workshop proceedings.
- Extended abstracts of up to 2 pages, excluding references, in the same format as the regular papers. These submissions cover early-stage work and position papers still in progress, allowing authors to showcase preliminary ideas and receive early feedback at the workshop. Authors are expected to present their work as a lightning talk and/or poster during the workshop. Authors of accepted submissions will have the option to opt out of including them in the workshop proceedings.
Please submit your paper via https://naic25.hotcrp.com/
Best Paper Award
The NAIC workshop will feature a best paper award sponsored by the SUPER research project, a part of the RESTART research program. See https://www.fondazione-restart.it/ for more details.
Formatting
Submissions must be printable PDF files. When creating your submission, you must use the sigconf proceedings template (two-column format, 10-pt font size) available on the official ACM site. LaTeX submissions should use the acmart.cls template (sigconf option), with the 10-pt font.
Your main LaTeX file should have the following structure:
% use the base acmart.cls
% use the sigconf proceedings template with 10-pt fonts
% the nonacm option removes ACM-related text from the submission
\documentclass[sigconf,nonacm,10pt]{acmart}
% enable page numbers
\settopmatter{printfolios=true}
\begin{document}
\title{...}
\begin{abstract}
...
\end{abstract}
\maketitle % should come after the abstract
% add the paper content here
% use the ACM bibliography style
\bibliographystyle{ACM-Reference-Format}
\bibliography{...}
\end{document}

Important Dates
- Submission deadline: May 22nd, 2025 (Updated)
- Acceptance notification: June 25th, 2025 (Updated)
- Camera-ready deadline: July 23rd, 2025 (Updated)
- Workshop date: September 8th, 2025

Organizers
General Co-Chairs
- Marco Canini (King Abdullah University of Science and Technology)
- Chen Tian (Nanjing University)
- Mario Baldi (NVIDIA)
- Stefano Salsano (University of Rome Tor Vergata)

Steering Committee Co-Chairs
- Theophilus A. Benson (CMU)
- Torsten Hoefler (ETH Zurich)
- TV Lakshman (Nokia Bell Labs)
- Haoyu Song (Futurewei)
- Ying Zhang (Meta)
- Zhi-Li Zhang (University of Minnesota)

Technical Program Committee
- Hitesh Ballani (Microsoft Research)
- Ayan Banerjee (Cisco)
- Brad Beckmann (AMD)
- Didier Colle (IMEC - Ghent University)
- Daniele De Sensi (Sapienza University of Rome)
- Alex Galis (University College London)
- Keqiang He (Shanghai Jiao Tong University)
- Qianyi Huang (Sun Yat-sen University (SYSU))
- Myeongjae Jeon (POSTECH)
- Holger Karl (HPI - University of Potsdam)
- Marios Kogias (Imperial College London)
- Bingyang Liu (Huawei)
- Alan Lo (NVIDIA)
- Pierpaolo Loreti (University of Rome Tor Vergata)
- Qingkai Meng (Nanjing University)
- Shrijeet Mukherjee (Enfabrica)
- Gregorio Procissi (University of Pisa)
- Dario Rossi (Huawei)
- Amedeo Sapio (NVIDIA)
- Muhammad Shahbaz (University of Michigan)
- Giuseppe Siracusano (NEC Laboratories Europe)
- Balázs Sonkoly (Budapest University of Technology and Economics)
- Tao Sun (China Mobile)
- Yinben Xia (Tencent)
- Yongqiang Xiong (Microsoft Research)
- Zhiying Xu (Amazon)
- Zhimeng Yin (City University of Hong Kong)
- Ennan Zhai (Alibaba)
- Shizhen Zhao (Shanghai Jiao Tong University)
- Yang Zhou (UC Davis)