2nd Workshop on Networks for AI Computing (NAIC)

Monday, September 8th | Full-day Workshop

Location

The workshop will take place at Room Ines de Castro.

Program

08:00 — 08:45 | Registration

08:45 — 09:00 | Welcoming Remarks

09:00 — 09:55 | Keynote 1: Improving the Performance and Resiliency for Large-Scale Distributed Training | Minlan Yu (Harvard University)

Abstract: The increasing scale of distributed training presents two major challenges. First, synchronization across the network adds significant overhead to training time. Second, the frequency of failures increases, and a single faulty machine can halt the entire training task for hours, which requires fast detection and recovery. To tackle these challenges, it is critical to enable closer integration of systems and machine learning. Networks should be optimized toward ML-level performance and accuracy, rather than relying on network-level metrics. Similarly, fast fault localization, diagnosis, and recovery require coordinating upper-level ML logic with the underlying system data and actions. In this talk, we will show a few examples of such integration in our recent work.

Biography: Minlan Yu is a Gordon McKay professor at the Harvard School of Engineering and Applied Sciences. She is the assistant director of the SRC/DARPA JUMP 2.0 ACE Center for Evolvable Computing and the co-director of the Harvard Power and AI initiative. She received her B.A. in computer science and mathematics from Peking University and her M.A. and PhD in computer science from Princeton University. She received the ACM-W Rising Star Award, the NSF CAREER award, and the ACM SIGCOMM doctoral dissertation award. She has served as PC co-chair for SIGCOMM, NSDI, HotNets, and several other conferences and workshops.

09:55 — 10:30 | Session 1: Taming AI Networking | Chair: Chen Tian

09:55 - 10:10

Why choose when you can have both: Programmable data planes meet programmable optics

Chris Misa (University of Oregon), Matthew Nance-Hall (Scala Computing, Inc.), Reza Rejaie (University of Oregon), Walter Willinger (NIKSUN, Inc.), Ramakrishnan Durairajan (University of Oregon and Link Oregon)

10:10 - 10:20

Sparse Collectives: Exploiting Data Sparsity to Improve Communication Efficiency (Short)

Dhananjaya Wijerathne, Haris Javaid, Guanwen Zhong, Dan Wu, Xing Yuan Kom, Mario Baldi (AMD)

10:20 - 10:30

Codesign of Tensors Encoding And Transcoding: A Building Block For Decentralized AI (Short)

Revant Teotia, Muhammad Haseeb (New York University)

10:30 — 11:00 | Morning coffee break

11:00 — 11:40 | Session 2: SimCity for AI Systems | Chair: Qingkai Meng

11:00 - 11:15

NSX: Large-Scale Network Simulation on an AI Server

Sajy Khashab, Hariharan Sezhiyan, Rani Abboud, Alex Normatov, Stefan Kaestle, Eliav Bar-Ilan, Mohammad Nassar, Omer Shabtai, Wei Bai, Matty Kadosh, Jiarong Xing (Rice University), Mark Silberstein (NVIDIA and Technion), T. S. Eugene Ng (Rice University), Ang Chen (NVIDIA and University of Michigan)

11:15 - 11:30

MLSynth: Towards Synthetic ML Traces

Adel Sefiane (NVIDIA and Imperial College London), Alireza Farshin (NVIDIA), Marios Kogias (Imperial College London)

11:30 - 11:40

Simulating LLM training workloads for heterogeneous compute and network infrastructure (Short)

Sumit Kumar, Arjun Temura, Naman Sharma, Ramanjeet Singh (Indraprastha Institute of Information Technology Delhi), Meet Dadhania, Praveen Tammana (Indian Institute of Technology Hyderabad), Satananda Burla, Abed Mohammad Kamaluddin (Marvell Technology Inc.), Rinku Shah (Indraprastha Institute of Information Technology Delhi)

11:40 — 12:45 | Session 3: Optimizing the AI Journey | Chair: Stefano Salsano

11:40 - 11:50

RTT- or Bandwidth-Bound? Demystifying the KV Cache Transfer in Large Language Model Serving (Short)

Shengnan Yue (China Mobile), Mowei Wang (Huawei Technologies), Yu Yan (China Mobile), Weiqiang Cheng (China Mobile), Zihan Jiang (Huawei Technologies), Zhenhui Zhang (China Mobile and Nanjing University)

11:50 - 12:05

Best Paper Award

Chronos: Prescheduled Circuit Switching for LLM Training

Sundararajan Renganathan, Nick McKeown (Stanford University)

12:05 - 12:15

Intent Fuel Station: A RAG-Enhanced Agent Hub for Realizing Networking Intents (Short)

Yunxiao Ma, Peilong Zhang, Hua Li, Yufeng Zhao, Yintan Ai, Hanlin Liu (Inner Mongolia University)

12:15 - 12:30

LAPS: Joint Load Balancing and Congestion Control on Unequal-cost Multi-path Data Center Networks

Ying Wan (Southeast University), Haoyu Song (Futurewei Technologies), Yu Jia, Yunhui Yang (China Mobile), Tao Huang (Purple Mountain Laboratories), Zhikang Chen (Tsinghua University)

12:30 - 12:45

Sparkle: Optimizing the Serverless AIGC Deployment over Crowdsourced Edge Environments

Kaile Zhu, Shihao Shen (Tianjin University), Tuo Zhang (University of Southern California), Xiaofei Wang (Tianjin University), Xiaoliang Wang (Nanjing University), Xin Jiang, Wenyu Wang (Paiou Cloud Computing), Hai Jin (Huazhong University of Science and Technology)

12:45 — 14:00 | Lunch Break

14:05 — 14:55 | Keynote 2: Cross-Layer Innovations in Network Design for AI at Meta | Ying Zhang (Meta)

Abstract: As AI workloads continue to scale at Meta, the network infrastructure must evolve to meet the demanding communication patterns and performance requirements of large-scale distributed training and inference. This talk presents a cross-layer perspective on network innovations tailored for AI systems at Meta, highlighting key research advances that span from datacenter topology design, to congestion control, to multi-tenant collective communication and GPU communication over multiple NICs. Together, these works demonstrate the power of a cross-layer approach—integrating network architecture, system software, and hardware capabilities—to advance the state of AI networking at Meta.

Biography: Ying Zhang is a Senior Engineering Manager at Meta, where she leads an engineering team supporting the world's largest backbone network. She works on large-scale network management problems, and her research interests include software-defined networks, network function virtualization, network monitoring, Internet routing, and network security. She holds 100+ granted US/international patents and has 150+ peer-reviewed publications with over 8000 citations.

15:00 — 15:45 | Session 4: LLMs Without Borders | Chair: Haoyu Song

15:00 - 15:15

A Cloud-Edge Collaborative Inference System for Data-secure LLM Serving

Wenjie Chu, Chunhui Du, Yunfeng Shao (Huawei Technologies)

15:15 - 15:30

LLMs on Edge: Network Traffic Characteristics of Distributed Inference under the Loupe

Philippe Buschmann, Arne Bröring (Siemens AG), Georg Carle (Technical University of Munich), Andreas Blenk (Siemens AG)

15:30 - 15:45

LIFT: Automating Symbolic Execution Optimization with Large Language Models for AI Networks

Ruoxi Wang (Northeastern University), Kun Li, Minghui Xu, Yue Zhang (Shandong University), Kaidi Xu (Drexel University), Chunchi Liu (Huawei Technologies), Xiuzhen Cheng (Shandong University), Yinhao Xiao (Guangdong University of Finance and Economics)

15:45 — 16:15 | Afternoon Coffee Break

16:15 — 17:15 | Session 5: Smart Moves in AI Networking | Chair: Marios Kogias

16:15 - 16:30

Reconfigurability within Collective Communication Algorithms

Rukshani Athapathu, George Porter (University of California San Diego)

16:30 - 16:45

T3P: Topology-Tailored Tensor Parallelism

Saar Ben-Yochana, Chen Avin, Gabriel Scalosub (Ben Gurion University of the Negev)

16:45 - 17:00

Quantifying the Impact of Job Placement and Routing on Network Efficiency in AI Clusters

Dante Van Poucke, Didier Colle, Mario Pickavet, Wouter Tavernier (Ghent University - Imec)

17:00 - 17:15

AMSO-INT: Reinforcement Learning-Driven Dynamic Adaptive In-Band Telemetry

Yuqi Li, Tianyu Chen (Jiangsu University), Vladimir Ciric (University of Nis), Changda Wang (Jiangsu University)

17:15 — 18:00 | Panel Discussion: Will AI Eat the Network… or Just Snack on It? | Chair: Haoyu Song (Futurewei) | Panelists:

  • Kai Chen (HKUST)
  • Dongsu Han (KAIST)
  • Ramana Kompella (Cisco)
  • Nick McKeown (Stanford)
  • Mark Silberstein (Technion)
  • Keyi Zhu (Huawei)

Topic: Network challenges and opportunities in the age of agentic AI and the Internet of Agents



Call for Papers

Generative AI is transforming many aspects of modern society with content ranging from text and images to videos. The Large Language Models (LLMs) and other Artificial Intelligence (AI)/Machine Learning (ML) models that enable these generative capabilities are placing an unprecedented amount of pressure on modern data centers, with anecdotal evidence suggesting that the largest models can take months to train. To support these models, modern distributed training clusters contain tens of thousands of GPUs/TPUs, and many expect this scale to increase significantly.


More fundamentally, training these large models introduces network communication patterns that require sophisticated and novel topologies, routing, and synchronization schemes. As the adoption of such models grows, the data generated and the data required for training and inference will place further emphasis on the design of novel network primitives. The scale, workload, and performance requirements force us to reconsider every layer of the network stack and to scrutinize solutions from a holistic perspective. The recent industry initiative, the Ultra Ethernet Consortium (UEC), is actively working on Ethernet-based network optimizations for AI and HPC workloads. The Open Compute Project (OCP) is geared more toward infrastructure support for AI computing. Standards organizations (e.g., the IETF) are also seeking opportunities in networking for AI computing. We believe the networking research community should take a bolder position and bring cutting-edge innovations on this front as well.


The workshop aims to bring together researchers and experts from academia and industry to share the latest research, trends, and challenges in cloud and data center networks for AI computing. We expect it to enrich our understanding of AI workloads, communication patterns, and their impact on networks, and to help the community identify future research directions. We encourage lively debate on issues such as convergence vs. disaggregation, front-end vs. back-end networks, smart edges vs. programmable cores, and the need for new interconnects, topologies, transports, and routing algorithms and protocols.

Topics of Interest

Topics of interest include, but are not limited to:

Submission Instructions

We invite researchers and practitioners to submit original research papers, including position papers on disruptive ideas and early-stage work with the potential to grow into full papers.


Reviewing will be double-blind. Authors must make a good-faith effort to anonymize their submissions. Papers must not include author names or affiliations and must avoid implicitly disclosing the authors' identity (e.g., via self-citation or funding acknowledgments).


We accept two types of submissions:


Please submit your paper via https://naic25.hotcrp.com/

Best Paper Award

The NAIC workshop will feature a best paper award sponsored by the SUPER research project, a part of the RESTART research program. See https://www.fondazione-restart.it/ for more details.

Formatting

Submissions must be printable PDF files. When creating your submission, you must use the sigconf proceedings template (two-column format, 10-pt font size) available on the official ACM site. LaTeX submissions should use the acmart.cls template with the sigconf option and 10-pt font.


Your main LaTeX file should have the following structure:


% use the base acmart.cls
% use the sigconf proceedings template with 10-pt fonts
% nonacm option removes ACM related text in the submission.
\documentclass[sigconf,nonacm,10pt]{acmart}

% enable page numbers
\settopmatter{printfolios=true}

\begin{document}
\title{...}

\begin{abstract}
...
\end{abstract}

\maketitle % should come after the abstract

% add the paper content here

% use the ACM bibliography style
\bibliographystyle{ACM-Reference-Format}
\bibliography{...}

\end{document}

Important Dates

Submission deadline: May 22nd, 2025 (Updated)
Acceptance notification: June 25th, 2025 (Updated)
Camera-ready deadline: July 23rd, 2025 (Updated)
Workshop date: September 8th, 2025

Organizers

General Co-Chairs
Marco Canini, King Abdullah University of Science and Technology
Chen Tian, Nanjing University
Mario Baldi, NVIDIA
Stefano Salsano, University of Rome Tor Vergata

Steering Committee Co-Chairs
Theophilus A. Benson, CMU
Torsten Hoefler, ETH Zurich
TV Lakshman, Nokia Bell Labs
Haoyu Song, Futurewei
Ying Zhang, Meta
Zhi-Li Zhang, University of Minnesota

Technical Program Committee
Hitesh Ballani, Microsoft Research
Ayan Banerjee, Cisco
Brad Beckmann, AMD
Didier Colle, IMEC - Ghent University
Daniele De Sensi, Sapienza University of Rome
Alex Galis, University College London
Keqiang He, Shanghai Jiao Tong University
Qianyi Huang, Sun Yat-sen University (SYSU)
Myeongjae Jeon, POSTECH
Holger Karl, HPI - University of Potsdam
Marios Kogias, Imperial College London
Bingyang Liu, Huawei
Alan Lo, NVIDIA
Pierpaolo Loreti, University of Rome Tor Vergata
Qingkai Meng, Nanjing University
Shrijeet Mukherjee, Enfabrica
Gregorio Procissi, University of Pisa
Dario Rossi, Huawei
Amedeo Sapio, NVIDIA
Muhammad Shahbaz, University of Michigan
Giuseppe Siracusano, NEC Laboratories Europe
Balázs Sonkoly, Budapest University of Technology and Economics
Tao Sun, China Mobile
Yinben Xia, Tencent
Yongqiang Xiong, Microsoft Research
Zhiying Xu, Amazon
Zhimeng Yin, City University of Hong Kong
Ennan Zhai, Alibaba
Shizhen Zhao, Shanghai Jiao Tong University
Yang Zhou, UC Davis