2nd Workshop on Networks for AI Computing (NAIC)

Monday, September 8th | Full-day Workshop

Location
The workshop will take place at Room Ines de Castro.

Program
08:00 — 08:45 | Registration
08:45 — 09:00 | Welcoming Remarks
09:00 — 09:55 | Keynote 1: Improving the Performance and Resiliency for Large-Scale Distributed Training | Minlan Yu (Harvard University)
Abstract: The increasing scale of distributed training presents two major challenges. First, synchronization across the network adds significant overhead to training time. Second, failures become more frequent, and a single faulty machine can halt the entire training task for hours, which calls for fast detection and recovery. To tackle these challenges, it is critical to integrate systems and machine learning more closely. Networks should be optimized toward ML-level performance and accuracy rather than relying on network-level metrics. Similarly, fast fault localization, diagnosis, and recovery require coordinating upper-level ML logic with the underlying system data and actions. In this talk, we will show a few examples of such integration from our recent work.
Biography: Minlan Yu is a Gordon McKay Professor at the Harvard School of Engineering and Applied Sciences. She is the assistant director of the SRC/DARPA JUMP 2.0 ACE Center for Evolvable Computing and the co-director of the Harvard Power and AI initiative. She received her B.A. in computer science and mathematics from Peking University and her M.A. and Ph.D. in computer science from Princeton University. She has received the ACM-W Rising Star Award, the NSF CAREER Award, and the ACM SIGCOMM Doctoral Dissertation Award. She has served as PC co-chair for SIGCOMM, NSDI, HotNets, and several other conferences and workshops.
09:55 — 10:30 | Session 1: Taming AI Networking | Chair: Chen Tian
09:55 - 10:10
Why choose when you can have both: Programmable data planes meet programmable optics
Chris Misa (University of Oregon), Matthew Nance-Hall (Scala Computing, Inc.), Reza Rejaie (University of Oregon), Walter Willinger (NIKSUN, Inc.), Ramakrishnan Durairajan (University of Oregon and Link Oregon)
10:10 - 10:20
Sparse Collectives: Exploiting Data Sparsity to Improve Communication Efficiency (Short)
Dhananjaya Wijerathne, Haris Javaid, Guanwen Zhong, Dan Wu, Xing Yuan Kom, Mario Baldi (AMD)
10:20 - 10:30
Codesign of Tensors Encoding And Transcoding: A Building Block For Decentralized AI (Short)
Revant Teotia, Muhammad Haseeb (New York University)
10:30 — 11:00 | Morning Coffee Break
11:00 — 11:40 | Session 2: SimCity for AI Systems | Chair: Qingkai Meng
11:00 - 11:15
NSX: Large-Scale Network Simulation on an AI Server
Sajy Khashab, Hariharan Sezhiyan, Rani Abboud, Alex Normatov, Stefan Kaestle, Eliav Bar-Ilan, Mohammad Nassar, Omer Shabtai, Wei Bai, Matty Kadosh, Jiarong Xing (Rice University), Mark Silberstein (NVIDIA and Technion), T. S. Eugene Ng (Rice University), Ang Chen (NVIDIA and University of Michigan)
11:15 - 11:30
MLSynth: Towards Synthetic ML Traces
Adel Sefiane (NVIDIA and Imperial College London), Alireza Farshin (NVIDIA), Marios Kogias (Imperial College London)
11:30 - 11:40
Simulating LLM training workloads for heterogeneous compute and network infrastructure (Short)
Sumit Kumar, Arjun Temura, Naman Sharma, Ramanjeet Singh (Indraprastha Institute of Information Technology Delhi), Meet Dadhania, Praveen Tammana (Indian Institute of Technology Hyderabad), Satananda Burla, Abed Mohammad Kamaluddin (Marvell Technology Inc.), Rinku Shah (Indraprastha Institute of Information Technology Delhi)
11:40 — 12:45 | Session 3: Optimizing the AI Journey | Chair: Stefano Salsano
11:40 - 11:50
RTT- or Bandwidth-Bound? Demystifying the KV Cache Transfer in Large Language Model Serving (Short)
Shengnan Yue (China Mobile), Mowei Wang (Huawei Technologies), Yu Yan (China Mobile), Weiqiang Cheng (China Mobile), Zihan Jiang (Huawei Technologies), Zhenhui Zhang (China Mobile and Nanjing University)
11:50 - 12:05
Best Paper Award
Chronos: Prescheduled Circuit Switching for LLM Training
Sundararajan Renganathan, Nick McKeown (Stanford University)
12:05 - 12:15
Intent Fuel Station: A RAG-Enhanced Agent Hub for Realizing Networking Intents (Short)
Yunxiao Ma, Peilong Zhang, Hua Li, Yufeng Zhao, Yintan Ai, Hanlin Liu (Inner Mongolia University)
12:15 - 12:30
LAPS: Joint Load Balancing and Congestion Control on Unequal-cost Multi-path Data Center Networks
Ying Wan (Southeast University), Haoyu Song (Futurewei Technologies), Yu Jia, Yunhui Yang (China Mobile), Tao Huang (Purple Mountain Laboratories), Zhikang Chen (Tsinghua University)
12:30 - 12:45
Sparkle: Optimizing the Serverless AIGC Deployment over Crowdsourced Edge Environments
Kaile Zhu, Shihao Shen (Tianjin University), Tuo Zhang (University of Southern California), Xiaofei Wang (Tianjin University), Xiaoliang Wang (Nanjing University), Xin Jiang, Wenyu Wang (Paiou Cloud Computing), Hai Jin (Huazhong University of Science and Technology)
12:45 — 14:00 | Lunch Break
14:05 — 14:55 | Keynote 2: Cross-Layer Innovations in Network Design for AI at Meta | Ying Zhang (Meta)
Abstract: As AI workloads continue to scale at Meta, the network infrastructure must evolve to meet the demanding communication patterns and performance requirements of large-scale distributed training and inference. This talk presents a cross-layer perspective on network innovations tailored for AI systems at Meta, highlighting key research advances that span from datacenter topology design, to congestion control, to multi-tenant collective communication and GPU communication over multiple NICs. Together, these works demonstrate the power of a cross-layer approach—integrating network architecture, system software, and hardware capabilities—to advance the state of AI networking at Meta.
Biography: Ying Zhang is a Senior Engineering Manager at Meta, where she leads an engineering team supporting the world's largest backbone network. She works on large-scale network management problems, and her research interests include software-defined networking, network function virtualization, network monitoring, Internet routing, and network security. She holds 100+ granted US/international patents and has 150+ peer-reviewed publications with over 8000 citations.
15:00 — 15:45 | Session 4: LLMs Without Borders | Chair: Haoyu Song
15:00 - 15:15
A Cloud-Edge Collaborative Inference System for Data-secure LLM Serving
Wenjie Chu, Chunhui Du, Yunfeng Shao (Huawei Technologies)
15:15 - 15:30
LLMs on Edge: Network Traffic Characteristics of Distributed Inference under the Loupe
Philippe Buschmann, Arne Bröring (Siemens AG), Georg Carle (Technical University of Munich), Andreas Blenk (Siemens AG)
15:30 - 15:45
LIFT: Automating Symbolic Execution Optimization with Large Language Models for AI Networks
Ruoxi Wang (Northeastern University), Kun Li, Minghui Xu, Yue Zhang (Shandong University), Kaidi Xu (Drexel University), Chunchi Liu (Huawei Technologies), Xiuzhen Cheng (Shandong University), Yinhao Xiao (Guangdong University of Finance and Economics)
15:45 — 16:15 | Afternoon Coffee Break
16:15 — 17:15 | Session 5: Smart Moves in AI Networking | Chair: Marios Kogias
16:15 - 16:30
Reconfigurability within Collective Communication Algorithms
Rukshani Athapathu, George Porter (University of California San Diego)
16:30 - 16:45
T3P: Topology-Tailored Tensor Parallelism
Saar Ben-Yochana, Chen Avin, Gabriel Scalosub (Ben Gurion University of the Negev)
16:45 - 17:00
Quantifying the Impact of Job Placement and Routing on Network Efficiency in AI Clusters
Dante Van Poucke, Didier Colle, Mario Pickavet, Wouter Tavernier (Ghent University - Imec)
17:00 - 17:15
AMSO-INT: Reinforcement Learning-Driven Dynamic Adaptive In-Band Telemetry
Yuqi Li, Tianyu Chen (Jiangsu University), Vladimir Ciric (University of Nis), Changda Wang (Jiangsu University)
17:15 — 18:00 | Panel Discussion: Will AI Eat the Network… or Just Snack on It? | Chair: Haoyu Song (Futurewei) | Panelists:
- Kai Chen (HKUST)
- Dongsu Han (KAIST)
- Ramana Kompella (Cisco)
- Nick McKeown (Stanford)
- Mark Silberstein (Technion)
- Keyi Zhu (Huawei)
Topic: Network challenges and opportunities in the age of agentic AI and the Internet of Agents

Call for Papers
Generative AI is transforming many aspects of modern society, with content ranging from text and images to videos. The Large Language Models (LLMs) and other Artificial Intelligence (AI)/Machine Learning (ML) models that enable these generative AI capabilities are placing unprecedented pressure on modern data centers, with anecdotal evidence suggesting that the largest models can take months to train. To support these models, modern distributed training clusters contain tens of thousands of GPUs/TPUs, and the scale is widely expected to increase significantly.
More fundamentally, training these large models introduces network communication patterns that demand sophisticated and novel topologies, routing, and synchronization schemes. As the adoption and use of such models grows, the data generated, and the data required for training and inference, will place new emphasis on the design of novel network primitives. The scale, workload, and performance requirements force us to reconsider every layer of the network stack and scrutinize solutions from a holistic perspective. The recent industry initiative, the Ultra Ethernet Consortium (UEC), is actively working on Ethernet-based network optimizations for AI and HPC workloads. The Open Compute Project (OCP) is geared more toward infrastructure support for AI computing. Standards organizations (e.g., the IETF) are also seeking opportunities in networking for AI computing. We believe the networking research community should take a bolder position and bring cutting-edge innovations on this front as well.
The workshop aims to bring together researchers and experts from academia and industry to share the latest research, trends, and challenges in cloud and data center networks for AI computing. We expect it to enrich our understanding of AI workloads, communication patterns, and their impacts on networks, and to help the community identify future research directions. We encourage lively debate on issues such as convergence vs. disaggregation, front-end vs. back-end, smart edges vs. programmable core, and the need for new interconnects, topologies, transports, and routing algorithms and protocols.

Topics of Interest
Topics of interest include, but are not limited to:
- Technologies for RDMA and Ethernet efficiency, performance, security, and extensibility
- Load balancing for distributed learning
- Lossless and loss-tolerant network design
- Host and network integration and coordination
- New transport protocols and congestion control for AI training
- Programmable congestion control
- New network architecture and topologies for AI and HPC
- Non-minimal adaptive routing
- Offloading in SmartNIC/DPU, host hardware, switch
- Scale-out and scale-up network convergence
- Programmable networks for AI workload
- In-network computing techniques and protocols for distributed training and MPI support
- Application-aware networking for AI training and inference
- Collective communication optimization
- Networking for cross-DC learning
- Network optimization for inference
- Convergence of computing, storage, and networking
- Automated and intelligent AI DCN OAM
- LLM for DCN OAM
- Fault prediction, detection, and root cause analysis
- New measurement and telemetry metrics and methods
- Green data center for energy efficiency
- Traffic characterization for AI workload
- Network simulation and benchmarking for AI workload

Submission Instructions
We invite researchers and practitioners to submit original research papers, including position papers on disruptive ideas and early-stage work with the potential to grow into full papers.
Reviewing will be double-blind. Authors must make a good-faith effort to anonymize their submissions. Papers must not include author names or affiliations and must avoid implicitly disclosing the authors' identity (e.g., via self-citation or funding acknowledgments).
We accept two types of submissions:
- Regular research papers of up to 6 pages, excluding references and appendices. Submissions must be original, unpublished work, and not under consideration at another conference or journal. Authors of accepted submissions are expected to present their work at the workshop. Accepted submissions will be included in the workshop proceedings.
- Extended abstracts of up to 2 pages, excluding references, in the same format as the regular papers. These submissions cover early-stage work and position papers still in progress, allowing authors to showcase preliminary ideas and receive early feedback at the workshop. Authors are expected to present their work as a lightning talk and/or poster during the workshop. Authors of accepted submissions will have the option to opt out of including them in the workshop proceedings.
Please submit your paper via https://naic25.hotcrp.com/
Best Paper Award
The NAIC workshop will feature a best paper award sponsored by the SUPER research project, a part of the RESTART research program. See https://www.fondazione-restart.it/ for more details.
Formatting
Submissions must be printable PDF files. When creating your submission, you must use the sigconf proceedings template (two-column format, 10-pt font size) available on the official ACM site. LaTeX submissions should use the acmart.cls template (sigconf option), with the 10-pt font.
Your main LaTeX file should have the following structure:
% use the base acmart.cls
% use the sigconf proceedings template with 10-pt fonts
% the nonacm option removes ACM-related text from the submission
\documentclass[sigconf,nonacm,10pt]{acmart}
% enable page numbers
\settopmatter{printfolios=true}
\begin{document}
\title{...}
\begin{abstract}
...
\end{abstract}
\maketitle % should come after the abstract
% add the paper content here
% use the ACM bibliography style
\bibliographystyle{ACM-Reference-Format}
\bibliography{...}
\end{document}

Important Dates
- Submission deadline: May 22nd, 2025 (Updated)
- Acceptance notification: June 25th, 2025 (Updated)
- Camera-ready deadline: July 23rd, 2025 (Updated)
- Workshop date: September 8th, 2025

Organizers
General Co-Chairs
- Marco Canini (King Abdullah University of Science and Technology)
- Chen Tian (Nanjing University)
- Mario Baldi (NVIDIA)
- Stefano Salsano (University of Rome Tor Vergata)

Steering Committee Co-Chairs
- Theophilus A. Benson (CMU)
- Torsten Hoefler (ETH Zurich)
- TV Lakshman (Nokia Bell Labs)
- Haoyu Song (Futurewei)
- Ying Zhang (Meta)
- Zhi-Li Zhang (University of Minnesota)

Technical Program Committee
- Hitesh Ballani (Microsoft Research)
- Ayan Banerjee (Cisco)
- Brad Beckmann (AMD)
- Didier Colle (IMEC - Ghent University)
- Daniele De Sensi (Sapienza University of Rome)
- Alex Galis (University College London)
- Keqiang He (Shanghai Jiao Tong University)
- Qianyi Huang (Sun Yat-sen University (SYSU))
- Myeongjae Jeon (POSTECH)
- Holger Karl (HPI - University of Potsdam)
- Marios Kogias (Imperial College London)
- Bingyang Liu (Huawei)
- Alan Lo (NVIDIA)
- Pierpaolo Loreti (University of Rome Tor Vergata)
- Qingkai Meng (Nanjing University)
- Shrijeet Mukherjee (Enfabrica)
- Gregorio Procissi (University of Pisa)
- Dario Rossi (Huawei)
- Amedeo Sapio (NVIDIA)
- Muhammad Shahbaz (University of Michigan)
- Giuseppe Siracusano (NEC Laboratories Europe)
- Balázs Sonkoly (Budapest University of Technology and Economics)
- Tao Sun (China Mobile)
- Yinben Xia (Tencent)
- Yongqiang Xiong (Microsoft Research)
- Zhiying Xu (Amazon)
- Zhimeng Yin (City University of Hong Kong)
- Ennan Zhai (Alibaba)
- Shizhen Zhao (Shanghai Jiao Tong University)
- Yang Zhou (UC Davis)