Welcome to the hub for cutting-edge research in computer networks and communications! Here you will find an overview of the ACM SIGCOMM 2025 conference accepted papers.
SIGCOMM'25 Best Paper Award
Mosaic: Breaking the Optics versus Copper Trade-off with a Wide-and-Slow Architecture and MicroLEDs | Technical Session: Hardware
Kaoutar Benyahya, Ariel Gomez Diaz, Junyi Liu, Vassily Lyutsarev, Marianna Pantouvaki, Kai Shi, Shawn Yohanes Siew, Hitesh Ballani, Thomas Burridge, Daniel Cletheroe, Thomas Karagiannis, Brian Robertson, Ant Rowstron, Mengyang Yang (Microsoft Research); Arash Behziz, Jamie Gaudette (Microsoft Azure); Paolo Costa (Microsoft Research)
SIGCOMM'25 Best Student Paper Award
Edge Caching as Differentiation | Technical Session: Measurements
Muhammad Abdullah (EPFL); Mughees Ur Rehman (Virginia Tech); Pavlos Nikolopoulos, Katerina Argyraki (EPFL)
SIGCOMM'25 Best Student Paper Award (Honorable Mention)
Revisiting RDMA Reliability for Lossy Fabrics | Technical Session: Datacenter Networking
Wenxue Li (Hong Kong University of Science and Technology and Huawei); Xiangzhou Liu, Yunxuan Zhang, Zihao Wang (Hong Kong University of Science and Technology); Wei Gu, Tao Qian, Gaoxiong Zeng, Shoushou Ren (Huawei); Xinyang Huang, Zhenghang Ren, Bowen Liu, Junxue Zhang, Kai Chen (Hong Kong University of Science and Technology); Bingyang Liu (Huawei)
Papers Info
Day 2 - Tuesday, September 9, 2025
09:00 — 10:20 | NetAI
Session Chair: Jon Crowcroft
InfiniteHBD: Building Datacenter-Scale High-Bandwidth Domain for LLM with Optical Circuit Switching Transceivers
Abstract: Scaling Large Language Model (LLM) training relies on multi-dimensional parallelism, where High-Bandwidth Domains (HBDs) are critical for communication-intensive parallelism like Tensor Parallelism. However, existing HBD architectures face fundamental limitations in scalability, cost, and fault resiliency: switch-centric HBDs (e.g., NVL-72) incur prohibitive scaling costs, while GPU-centric HBDs (e.g., TPUv3/Dojo) suffer from severe fault propagation. Switch-GPU hybrid HBDs (e.g., TPUv4) take a middle-ground approach, but the fault explosion radius remains large.
We propose InfiniteHBD, a transceiver-centric HBD architecture that integrates connectivity and dynamic switching at the transceiver level by embedding Optical Circuit Switching (OCS) within each transceiver. It enables reconfigurable point-to-multipoint communication and scalable variable-size ring topologies. InfiniteHBD achieves datacenter-scale scalability without cost explosion, fault isolation at the node level, and full bandwidth utilization for healthy GPUs. Key innovations include a Silicon Photonic-based OCS transceiver (OCSTrx), a reconfigurable k-hop ring topology, and an HBD-DCN orchestration algorithm. The evaluation demonstrates that InfiniteHBD reduces cost to 31% of NVL-72, achieves a near-zero GPU waste ratio (over 10x lower than NVL-72 and TPUv4), maintains near-zero cross-ToR traffic under 7% node fault ratio, and improves Model FLOPs Utilization by 3.37x compared to NVIDIA DGX (8 GPUs/node).
DistTrain: Addressing Model and Data Heterogeneity with Disaggregated Training for Multimodal Large Language Models
Abstract: Multimodal large language models (LLMs) empower LLMs to ingest inputs and generate outputs in multiple forms, such as text, image, and audio. However, the integration of multiple modalities introduces heterogeneity in both the model and training data, creating unique systems challenges.
We propose DistTrain, a disaggregated training system for multimodal LLMs. DistTrain incorporates two novel disaggregation techniques to address model and data heterogeneity, respectively. The first is disaggregated model orchestration, which separates the training for modality encoder, LLM backbone, and modality generator. This allows the three components to adaptively and independently orchestrate their resources and parallelism configurations. The second is disaggregated data preprocessing, which decouples data preprocessing from training. This eliminates resource contention between preprocessing and training, and enables efficient data reordering to mitigate stragglers within and between microbatches caused by data heterogeneity. We evaluate DistTrain across different sizes of multimodal LLMs on a large-scale production cluster. The experimental results show that DistTrain achieves 54.7% Model FLOPs Utilization (MFU) when training a 72B multimodal LLM on 1172 GPUs and outperforms Megatron-LM by up to 2.2× on training throughput.
SCX: Stateless KV-Cache Encoding for Cloud-Scale Confidential Transformer Serving
Mu Yuan (The Chinese University of Hong Kong); Lan Zhang (University of Science and Technology of China); Liekang Zeng, Siyang Jiang, Bufang Yang, Di Duan, Guoliang Xing (The Chinese University of Hong Kong)
Abstract: Transformer models have revolutionized fields like natural language processing and computer vision but face privacy concerns in sensitive applications such as medical diagnostics. Existing confidential serving methods, including cryptography-based, memory isolation-based, and access control-based, offer trade-offs between privacy and efficiency but often struggle with high latency or hardware dependencies. This work proposes stateless KV-cache encoding (SCX), a novel framework that encodes the intermediate key-value cache during Transformer inference using user-controlled keys. SCX ensures that the cloud can neither recover the input nor independently complete the next token prediction, effectively preserving privacy. By introducing efficient encoding and decoding schemes, SCX addresses communication complexity and attack vulnerabilities while ensuring zero loss of inference quality. Experiments on large Transformer models demonstrate that SCX achieves lower latency (e.g., 36ms for LLaMA-7B), outperforming state-of-the-art cryptography and memory isolation methods by orders of magnitude. Moreover, SCX can complementarily work with advanced KV-cache management techniques to further enhance KV-cache communication efficiency by 85%, marking a significant step toward practical, privacy-preserving large Transformer serving.
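The abstract does not spell out the encoding itself, but one way to see why a user-keyed, structure-preserving encoding can leave inference quality untouched is that attention scores depend only on inner products: a secret orthogonal rotation applied to both queries and cached keys leaves Q·Kᵀ unchanged. The sketch below is a toy illustration of that general property under our own assumptions, not the actual SCX construction; the key-derivation function is hypothetical.

```python
import numpy as np

def random_orthogonal(d: int, seed: int) -> np.ndarray:
    """Derive a secret rotation from a user-held seed (toy stand-in for a user key)."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q

d, n = 64, 8
user_key = 1234                      # user-controlled secret
R = random_orthogonal(d, user_key)

Q = np.random.randn(n, d)            # queries
K = np.random.randn(n, d)            # cached keys

# Encode the KV cache with the secret rotation before handing it to the cloud;
# the client encodes its queries with the same key.
K_enc = K @ R
Q_enc = Q @ R

scores_plain = Q @ K.T
scores_enc = Q_enc @ K_enc.T         # identical: R @ R.T = I, so inner products survive
assert np.allclose(scores_plain, scores_enc)
```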
ResCCL: Resource-Efficient Scheduling for Collective Communication
Tongrui Liu, Chenyang Hei, Fuliang Li (Northeastern University); Chengxi Gao (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences); Jiamin Cao, Tianshu Wang, Ennan Zhai (Alibaba Cloud); Xingwei Wang (Northeastern University)
Abstract: As distributed deep learning training (DLT) systems scale, collective communication has become a significant performance bottleneck. While current approaches optimize bandwidth utilization and task completion time, existing collective communication library (CCL) backends fail to efficiently manage GPU resources during algorithm execution, limiting the performance of advanced algorithms. This paper proposes ResCCL, a novel CCL backend designed for Resource-Efficient Scheduling to address key limitations in current systems. ResCCL enhances execution efficiency by optimizing scheduling at the primitive level (e.g., send, recvReduceCopy, etc.), enabling flexible thread block (TB) allocation, and generating lightweight communication kernels to minimize runtime overhead. Our approach tackles the global scheduling problem, reduces idle TB resources, and enhances communication bandwidth. Evaluation results demonstrate that ResCCL achieves up to 2.5× improvement in bandwidth performance compared to both NCCL and MSCCL. It reduces SM resource overhead by 77.8% and increases TB utilization by 41.6% while running the same algorithms. In end-to-end DLT, ResCCL boosts Megatron's throughput by up to 39%.
09:00 — 10:20 | Datacenter Networking
Session Chair: Paolo Costa
Albatross: A Containerized Cloud Gateway Platform with FPGA-accelerated Packet-level Load Balancing
Jianyuan Lu (Alibaba Cloud); Shunmin Zhu (Hangzhou Feitian Cloud and Alibaba Cloud); Jun Liang, Yuxiang Lin, Tian Pan, Yisong Qiao, Yang Song, Wenqiang Su, Yixin Xie, Yanqiang Li, Enge Song, Shize Zhang, Xiaoqing Sun, Rong Wen, Xionglie Wei, Biao Lyu (Alibaba Cloud); Xing Li (Zhejiang University and Alibaba Cloud)
Abstract: Alibaba Cloud's centralized gateways relied heavily on high-capacity switching ASICs, but the abrupt halt of Tofino chip evolution in Jan 2023 forced us to seek alternatives that can meet the requirements of performance, supply-chain security, code reuse, and resource efficiency. After evaluating multiple options, we developed Albatross, our 3rd gen cloud gateway based on FPGA and x86 CPUs. Albatross delivers FPGA-based packet-level load balancing to the host CPUs to prevent CPU core overload, manages large reorder buffers under high-latency jitters (100 μs) during complex cloud service processing, and resolves head-of-line (HOL) blocking from packet losses or software exceptions in CPUs. To avoid being overloaded by heavy hitters due to anomalies or attacks, it also implements a two-stage rate limiter for millions of tenants with only 2MB of FPGA memory. To maximize resource utilization, Albatross uses containerization to host multiple gateway instances and designs a BGP proxy to lessen the BGP peering overhead on uplink switches caused by high-density container deployments. After hundreds of man-months of development, a single Albatross node can process 80–120 Mpps of cloud network traffic with an average latency of 20 μs, reducing gateway and sandbox infra costs by 50%.
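The abstract mentions a two-stage rate limiter for millions of tenants in only 2 MB of FPGA memory but does not describe the hardware design. As a rough illustration of the general two-stage idea, the software sketch below uses a coarse stage of hashed group buckets followed by exact per-tenant buckets that are consulted only when a group turns hot; the structure and all rates are hypothetical, not Albatross's FPGA implementation.

```python
import time

class TokenBucket:
    def __init__(self, rate_pps: float, burst: float):
        self.rate, self.burst = rate_pps, burst
        self.tokens, self.last = burst, time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# Stage 1: a small number of coarse buckets shared by hashed groups of tenants.
# Stage 2: exact per-tenant buckets, consulted only when stage 1 flags a hot group.
coarse = [TokenBucket(rate_pps=500_000, burst=50_000) for _ in range(1024)]
per_tenant: dict[int, TokenBucket] = {}

def admit(tenant_id: int) -> bool:
    if coarse[hash(tenant_id) % len(coarse)].allow():
        return True
    # Group is hot: fall back to an exact per-tenant limiter to isolate the heavy hitter.
    bucket = per_tenant.setdefault(tenant_id, TokenBucket(rate_pps=50_000, burst=5_000))
    return bucket.allow()
```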
Revisiting RDMA Reliability for Lossy Fabrics
Wenxue Li (Hong Kong University of Science and Technology and Huawei); Xiangzhou Liu, Yunxuan Zhang, Zihao Wang (Hong Kong University of Science and Technology); Wei Gu, Tao Qian, Gaoxiong Zeng, Shoushou Ren (Huawei); Xinyang Huang, Zhenghang Ren, Bowen Liu, Junxue Zhang, Kai Chen (Hong Kong University of Science and Technology); Bingyang Liu (Huawei)
Abstract: Due to the high operational complexity and limited deployment scale of lossless RDMA networks, the community has been exploring efficient RDMA communication over lossy fabrics. State-of-the-art (SOTA) lossy RDMA solutions implement a simplified selective repeat mechanism in RDMA NICs (RNICs) to enhance loss recovery efficiency. However, they still face performance challenges, such as unavoidable ECMP collision and excessive retransmission timeouts (RTOs). In this paper, we revisit RDMA reliability with the goals of being independent of PFC, compatible with packet-level load balancing, free from RTO, and friendly to hardware offloading. To this end, we propose DCP, a transport architecture that co-designs both the switch and RNICs, fully meeting the design goals. At its core, DCP-Switch introduces a simple yet effective lossless control plane, which is leveraged by DCP-RNIC to enhance reliability support for high-speed lossy fabrics, primarily including the header-only-based retransmission and bitmap-free packet tracking. We prototype DCP-Switch using Tofino2 and DCP-RNIC using FPGA. Extensive experiments demonstrate that DCP achieves 1.6× and 2.1× performance improvements, compared to SOTA lossless and lossy RDMA solutions, respectively.
Software-based Live Migration for RDMA
Xiaoyu Li (Tsinghua University/Microsoft Research); Ran Shu, Yongqiang Xiong (Microsoft Research); Fengyuan Ren (Tsinghua University)
Abstract: Live migration is critical to ensure services are not interrupted during host maintenance in data centers. Meanwhile, RDMA has been widely adopted in data centers and has attracted attention from both academia and industry for years. However, live migration of RDMA is not supported in today’s data centers. Although modifying RDMA NICs (RNICs) to be aware of live migration has been proposed for years, there is no sign of supporting it on commodity RNICs. This paper proposes MigrRDMA, a software-based RDMA live migration approach that does not rely on any extra hardware support. MigrRDMA provides a software indirection layer to achieve transparent switching to new RDMA communications. Unlike previous RDMA virtualization that provides sharing and isolation, MigrRDMA’s indirection layer focuses on keeping the RDMA states on the migration source and destination identical from the perspective of applications. We implemented a MigrRDMA prototype over Mellanox RNICs. Our evaluation shows that MigrRDMA adds little downtime when migrating a container with live RDMA connections running at line rate. Besides, the MigrRDMA virtualization layer only adds 3% ∼ 9% extra overhead in the data path. When migrating Hadoop tasks, MigrRDMA only incurs an extra 3 seconds of job completion time.
ByteDance Jakiro: Enabling RDMA and TCP over Virtual Private Cloud
Yirui Liu, Lidong Jiang, Deguo Li, Daxiang Kang, Zhaoyang Wei, Yuqi Chai, Bin Niu, Ke Lin, Xiaoning Ding, Jianwen Pi, Hao Luo (ByteDance)
Abstract: A Virtual Private Cloud that enables both RDMA and TCP provides advantages for both tenants and cloud providers. It serves the flexible RDMA and TCP demands of tenant applications while delivering a cost-effective solution compared to the construction of two distinct overlay networks. In this study, we introduce Jakiro, an innovative vNIC design framework that supports both RDMA and TCP within ByteDance Cloud. Jakiro holds the capability to support fundamental VPC features, such as QoS and security groups, for both RDMA and TCP streams while maintaining compatibility with applications and intra-host RDMA optimization techniques. We benchmark Jakiro’s performance using basic test cases as well as real-world high-performance computing applications and distributed machine learning training. The results indicate that the RDMA performance of Jakiro is close to that of physical RDMA. Concurrently, Jakiro guarantees weighted fair QoS between RDMA and TCP. Jakiro has been deployed in ByteDance Cloud for one year; we share our critical design and deployment decisions, as well as experiences and lessons from production.
10:50 — 12:10 | Measurements
Session Chair: Alex C. Snoeren
LeoCC: Making Internet Congestion Control Robust to LEO Satellite Dynamics
Zeqi Lai, Zonglun Li, Qian Wu, Hewu Li (Tsinghua University); Jihao Li (Zhongguancun Laboratory); Xin Xie, Yuanjie Li, Jun Liu, Jianping Wu (Tsinghua University)
Abstract: The recent renaissance of low Earth orbit (LEO) satellite networks expands the boundaries of global Internet access, but also introduces substantial new challenges for existing end-to-end congestion control algorithms (CCAs). The rapid and continuous movement of LEO satellites leads to infrastructure-level dynamics, resulting in frequent, LEO-dynamics-induced changes in link capacity, delay, and packet loss rate, which can further mislead the rate control in existing CCAs and cause self-limited performance.
This paper presents LeoCC, a novel CCA that addresses the above challenges and is robust to LEO satellite dynamics. The core idea behind LeoCC lies in a critical characteristic of emerging LEO networks called “connection reconfiguration”, which implicitly reflects satellite path changes and is strongly correlated to network variations. Specifically, LeoCC employs a suite of new techniques to: (i) efficiently detect reconfiguration on the endpoint; (ii) apply a reconfiguration-aware model to characterize and estimate network conditions accurately; and (iii) precisely regulate the sending rate. We implement LeoCC in the Linux kernel and evaluate its performance through extensive experiments conducted in both real LEO networks and a controlled lab environment. The results show that LeoCC can achieve the best throughput-delay balance under various LEO network conditions as compared to other existing CCAs: it achieves 85∼494% higher throughput than Cubic, Copa, and BBRv3, and delivers 44∼56% lower delay than BBRv1 and VIVACE.
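The abstract's key signal is "connection reconfiguration," detected on the endpoint and used to reset the sender's view of the path. As a hedged, highly simplified illustration of that control flow (not LeoCC's actual detector or estimator), the sketch below flags a reconfiguration when the observed RTT shifts abruptly relative to the path minimum and then discards stale bandwidth samples.

```python
class ReconfigAwareEstimator:
    """Toy sketch: reset path estimates when an LEO path reconfiguration is suspected."""

    def __init__(self, rtt_shift_ms: float = 5.0):
        self.rtt_shift_ms = rtt_shift_ms   # crude heuristic threshold, illustrative only
        self.min_rtt_ms = None
        self.bw_samples = []               # recent delivery-rate samples (Mbps)

    def on_ack(self, rtt_ms: float, delivery_rate_mbps: float) -> None:
        if self.min_rtt_ms is not None and abs(rtt_ms - self.min_rtt_ms) > self.rtt_shift_ms:
            # Suspected satellite handover / path change: old samples no longer describe the path.
            self.min_rtt_ms = rtt_ms
            self.bw_samples = []
        self.min_rtt_ms = rtt_ms if self.min_rtt_ms is None else min(self.min_rtt_ms, rtt_ms)
        self.bw_samples = (self.bw_samples + [delivery_rate_mbps])[-16:]

    def pacing_rate_mbps(self) -> float:
        # Pace at the best rate seen on the *current* path configuration.
        return max(self.bw_samples) if self.bw_samples else 1.0
```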
Censys: A Map of Internet Hosts and Services
Zakir Durumeric, Hudson Clark, Jeff Cody, Elliot Cubit, Matt Ellison, Liz Izhikevich, Ariana Mirian (Censys Inc.)
Abstract: In 2015, we released Censys to lower the barrier to entry for researchers to study Internet devices by continually collecting and packaging Internet scan data. Since then, as we have learned more about how best to capture the complex behavior of Internet services and begun to serve commercial and government users, we have re-architected every aspect of how Censys operates. Motivated by requests from the community, we present Censys' evolution and current architecture, evaluate its visibility, and detail how Censys has been used by research, industry, and government. Finally, informed by our operational experiences, we discuss unsolved problems and the lessons we have learned. We hope that our work provides the transparency needed for researchers to soundly use Censys data and offers directions for future research.
Edge Caching as Differentiation
Muhammad Abdullah (EPFL); Mughees Ur Rehman (Virginia Tech); Pavlos Nikolopoulos, Katerina Argyraki (EPFL)
Abstract: Consider an end-user accessing two content providers, A and B, of the same type. If the end-user's ISP prioritizes A-traffic over B-traffic, the end-user may experience A-content with significantly better quality, and the ISP is said to apply "traffic differentiation." We observe that edge caching has a similar effect: if the end-user's ISP hosts a cache that serves A-content with a higher hit rate than B-content, the end-user may experience A-content with significantly better quality. Hence, we examine caching as differentiation: We consider five popular caching providers, measure the hit rates with which they serve different content, and use the measurements to quantify the impact of edge caching on end-user Quality of Experience (QoE). We present the---in our opinion---surprising QoE disparities that result from edge caching and discuss their implications.
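The underlying observation is that a cache hit is served from nearby infrastructure and therefore arrives much faster than a miss fetched from the origin, so per-provider hit rates translate directly into QoE gaps. The snippet below is only a minimal sketch of that classification step under an assumed latency threshold; the threshold, data, and method are illustrative, not the paper's measurement methodology (which could equally rely on response headers or other signals).

```python
def classify_hit(latency_ms: float, threshold_ms: float = 20.0) -> bool:
    # Heuristic: responses served from an ISP-hosted edge cache arrive much faster
    # than responses fetched from the provider's distant origin.
    return latency_ms < threshold_ms

def hit_rate(latencies_ms: list[float]) -> float:
    hits = sum(classify_hit(l) for l in latencies_ms)
    return hits / len(latencies_ms) if latencies_ms else 0.0

provider_a = [8.2, 9.1, 7.5, 55.0, 8.8]     # mostly cache hits (hypothetical samples)
provider_b = [60.3, 58.7, 9.0, 61.2, 59.9]  # mostly misses
print(hit_rate(provider_a), hit_rate(provider_b))
```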
Raha: A General Tool to Analyze WAN Degradation
Behnaz Arzani (Microsoft Research); Sina Taheri (Microsoft); Pooria Namyar (University of Southern California); Ryan Beckett, Siva Kesava Reddy Kakarla (Microsoft Research); Elnaz Jalilipour (Microsoft)
Abstract: Raha is the first general tool that can analyze probable degradation of traffic-engineered networks under arbitrary failures and traffic shifts to prevent outages. Raha addresses a significant gap in prior work, which considers only (1) ≤ k failures; (2) specific traffic engineering schemes; and (3) the maximum impact of failures irrespective of the network design point.
Our insight is to formulate the problem in terms of heuristic analysis, where one seeks to maximize the performance gap between the network design point (i.e., the network with no failures) and the network under failures. We develop techniques that exploit the mechanisms within existing heuristic-analysis tools to encode the problem into components those tools can handle. We present extensive experiments on Microsoft's production network and networks from the Topology Zoo that demonstrate Raha is scalable and can effectively solve the problem. We use Raha to propose capacity augmentations that allow operators to mitigate potential problems and avoid future outages. Our results show Raha can find ≥ 2× higher degradations compared to tools that only consider up to 2 failures.
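In notation of our own (not the paper's), with topology G, admissible traffic demands D, failure scenarios F, and the traffic-engineering scheme TE under analysis, the objective described above can be written as:

```latex
\max_{f \in \mathcal{F},\; d \in \mathcal{D}}
\Big[\, \mathrm{Perf}\!\big(\mathrm{TE}(G, d)\big) \;-\; \mathrm{Perf}\!\big(\mathrm{TE}(G \setminus f, d)\big) \,\Big]
```

where G∖f denotes the topology with failure scenario f applied; a large optimum identifies failure and traffic combinations that degrade the network far below its design point.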
10:50 — 12:10 | Hardware
Session Chair: Kai Chen
ParserHawk: Hardware-aware parser generator using program synthesis
Xiangyu Gao (University of Washington); Jiaqi Gao (Alibaba Cloud); Karan Kumar G, Muhammad Haseeb (New York University); Ennan Zhai (Alibaba Cloud); Bili Dong (Google); Joseph Tassarotti (New York University); Srinivas Narayana (Rutgers University); Anirudh Sivaraman (New York University)
Abstract: Parser programs are becoming increasingly complex to accommodate intricate network packet formats and advanced protocols. Existing parser compilers incorporate predefined program rewrite rules to output the low-level parser implementation. Yet, these rules are often brittle and sensitive to how the input parser program is written. As a result, generated implementations could consume more hardware resources than necessary. In some cases, these compilers unnecessarily reject valid parser programs that could have fit within the target device parser’s resource constraints.
We leverage program-synthesis-based techniques to build a parser compiler, ParserHawk, for two network devices: the Intel Infrastructure Processing Unit (IPU) and the Barefoot Tofino programmable switch. Naively formulating code generation as a program synthesis problem can take hours, if not days, to complete. As a result, ParserHawk incorporates several optimization algorithms, which achieve a geometric mean speed-up of 309.44×. Within a compile time on the order of minutes for most benchmarks, ParserHawk can correctly compile parser programs rejected by existing compilers and can generate parser implementations that use fewer hardware resources.
Nezha: SmartNIC-based Virtual Switch Load Sharing
Xing Li (Zhejiang University and Alibaba Cloud); Enge Song, Bowen Yang, Tian Pan, Ye Yang (Alibaba Cloud); Qiang Fu (School of Computing Technologies, RMIT University); Yang Song, Yilong Lv, Zikang Chen, Jianyuan Lu, Shize Zhang, Xiaoqing Sun, Rong Wen, Xionglie Wei (Alibaba Cloud); Biao Lyu (Zhejiang University and Alibaba Cloud); Zhigang Zong (Alibaba Cloud); Qinming He (Zhejiang University); Shunmin Zhu (Hangzhou Feitian Cloud and Alibaba Group)
Abstract: Cloud providers use SmartNIC-accelerated virtual switches (vSwitches) to offer rich network functions (NFs) for tenant VMs. Constrained by limited SmartNIC resources, it is a challenge to provide sufficient network performance for high-demand VMs. Meanwhile, we observed a significant number of idle vSwitches in the data center, which led us to consider leveraging them to build a remote resource pool for high-demand virtual NICs (vNICs). In this work, we propose Nezha, a distributed vSwitch load sharing system. Nezha reuses the existing idle SmartNICs to handle the excess load from the local SmartNIC without adding new devices. Nezha offloads stateless rule/flow tables to the remote pool, while keeping states locally. This eliminates the need for state synchronization, facilitating load sharing and failover. The deployment cost of Nezha is only a small fraction of that required to deploy new devices. Data collected from production shows that our connections-per-second (CPS) capability bottleneck has shifted from the vSwitch to the VM kernel stack, with the numbers of concurrent flows and vNICs supported increased by up to 50.4× and 40×, respectively.
Mosaic: Breaking the Optics versus Copper Trade-off with a Wide-and-Slow Architecture and MicroLEDs
Kaoutar Benyahya, Ariel Gomez Diaz, Junyi Liu, Vassily Lyutsarev, Marianna Pantouvaki, Kai Shi, Shawn Yohanes Siew, Hitesh Ballani, Thomas Burridge, Daniel Cletheroe, Thomas Karagiannis, Brian Robertson, Ant Rowstron, Mengyang Yang (Microsoft Research); Arash Behziz, Jamie Gaudette (Microsoft Azure); Paolo Costa (Microsoft Research)
Abstract: Link technologies in today's networks impose a fundamental trade-off between reach, power, and reliability. Copper links are power-efficient and reliable but with very limited reach (<2 m). Optical links, in contrast, offer longer reach but at the expense of high power consumption and low reliability. As network speeds increase, this trade-off is becoming more pronounced, severely constraining future network scalability.
We introduce Mosaic, a novel optical link technology that breaks this trade-off. Unlike existing copper and optical links that rely on a narrow-and-fast architecture with a few high-speed channels, Mosaic adopts a wide-and-slow design, employing hundreds of parallel low-speed, area- and power-efficient channels. To make such spatial multiplexing practical, Mosaic uses directly modulated microLEDs instead of lasers, along with massively multi-core imaging fibers, eliminating power-hungry laser drivers and complex electronics, and improving reliability. Our analysis shows that Mosaic achieves a reach well beyond that of copper while reducing power consumption by up to 69% and offering higher reliability than today's optical links. We demonstrate an end-to-end Mosaic prototype with 100 optical channels, each transmitting at 2Gbps, and show how it scales to 800Gbps and beyond with a reach of up to 50m. Mosaic is protocol-agnostic and seamlessly integrates with existing infrastructure, requiring no modifications to existing servers and switches, and offering a practical and scalable link solution for the future of networking.
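For intuition, the wide-and-slow arithmetic is simply aggregate bandwidth = number of channels × per-channel rate. The prototype figure follows directly, and 800 Gbps could, for example, be reached with more parallel channels at the same modest per-channel rate (the abstract does not pin down the exact scaling path):

```latex
100\ \text{channels} \times 2\,\text{Gbps/channel} = 200\,\text{Gbps},
\qquad
400\ \text{channels} \times 2\,\text{Gbps/channel} = 800\,\text{Gbps}.
```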
Falcon: A Reliable, Low Latency Hardware Transport
Arjun Singhvi (Google); Nandita Dukkipati (Google LLC); Prashant Chandra, Hassan M. G. Wassel, Naveen Kr. Sharma, Anthony Rebello, Henry Schuh, Praveen Kumar, Behnam Montazeri, Neelesh Bansod, Sarin Thomas, Inho Cho, Hyojeong Lee Seibert, Baijun Wu, Rui Yang, Yuliang Li, Kai Huang, Qianwen Yin, Abhishek Agarwal (Google); Srinivas Vaduvatha (Meta); Weihuang Wang, Masoud Moshref (Nvidia); Tao Ji (Microsoft); David Wetherall, Amin Vahdat (Google)
Abstract: Hardware transports such as RoCE deliver high performance with minimal host CPU, but are best suited to special-purpose deployments that limit their use, e.g., backend networks or Ethernet with Priority Flow Control (PFC). We introduce Falcon, the first hardware transport that supports multiple Upper Layer Protocols (ULPs) and heterogeneous application workloads in general-purpose Ethernet datacenter environments (with losses and without special switch support). Key design elements include: delay-based congestion control with multipath load balancing; a layered design with a simple request-response transaction interface for multi-ULP support; hardware-based retransmissions and error-handling for scalability; and a programmable engine for flexibility. The first Falcon hardware implementation delivers a peak performance of 200 Gbps, 120 Mops/sec, with near-optimal operation completion times that are up to 8× lower than CX-7 RoCE under network congestion, and up to 65% higher goodput under lossy conditions.
13:30 — 15:30 | AI for SysNet
Session Chair: Soudeh Ghorbani
Hattrick: Solving Multi-Class TE using Neural Models
Abd AlRhman AlQiam, Zhuocong Li (Purdue University); Satyajeet Singh Ahuja (Meta Platforms, Inc); Zhaodong Wang, Ying Zhang (Meta); Sanjay G. Rao, Bruno Ribeiro, Mohit Tawarmalani (Purdue University)
Abstract: While recent work shows ML-based approaches are a promising alternative to conventional optimization methods for Traffic Engineering (TE), existing research is limited to a single traffic class. In this paper, we present Hattrick, the first ML-based approach for handling multiple traffic classes, a key requirement of cloud and ISP WANs. As part of Hattrick, we have developed (i) a novel neural architecture aligned with the sequence of optimization problems in multiclass TE; and (ii) a variant of classical multitask learning methods to deal with the unique challenge of optimizing multiple metrics that have a precedence relationship. Evaluations on a large private WAN and other public datasets show Hattrick outperforms state-of-the-art optimization-based multiclass TE methods by better coping with prediction error – e.g., for GEANT, Hattrick outperforms SWAN by 5.48% to 19.3% across classes when considering the traffic that can be supported 99% (two nines) of the time.
CClinguist: An Expert-Free Framework for Future-Compatible Congestion Control Algorithm Identification
Jiahui Li, Han Qi, Ruyi Yao, Jialin Wei, Ruoshi Sun, Zixuan Chen, Sen Liu, Yang Xu (Fudan University)
Abstract: Congestion control algorithms (CCAs) play a critical role in determining transmission quality. With their rapid evolution during the past few decades, understanding the CCA landscape on the Internet has become increasingly essential for network advancement. Traditional CCA census tools, however, rely heavily on manual configuration and construction, necessitating significant human effort to keep pace with the introduction of new CCAs.
In this paper, we introduce the CClinguist framework, which is an expert-free CCA identification tool. It mainly comprises a network profile auto-generator and a self-learning classifier to accommodate emerging CCAs and new network scenarios, ensuring future compatibility in both the temporal and spatial aspects of the network census. We also develop a prototype of CClinguist and evaluate its performance. The results demonstrate that CClinguist can identify 12 known CCAs in the Linux kernel with 97.33% accuracy. Notably, it maintains high accuracy with new scenarios and emerging CCAs, including learning-based and ECN-based CCAs, outperforming state-of-the-art identification tools. When applied to a census of over 7,000 web servers, CClinguist successfully detects unknown CCA variants and explores ECN support on web servers.
The Sweet Danger of Sugar: Debunking Representation Learning for Encrypted Traffic Classification
Yuqi Zhao, Giovanni Dettori, Matteo Boffa, Luca Vassio, Marco Mellia (Politecnico di Torino)
Abstract: Recently we have witnessed the explosion of proposals that, inspired by Language Models like BERT, exploit Representation Learning models to create traffic representations. All of them promise astonishing performance in encrypted traffic classification (up to 98% accuracy). In this paper, with a networking expert mindset, we critically reassess their performance. Through extensive analysis, we demonstrate that the reported successes are heavily influenced by data preparation problems, which allow these models to find easy shortcuts - spurious correlation between features and labels - during fine-tuning that unrealistically boost their performance. When such shortcuts are not present - as in real scenarios - these models perform poorly. We also introduce Pcap-Encoder, an LM-based representation learning model that we specifically design to extract features from protocol headers. Pcap-Encoder appears to be the only model that provides an instrumental representation for traffic classification. Yet, its complexity questions its applicability in practical settings. Our findings reveal flaws in dataset preparation and model training, calling for a better and more conscious test design. We propose a correct evaluation methodology and stress the need for rigorous benchmarking.
DeepSpace: Super Resolution Powered Efficient and Reliable Satellite Image Data Acquisition
Chuanhao Sun, Yu Zhang (The University of Edinburgh); Bill Tao, Deepak Vasisht (University of Illinois Urbana-Champaign); Mahesh Marina (The University of Edinburgh)
Abstract: Large constellations of low-earth orbit satellites enable frequent high-resolution earth imaging for numerous geospatial applications. They generate large volumes of data in space, hundreds of terabytes per day, which must be transported to Earth through constrained intermittent connections to ground stations. These large volumes lead to day-level delays in data download and exorbitant cloud storage costs. We propose DeepSpace, a new deep learning-based super-resolution approach that compresses satellite imagery by over two orders of magnitude, while preserving image quality using a tailored mixture of experts (MoE) super-resolution framework. DeepSpace reduces the network bandwidth requirements for space-Earth transfer, and can compress images for cloud storage. DeepSpace achieves such gains with the limited computational power available on small LEO satellites. We extensively evaluate DeepSpace against a wide range of state-of-the-art baselines considering multiple satellite image datasets and demonstrate the above-mentioned benefits. We further demonstrate the effectiveness of DeepSpace through several distinct downstream applications (wildfire detection, land use and cropland classification, and fine-grained plastic detection in oceans).
Agua: A Concept-Based Explainer for Learning-Enabled Systems
Sagar Patel (University of California, Irvine); Dongsu Han (KAIST); Nina Narodytska (VMware Research by Broadcom); Sangeetha Abdu Jyothi (University of California, Irvine)
Abstract: While deep learning offers superior performance in systems and networking, adoption is often hindered by difficulties in understanding and debugging. Explainability aims to bridge this gap by providing insight into the model's decisions. However, existing methods primarily identify the most influential input features, forcing operators to perform extensive manual analysis of low-level signals (e.g., buffer at t−1, chunk size at t+2). In this paper, we introduce Agua, an explainability framework that explains a model's decisions using high-level, human-understandable concepts (e.g., "volatile network conditions"). Our concept-based explainability framework lays the foundation for intelligent networked systems, enabling operators to interact with data-driven systems. To explain the controller's outputs using concept-level reasoning, Agua builds a surrogate concept-based model of the controller with two mappings: one from the controller’s embeddings to a predefined concept space, and another from the concept space to the controller's output. Through comprehensive evaluations on diverse applications---adaptive bitrate streaming, congestion control, and distributed denial of service detection---we demonstrate Agua's ability to generate robust, high-fidelity (93-99%) explanations, outperforming prior methods. Finally, we demonstrate several practical use cases of Agua in networking environments---debugging unintended behaviors, identifying distribution shifts, devising concept-based strategies for efficient retraining, and augmenting environment-specific datasets.
Intent-Driven Network Management with Multi-Agent LLMs: The Confucius Framework
Zhaodong Wang, Samuel Lin, Guanqing Yan (Meta); Soudeh Ghorbani (Johns Hopkins University & Meta); Minlan Yu (Harvard University); Jiawei Zhou (Stony Brook University); Nathan Hu, Lopa Baruah, Sam Peters, Srikanth Kamath, Jerry Yang, Ying Zhang (Meta)
Abstract: Advancements in Large Language Models (LLMs) are significantly transforming network management practices. In this paper, we present our experience developing Confucius, a multi-agent framework for network management at Meta. We model network management workflows as directed acyclic graphs (DAGs) to aid planning. Our framework integrates LLMs with existing management tools to achieve seamless operational integration, employs retrieval-augmented generation (RAG) to improve long-term memory, and establishes a set of primitives to systematically support human/model interaction. To ensure the accuracy of critical network operations, Confucius closely integrates with existing network validation methods and incorporates its own validation framework to prevent regressions. Remarkably, Confucius is a production-ready LLM development framework that has been operational for two years, with over 60 applications onboarded. To our knowledge, this is the first report on employing multi-agent LLMs for hyper-scale networks.
Tian Pan, Enge Song, Yueshang Zuo, Shaokai Zhang, Yang Song, Jiangu Zhao, Wengang Hou, Jianyuan Lu, Xiaoqing Sun, Shize Zhang, Ye Yang (Alibaba Cloud); Jiao Zhang, Tao Huang (Purple Mountain Laboratories); Biao Lyu, Xing Li (Zhejiang University and Alibaba Cloud); Rong Wen, Zhigang Zong (Alibaba Cloud); Shunmin Zhu (Hangzhou Feitian Cloud and Alibaba Cloud)
Abstract: Layer-7 load balancers (L7 LBs) improve service performance, availability, and scalability in public clouds. They use I/O event notification facilities like Linux epoll to dispatch connections from the kernel to userspace workers. However, early epoll versions suffered from the thundering herd problem when multiple workers listened on the same port. Epoll exclusive (Linux 4.5) mitigates this but introduces a LIFO wakeup issue, causing connections to accumulate on a few workers. Reuseport (Linux 3.9) distributes traffic evenly across workers but is prone to hash collisions and is unaware of worker unavailability. As each worker handles multi-tenant traffic, inter-worker load balancing is essential to preventing worker overload and ensuring tenant isolation.
In this work, we propose Hermes, a userspace-directed I/O event notification framework to enhance L7 LBs. Hermes treats userspace worker status as a first-class citizen and uses it to direct kernel-space connection dispatch via a closed loop. Specifically, we implement lock-free concurrency management for worker status read/update as well as scheduling decision synchronization from userspace to kernel. In the kernel, we leverage eBPF to override the reuseport socket selection in a non-intrusive way for worker scheduling. Hermes has been deployed on O(100K) CPU cores across all 33 regions of Alibaba Cloud, handling O(10M) RPS traffic. The load balancing on CPU cores has significantly improved, with the average daily worker hangs decreased by 99.8%. The unit cost of our cloud infra for L7 LBs has dropped by 18.9%.
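The central idea above is a closed loop in which userspace worker status steers the kernel's reuseport socket selection (via eBPF). The Python sketch below captures only the selection policy side of that loop, preferring a healthy, lightly loaded worker over a pure hash; it is an illustration of the policy under our own simplifications, not the lock-free shared-memory or eBPF machinery Hermes implements.

```python
from dataclasses import dataclass

@dataclass
class WorkerStatus:
    worker_id: int
    inflight: int          # connections currently being processed
    hung: bool             # e.g., blocked on a slow request

def select_worker(statuses: list[WorkerStatus], flow_hash: int) -> int:
    """Userspace-directed dispatch: prefer healthy, lightly loaded workers."""
    healthy = [w for w in statuses if not w.hung]
    if not healthy:
        return flow_hash % len(statuses)          # fall back to hash-based reuseport behavior
    return min(healthy, key=lambda w: w.inflight).worker_id

workers = [WorkerStatus(0, 12, False), WorkerStatus(1, 3, False), WorkerStatus(2, 0, True)]
print(select_worker(workers, flow_hash=0xBEEF))   # -> 1 (least loaded among non-hung workers)
```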
CEIO: A Cache-Efficient Network I/O Architecture for NIC-CPU Data Paths
Bowen Liu, Xinyang Huang, Qijing Li (Hong Kong University of Science and Technology); Zhuobin Huang (University of Electronic Science and Technology of China); Yijun Sun, Wenxue Li, Junxue Zhang (Hong Kong University of Science and Technology); Ping Yin (Inspur); Kai Chen (Hong Kong University of Science and Technology)
Abstract: An efficient Input/Output (I/O) data path between NICs and CPUs/DRAMs is critical for supporting datacenter applications with high-performance network transmission, especially as link speed scales to 100Gbps and beyond. Traditional I/O acceleration strategies, such as Data Direct I/O (DDIO) and Remote Direct Memory Access (RDMA), perform suboptimally due to the inefficient utilization of the Last-Level Cache (LLC). This paper presents CEIO, a novel cache-efficient network I/O architecture that employs proactive rate control and elastic buffering to achieve zero LLC misses in the I/O data path while ensuring the effectiveness of DDIO and RDMA under various network conditions. We have implemented CEIO on commodity SmartNICs and incorporated it into widely-used DPDK and RDMA libraries. Experiments with a well-optimized RPC framework and a distributed file system under realistic workloads demonstrate that CEIO achieves up to 2.9× higher throughput and 1.9× lower P99.9 latency over prior work.
Centralium: A Hybrid Route-Planning Framework for Large-Scale Data Center Network Migrations
Abstract: Meta's data center networks have relied on BGP for routing and interconnectivity due to its scalability and simplicity. However, as our network evolves, BGP's limitations in supporting complex network migrations have become apparent. These migrations require customized routing at each intermediate step, considering topological properties. In this paper, we highlight the unique challenges of production migration and illustrate how native BGP falls short, unable to encode both sequential and spatial conditions. To address this challenge, we introduce a novel Route Planning Abstraction (RPA) that augments the BGP protocol to support migration. It enables centralized route planning while maintaining distributed enforcement. We demonstrate the power of this abstraction by developing Centralium, a hybrid route-planning framework with a centralized controller, and over ten use cases to support various migrations in production. Our production experience with Centralium, deployed alongside BGP in large-scale data centers, has shown substantial reduction in the time and risk of network migration operations.
ZENITH: Towards A Formally Verified Highly-Available Control Plane
Pooria Namyar, Arvin Ghavidel (University of Southern California); Mingyang Zhang (Google); Harsha V. Madhyastha, Srivatsan Ravi, Chao Wang, Ramesh Govindan (University of Southern California)
Abstract: Today, large-scale software-defined networks use microservice-based controllers. Bugs in these controllers can reduce network availability by making the data plane state inconsistent with the high-level intent. To recover from such inconsistencies, modern controllers periodically reconcile the state of all the switches with the desired intent. However, periodic reconciliation limits the availability and performance of the network at scale. We introduce Zenith, a microservice-based controller that avoids inconsistencies by design rather than always relying on recovery mechanisms. We have formally verified Zenith’s specifications and have proved that it ensures the network state will eventually be consistent with intent. We automatically generate Zenith’s code from its specification to minimize the likelihood of errors in the final implementation. Zenith’s guarantees and abstractions also enable developers to independently verify SDN applications and ensure end-to-end safety and correctness. Zenith resolves inconsistencies 5× faster than today’s designs and significantly improves availability.
Firefly: Scalable, Ultra-Accurate Clock Synchronization for Datacenters
Pooria Namyar (USC & Google LLC); Yuliang Li, Weitao Wang, Nandita Dukkipati, KK Yap, Junzhi Gong, Chen Chen, Peixuan Gao (Google LLC); Devdeep Ray (NVIDIA); Gautam Kumar, Yidan Ma (Google LLC); Ramesh Govindan (USC & Google LLC); Amin Vahdat (Google LLC)
Abstract: Cloud-based financial exchanges require sub-10 ns device-to-device clock synchronization accuracy while adhering to Coordinated Universal Time (UTC). Existing clock sync techniques struggle to meet this demand at scale and are vulnerable to clock drift, jitter, and path asymmetries. Firefly, a software-driven datacenter clock sync system, scalably, cost-effectively, and reliably achieves very high clock sync accuracy. It employs a distributed consensus algorithm on a random overlay graph to rapidly converge to a common time while applying gradual adjustments to device hardware clocks. To realize consistent sync-to-UTC (external sync) across devices while maintaining a stable device-to-device internal sync, Firefly uses a novel technique, layered synchronization, that decouples internal and external syncs. In a 248-machine Clos network, Firefly achieves sub-10 ns device-to-device and ≤1 µs device-to-UTC synchronization, and is resilient to time server failure and unstable clocks.
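For intuition on the consensus component, the toy sketch below has each node average its clock offset with a few random neighbors and apply only a fraction of the correction, so the spread between clocks shrinks round by round. This simple neighbor-averaging rule is our own illustration of the general idea, not Firefly's actual algorithm or its layered internal/external synchronization.

```python
import random

def sync_round(offsets_ns: list[float], degree: int = 3, gain: float = 0.5) -> list[float]:
    """One round: each node averages with a few random overlay neighbors and
    applies only a fraction of the correction (gradual clock adjustment)."""
    n = len(offsets_ns)
    new = offsets_ns[:]
    for i in range(n):
        neighbors = random.sample([j for j in range(n) if j != i], k=min(degree, n - 1))
        target = sum(offsets_ns[j] for j in neighbors + [i]) / (len(neighbors) + 1)
        new[i] = offsets_ns[i] + gain * (target - offsets_ns[i])
    return new

offsets = [random.uniform(-500, 500) for _ in range(8)]   # initial per-node offsets (ns)
for _ in range(20):
    offsets = sync_round(offsets)
print(max(offsets) - min(offsets))                        # device-to-device spread shrinks toward 0
```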
Alibaba Stellar: A New Generation RDMA Network for Cloud AI
Jie Lu, Jiaqi Gao, Fei Feng, Zhiqiang He, Menglei Zheng, Kun Liu, Jun He, Binbin Liao, Suwei Xu, Ke Sun, Yongjia Mo, Qinghua Peng, Jilie Luo, Qingxu Li, Gang Lu, Zishu Wang, Jianbo Dong, Kunling He, Sheng Cheng, Jiamin Cao, Hairong Jiao, Pengcheng Zhang, Shu Ma, Lingjun Zhu, Chao Shi, Yangming Zhang, Yiquan Chen, Wei Wang, Shuhong Zhu, Xingru Li, Qiang Wang, Jiang Liu, Chao Wang, Wei Lin, Ennan Zhai, Jiesheng Wu, Qiang Liu, Binzhang Fu, Dennis Cai (Alibaba Cloud)
Abstract: The rapid adoption of Large Language Models (LLMs) in cloud environments has intensified the demand for high-performance AI training and inference, where Remote Direct Memory Access (RDMA) plays a critical role. However, existing RDMA virtualization solutions, such as Single-Root Input/Output Virtualization (SR-IOV), face significant limitations in scalability, performance, and stability. These issues include lengthy container initialization times, hardware resource constraints, and inefficient traffic steering. To address these challenges, we propose Stellar, a new generation RDMA network for cloud AI. Stellar introduces three key innovations: Para-Virtualized Direct Memory Access (PVDMA) for on-demand memory pinning, extended Memory Translation Table (eMTT) for optimized GPU Direct RDMA (GDR) performance, and RDMA Packet Spray for efficient multi-path utilization. Deployed in our large-scale AI clusters, Stellar spins up virtual devices in seconds, reduces container initialization time by 73%, and improves LLM training speed by up to 14%. Our evaluations demonstrate that Stellar significantly outperforms existing solutions, offering a scalable, stable, and high-performance RDMA network for cloud AI.
16:30 — 17:15 | Shorts
Session Chair: George Porter
Coflow Scheduling for LLM Training
Xinchen Wan, Xinyu Yang, Kaiqiang Xu, Xudong Liao (Hong Kong University of Science and Technology); Yilun Jin (The Hong Kong University of Science and Technology); Yijun Sun, Zhenghang Ren (Hong Kong University of Science and Technology); Han Tian (The University of Science and Technology of China); Kai Chen (Hong Kong University of Science and Technology)
Abstract: Training large language models (LLMs) generates diverse coflows within a cluster, requiring optimized scheduling to enhance communication-computation overlap and minimize training time. Existing schedulers inadequately handle contention both across and within coflows, resulting in suboptimal performance.
We present Hermod, a comprehensive coflow scheduler that orchestrates all coflow types for LLM training. The key insight behind Hermod is that coflows can be uniquely characterized by three model factors—microbatch ID, coflow type, and layer ID—enabling optimal scheduling decisions. Leveraging this insight, Hermod applies model-factor–driven inter-coflow priority scheduling aligned with the LLM training DAG. Preliminary simulation results show potential for performance improvements.
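To make the model-factor idea concrete, the toy sketch below turns (microbatch ID, coflow type, layer ID) into a sortable priority key so contending coflows are served in an order aligned with the training DAG. The particular ordering chosen here is illustrative and hypothetical, not Hermod's actual policy.

```python
from enum import IntEnum

class CoflowType(IntEnum):
    # Illustrative ordering; which type should win under contention is a policy choice.
    PIPELINE_ACTIVATION = 0
    TENSOR_PARALLEL = 1
    DATA_PARALLEL_GRAD = 2

def priority_key(microbatch_id: int, ctype: CoflowType, layer_id: int) -> tuple:
    """Smaller tuple == higher priority: earlier microbatches unblock downstream compute
    first, then coflow types closer to the critical path, then earlier layers."""
    return (microbatch_id, int(ctype), layer_id)

coflows = [(2, CoflowType.DATA_PARALLEL_GRAD, 40),
           (1, CoflowType.TENSOR_PARALLEL, 12),
           (1, CoflowType.PIPELINE_ACTIVATION, 12)]
for mb, ct, layer in sorted(coflows, key=lambda c: priority_key(*c)):
    print(mb, ct.name, layer)
```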
HardMesh: Enabling High-performance Service Mesh Ingress Processing with SmartNICs
Myoungsung You (University of Seoul); Jaehyun Nam (Dankook University); Minjae Seo (ETRI); Taejune Park (Chonnam National University); Seungwon Shin (KAIST)
Abstract: Service meshes have become essential for enabling microservice communication in cloud environments; however, they also introduce substantial network overhead. In particular, the ingress gateway, which serves as the primary entry point for external traffic, has emerged as a major performance bottleneck due to CPU-intensive traffic analysis and prolonged forwarding paths through multiple network stack layers. Our analysis indicates that these inefficiencies can result in a 4-fold reduction in network throughput and increased CPU resource consumption. In response, we propose HardMesh, a hardware-software hybrid ingress gateway that leverages a SmartNIC for high-performance traffic analysis and efficient external traffic routing. This process is augmented by a lightweight CPU-based proxy for traffic management. Experimental results demonstrate that HardMesh outperforms existing ingress gateways, achieving up to 4.4× higher throughput while providing the same range of traffic management services.
SpliDT: Partitioned Decision Trees for Scalable Stateful Inference at Line Rate
Murayyiam Parvez (Purdue University); Annus Zulfiqar (University of Michigan); Roman Beltiukov (University of California, Santa Barbara); Shir Landau Feibish (Open University of Israel); Walter Willinger (NIKSUN, Inc.); Arpit Gupta (University of California, Santa Barbara); Muhammad Shahbaz (University of Michigan)
Abstract: Machine learning is increasingly used in programmable data planes, such as switches and smartNICs, to enable real-time traffic analysis and security monitoring at line rate. Decision trees (DTs) are particularly well-suited for these tasks due to their interpretability and compatibility with the Reconfigurable Match-Action Table (RMT) architecture. However, current DT implementations require collecting all features upfront, which limits scalability and accuracy due to constrained data plane resources.
This paper introduces SpliDT, a scalable framework that reimagines DT deployment as a partitioned inference problem over a sliding window of packets (Figure 1). By dividing inference into sequential subtrees—each using its own set of top-k features—SpliDT supports more stateful features without exceeding hardware limits. An in-band control channel, implemented via packet recirculation, manages transitions between subtrees and reuses match keys and registers across partitions. This design allows physical resources to be shared efficiently while maintaining line-rate processing.
To maximize accuracy and scalability, SpliDT employs a custom training and design-space-exploration (DSE) workflow that jointly optimizes feature allocation, tree depth, and partitioning. Evaluations show that SpliDT supports up to 5x more features, scales to millions of flows, and outperforms baselines, with low overhead and minimal time-to-detection (TTD).
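To make the partitioned-inference idea concrete, the sketch below walks a flow through a sequence of subtrees, each consuming only its own small feature subset and either emitting a label or handing a partition state to the next subtree, loosely mirroring how SpliDT recirculates packets between partitions. The features, thresholds, and labels are hypothetical simplifications, not the paper's trained trees.

```python
# Each subtree sees only its own top-k features and either classifies or defers.
def subtree_0(feats):
    if feats["pkt_count"] > 100:
        return ("goto", 1)        # recirculate: continue classification in subtree 1
    return ("label", "benign")

def subtree_1(feats):
    if feats["mean_iat_us"] < 50 and feats["syn_ratio"] > 0.8:
        return ("label", "scan")
    return ("label", "benign")

SUBTREES = [subtree_0, subtree_1]

def classify(flow_features: dict) -> str:
    state = 0                      # partition id carried across recirculation passes
    while True:
        action, value = SUBTREES[state](flow_features)
        if action == "label":
            return value
        state = value

print(classify({"pkt_count": 240, "mean_iat_us": 20, "syn_ratio": 0.95}))  # -> scan
```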
Carbon- and Precedence-Aware Scheduling for Data Processing Clusters
Adam Lechowicz (University of Massachusetts Amherst); Rohan Shenoy (University of California Berkeley); Noman Bashir (Massachusetts Institute of Technology); Mohammad Hajiesmaili (University of Massachusetts Amherst); Adam Wierman (California Institute of Technology); Christina Delimitrou (Massachusetts Institute of Technology)
Abstract: As large-scale data processing workloads continue to grow, their carbon footprint raises concerns. Prior research on carbon-aware schedulers has focused on shifting computation to align with the availability of low-carbon energy, but these approaches assume that each task can be executed independently. In contrast, data processing jobs have precedence constraints that complicate decisions, since delaying an upstream "bottleneck" task to a low-carbon period also blocks downstream tasks, impacting makespan. In this paper, we show that carbon-aware scheduling for data processing benefits from knowledge of both time-varying carbon and precedence constraints. Our main contribution is PCAPS, a carbon-aware scheduler that builds on state-of-the-art scoring or probability-based techniques -- in doing so, it explicitly relates the structural importance of each task against the time-varying characteristics of carbon intensity. To illustrate gains due to fine-grained task-level scheduling, we also study CAP, a wrapper for any carbon-agnostic scheduler that generalizes the provisioning ideas of PCAPS. Both techniques allow a user-configurable priority between carbon and makespan, and we give basic analytic results to relate the trade-off between these objectives. Our prototype on a 100-node Kubernetes cluster shows that a moderate configuration of PCAPS reduces carbon footprint by up to 32.9% without significantly impacting total efficiency.
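As a hedged sketch of the trade-off described above, the rule below weighs a task's structural importance (how much downstream work it blocks) against the current carbon intensity, with a user-configurable knob between carbon and makespan. It is an illustration of the idea only, not PCAPS's actual scoring function.

```python
def should_run_now(num_descendants: int, total_tasks: int,
                   carbon_intensity: float, carbon_max: float,
                   alpha: float = 0.5) -> bool:
    """alpha=0: ignore carbon (pure makespan); alpha=1: defer aggressively when carbon is high."""
    importance = num_descendants / max(total_tasks, 1)      # bottleneck tasks block more work
    dirtiness = carbon_intensity / max(carbon_max, 1e-9)    # 0 (clean) .. 1 (dirtiest period)
    # Run only if the task is important enough to justify the current carbon cost.
    return importance >= alpha * dirtiness

# A bottleneck task runs even in a dirty period; a leaf task waits for cleaner energy.
print(should_run_now(num_descendants=40, total_tasks=50, carbon_intensity=450, carbon_max=500))  # True
print(should_run_now(num_descendants=0,  total_tasks=50, carbon_intensity=450, carbon_max=500))  # False
```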
HACK: Homomorphic Acceleration via Compression of the Key-Value Cache for Disaggregated LLM Inference
Zeyu Zhang, Haiying Shen (University of Virginia); Shay Vargaftik (VMware Research); Ran Ben Basat (University College London); Michael Mitzenmacher, Minlan Yu (Harvard University)
Abstract: Disaggregated Large Language Model (LLM) inference decouples the compute-intensive prefill stage from the memory-intensive decode stage, allowing low-end, compute-focused GPUs for prefill and high-end, memory-rich GPUs for decode, which reduces cost while maintaining high throughput. However, transmitting Key-Value (KV) data between the two stages can be a bottleneck, especially for long prompts. Additionally, the computational overhead in the two stages is key for optimizing Job Completion Time (JCT), and KV data size can become prohibitive for long prompts and sequences. Existing KV quantization methods can alleviate transmission and memory bottlenecks, but they introduce significant dequantization overhead, exacerbating the computation time.
We propose Homomorphic Acceleration via Compression of the KV cache (HACK) for disaggregated LLM inference. HACK eliminates the heavy KV dequantization and directly computes on quantized KV data to approximate and reduce the cost of expensive matrix multiplication. Extensive trace-driven experiments show that HACK reduces JCT by up to 70.9% compared to the disaggregated LLM inference baseline and by up to 52.3% compared to state-of-the-art KV quantization methods.
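The reason one can compute directly on quantized data follows from linearity: with symmetric quantization, q ≈ s_q·q_int and k ≈ s_k·k_int, so q·k ≈ s_q·s_k·(q_int·k_int); the expensive matmul can run on integers and the scales are applied once afterwards, with no per-element dequantization. The numpy snippet below illustrates only this general property, not HACK's full scheme (which also addresses the value path and error control).

```python
import numpy as np

def sym_quantize(x: np.ndarray, bits: int = 8):
    """Symmetric per-tensor quantization: x ~= scale * x_int."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.round(x / scale).astype(np.int32), scale

q = np.random.randn(4, 64).astype(np.float32)     # queries
k = np.random.randn(16, 64).astype(np.float32)    # cached keys

q_int, s_q = sym_quantize(q)
k_int, s_k = sym_quantize(k)

# Integer matmul first, one scalar multiply afterwards.
scores_quant = (q_int @ k_int.T).astype(np.float32) * (s_q * s_k)
scores_exact = q @ k.T
print(np.max(np.abs(scores_quant - scores_exact)))   # small quantization error
```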
Designing Transport-Level Encryption for Datacenter Networks
Tianyi Gao, Xinshu Ma, Suhas Narreddy, Eugenio Luo, Steven Chien, Michio Honda (University of Edinburgh)
Abstract: This paper presents SDP, a protocol design for emerging datacenter transports, such as NDP and Homa, to integrate data encryption. It supports NIC offloading designed for TLS over TCP, a native protocol number alongside TCP and UDP, and a message-based abstraction that enables low-latency RPCs with fine-grained parallelism.
ZooRoute: Enhancing Cloud-Scale Network Reliability via Overlay Proactive Rerouting
Xiaoqing Sun, Xionglie Wei (Alibaba Cloud); Xing Li (Zhejiang University and Alibaba Cloud); Ju Zhang, Bowen Yang, Yi Wang, Ye Yang, Yu Qi, Le Yu, Chenhao Jia, Zhanlong Zhang, Xinyu Chen, Jianyuan Lu, Shize Zhang, Enge Song, Yang Song, Tian Pan, Rong Wen, Biao Lyu (Alibaba Cloud); Yang Xu (Fudan University); Shunmin Zhu (Hangzhou Feitian Cloud and Alibaba Cloud)
Abstract: This paper presents ZooRoute, a tenant-transparent, fast failure recovery service that requires no modifications to physical devices. ZooRoute leverages the overlay layer and enables traffic flows to bypass failures by altering source ports (srcPorts) in packet headers during encapsulation. To enable deployment in large-scale cloud networks, ZooRoute proposes: 1) On-demand probing to efficiently monitor a vast number of hosts while minimizing telemetry costs. 2) Table compression to record the states of numerous paths with limited on-chip resources. 3) A device-sensing mechanism to prevent unnecessary reconnections in stateful forwarding. Deployed in Alibaba Cloud for 18 months, ZooRoute has significantly improved network reliability, reducing cumulative outage time by 92.71%.
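The rerouting primitive above is simple: because the underlay's ECMP hash covers the encapsulated five-tuple, changing the srcPort used at encapsulation moves a flow onto a different physical path. A toy sketch of that primitive follows; the hash function and the path-health table are stand-ins of our own, not Alibaba's implementation.

```python
import zlib

NUM_PATHS = 8
path_healthy = [True] * NUM_PATHS
path_healthy[3] = False                       # a failure reported by probing

def ecmp_path(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    five_tuple = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}/udp".encode()
    return zlib.crc32(five_tuple) % NUM_PATHS  # stand-in for the underlay ECMP hash

def pick_src_port(src_ip: str, dst_ip: str, dst_port: int, base_port: int = 4789) -> int:
    """Walk candidate encapsulation srcPorts until the flow hashes onto a healthy path."""
    for offset in range(64):
        port = base_port + offset
        if path_healthy[ecmp_path(src_ip, dst_ip, port, dst_port)]:
            return port
    return base_port                           # no healthy path found; keep the default

print(pick_src_port("10.0.0.1", "10.0.1.7", 4789))
```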
Achieving High-Speed and Robust Encrypted Traffic Anomaly Detection with Programmable Switches
Han Zhang (Tsinghua University); Guyue Liu (Peking University); Xingang Shi (Tsinghua University); Yahui Li (Tsinghua University, China); Dongbiao He (Sangfor Technologies Inc.); Jilong Wang, Zhiliang Wang (Tsinghua University); Yongqing Zhu, Ke Ruan, Weihua Cao (China Telecom); Xia Yin (Tsinghua University)
Abstract: Attacks against data centers are becoming more common as a result of the fast expansion of applications. In order to keep pace with the growing number of data centers connected to their networks, internet service providers must offer comprehensive security services. However, existing network intrusion detection systems (NIDS) are either ineffective or inefficient for high-speed encrypted network traffic. In this paper, we design and implement Mazu, an inline network intrusion detection system with programmable switches specifically developed to protect data centers connecting to the internet service provider. Mazu proposes a dual-plane feature extraction model to extract extensive traffic features at near line speed. Mazu also proposes a lightweight one-class classification model that trains the best parameters exclusively on benign traffic to identify malicious traffic. In addition, Mazu introduces an online update mechanism aimed at dynamically adjusting the detection model in response to environmental changes. Mazu has been in production for two years, during which time it has identified over 10 critical attack events and protected more than 10 million servers for two ISPs. Our production and testbed evaluations demonstrate that Mazu can detect malicious traffic entering the data center sites with approximately 90% accuracy within minutes.
Palladium: A DPU-enabled Multi-Tenant Serverless Cloud over Zero-copy Multi-node RDMA Fabrics
Shixiong Qi (University of Kentucky); Songyu Zhang, K. K. Ramakrishnan (University of California, Riverside); Diman Zad Tootaghaj, Hardik Soni, Puneet Sharma (Hewlett Packard Labs)
Abstract: Serverless computing offers resource efficiency but suffers from a heavyweight data plane. We present Palladium, a DPU-offloaded serverless data plane enabling distributed zero-copy communication. Palladium uses two-sided RDMA and cross-processor shared memory to mitigate limitations of wimpy DPU cores. Its DPU-enabled network engine (DNE) isolates RDMA resources and manages flows across tenants. By converting HTTP/TCP to RDMA at ingress, Palladium reduces protocol overhead on the critical path.
Threading the Ocean: Mapping Digital Routes Across Submarine Cables using Calypso
Caleb Wang, Ying Zhang, Esteban Carisimo, Fabián E. Bustamante (Northwestern University); Ram Durairajan (University of Oregon); Qianli Dong (Northwestern University)
Abstract: The submarine cable network (SCN) — the backbone of global Internet connectivity — faces growing threats with severe economic and security implications. Enhancing its resilience requires a deeper understanding of the criticality of specific cables and landing points, which, in turn, depends on accurately mapping traffic to cables to identify vulnerabilities and assess their impact on regional and global connectivity. However, the lack of a robust framework for evaluating SCN criticality leaves researchers and policymakers without actionable insights to safeguard this vital infrastructure.
This paper introduces Calypso, a novel framework that maps traceroutes to the submarine cables they traverse, along with Route Stress, an experimentally inferred metric that estimates the relative importance of submarine cable infrastructure using traceroute data. Calypso addresses key challenges posed by the opacity of the submarine cable industry and the complexity of mapping network-layer paths to physical infrastructure. Through rigorous validation and case studies, we demonstrate the efficacy of Calypso in assessing SCN vulnerabilities and highlight its potential to provide insights to enhance the resilience of global submarine infrastructure.
AoRA: AI-on-RAN for Backhaul-free Edge Inference
Siyavushkhon Kholmatov (Korea Advanced Institute of Science and Technology); Seongsik Cho (Seoul National University); Song Chong (Korea Advanced Institute of Science and Technology); Kyunghan Lee (Seoul National University)
Abstract: In cellular networks, edge intelligence is often enabled by Multi-Access Edge Computing (MEC), which aims to bring AI services closer to end users. Although MEC reduces latency by placing computation near the network edge, it remains external to the Radio Access Network (RAN) and its native execution environment, thereby introducing additional transport and buffering delays. The emerging AI-on-RAN paradigm proposes to overcome these limitations but remains largely conceptual, lacking practical implementation and feasibility validation. In this paper, we present AoRA, the first framework realizing the AI-on-RAN vision by dynamically utilizing available computational headroom in GPU- and NPU-accelerated RAN platforms to deliver AI services directly from within the base stations. AoRA leverages containerized AI workloads in the 5G RAN stack to enable inlined and opportunistic AI service provisioning without degrading core telecom operations. The framework is fully compliant with O-RAN interfaces and can operate seamlessly alongside existing edge computing infrastructures. Evaluations show that AoRA reduces transport latency by over 30% compared to MEC and 70% compared to cloud-based setups.
NIER: Practical Neural-enhanced Low-bitrate Video Conferencing
Anlan Zhang (University of Southern California); Yuming Hu (University of Minnesota Twin Cities); Chendong Wang (University of Wisconsin Madison); Yu Liu, Zejun Zhang (University of Southern California); Haoyu Gong (University of Minnesota, Twin Cities); Ahmad Hassan (University of Southern California); Shichang Xu (Google); Zhenhua Li (Tsinghua University); Bo Han (George Mason University); Feng Qian (University of Southern California)
Abstract: In this paper, we develop NIER, a practical low-bitrate video conferencing solution. It can adaptively maintain a low bitrate (e.g., 10–100 Kbps) with reasonable visual quality while being robust to packet losses. Satisfying these design requirements makes NIER suitable for a wide range of usage scenarios, in particular over challenging/metered networks. Under the hood, NIER leverages key-point-based deep image animation (DIA) as a key building block, where the sender transmits sparse key-points alongside a reference image, and the receiver reconstructs the original video frames by animating the reference image using the key-points’ motion. To make DIA practical, NIER addresses a series of challenges in networking and system dimensions, including robustly updating reference frames, adapting to fluctuating bandwidth, handling varying packet loss rates, and achieving line-rate frame processing on commodity client devices. Our extensive evaluations (including an IRB-approved user study involving 20 participants) demonstrate that NIER considerably outperforms several baseline solutions (traditional video codecs, super resolution-enhanced video conferencing, forward error coding (FEC), loss-resilient neural codec, and naive application of key-point-based DIA) in terms of end-to-end latency, decodable frame ratio, frame rate, video quality, and/or users’ quality-of-experience (QoE).
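A back-of-the-envelope sketch of why a key-point stream fits a tens-of-Kbps budget while conventional codecs typically do not (the counts below are illustrative assumptions, not NIER's measurements):

    # Assumed key-point payload: 10 key-points per frame, each an (x, y) pair of 32-bit floats.
    KEYPOINTS_PER_FRAME = 10
    BYTES_PER_KEYPOINT = 2 * 4
    FPS = 25

    keypoint_kbps = KEYPOINTS_PER_FRAME * BYTES_PER_KEYPOINT * FPS * 8 / 1000
    print(f"key-point stream: ~{keypoint_kbps:.0f} Kbps, plus occasional reference images,"
          f" versus hundreds of Kbps for a typical conferencing codec")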
Discovering Millions of New Nodes and Links in the Internet by Challenging the Uniformity Assumption in Multipath Detection
Zhongxu Guan (Tsinghua University); Shuai Wang, Li Chen, Zhaoteng Yan (Zhongguancun Laboratory); Jiaye Lin, Dan Li (Tsinghua University); Yong Jiang (Tsinghua Shenzhen International Graduate School); Yingxin Wang, Ziqian Liu (China Telecom Cybersecurity Technology Co.,Ltd.)
Abstract: Multipath Detection Algorithms (MDA) are proposed to discover Internet topology under load balancing (LB). Existing methods assume uniformity in the load-balancing responses (LBR), i.e., responses from the successors of an LB router. However, we reveal that only 20% of the cases in the Internet exhibit uniformity. This finding significantly challenges the completeness of the Internet topology discovered using current MDA.
In this paper, we introduce BayMuDA, a novel system that overcomes this limitation by eliminating the uniformity assumption. Leveraging the Markov property of packet forwarding, BayMuDA employs a Bayesian network to frame LBR distribution retrieval as a parameter estimation problem with incomplete data. Using the estimated LBR distributions, BayMuDA calculates the minimum probes needed to statistically discover all nodes and edges within a given hop. In validation on controlled topologies, BayMuDA discovers at least 85%/73% of nodes/links in ~90% of the cases. In Internet-wide measurement, BayMuDA discovers millions of Internet nodes and links obscured from the state-of-the-art MDA, D-Miner, by uneven responses. Even with the same number of probes, BayMuDA discovers 1.3× more links than D-Miner by avoiding redundant probing. Based on the more complete topology, we observe improved Internet path stability and find that multipath topologies have grown larger in scale.
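A toy calculation of the effect BayMuDA exploits (a simple union-bound estimate under assumed successor probabilities, not the paper's Bayesian procedure): once the per-successor probabilities are skewed rather than uniform, the number of probes required to observe every successor grows sharply.

    def min_probes(successor_probs, confidence=0.95):
        """Smallest n such that, by a union bound, P(some successor is still unseen
        after n probes) is at most 1 - confidence."""
        assert abs(sum(successor_probs) - 1.0) < 1e-6
        n = 1
        while sum((1.0 - p) ** n for p in successor_probs) > 1.0 - confidence:
            n += 1
        return n

    print(min_probes([0.25, 0.25, 0.25, 0.25]))  # uniform over 4 next hops
    print(min_probes([0.70, 0.15, 0.10, 0.05]))  # skewed: the rarest next hop dominates the probe budget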
Wisely Optimizing Short Video Streaming for a User-Vendor Win-Win Outcome
Jinlong E (Renmin University of China); Wei Xu, Jianfei Bi (Renmin University of China); Lin He (Tsinghua University); Haoyang Li (Columbia University); Anqi Gu (Renmin University of China); Dan Yang (ByteDance Inc.); Yunpeng Chai (Renmin University of China)
Abstract: Short video streaming platforms widely employ video prefetching to ensure users' quality of experience (QoE), but frequent user swipes lead to massive data wastage, creating a significant financial burden for vendors. Existing academic and industrial solutions fail to strike a balance, often sacrificing either data savings or the authentic user-perceived QoE. We introduce a framework that intelligently reduces streaming data consumption without compromising user experience. The core idea is to make prefetching decisions adaptive to both user swiping behavior and dynamic network conditions. Real-world evaluations show our framework significantly outperforms the state-of-the-art solutions on data wastage as well as user-perceived QoE.
Day 3 - Wednesday, September 10 2025
08:30 — 10:30 | NetMon
Session Chair: Li Chen
DNSLogzip: A Novel Approach to Fast and High-Ratio Compression for DNS Logs
Yunwei Dai (Southeast University; Jiangsu Future Networks Innovation Institute); Guyue Liu (Peking University); Tao Huang (Southeast University; State Key Laboratory of Networking and Switching Technology, BUPT, China; Purple Mountain Laboratories); Shuo Wang (State Key Laboratory of Networking and Switching Technology, BUPT, China; Purple Mountain Laboratories); Yong Wang, Xingli Wu (Jiangsu Zhiwang Technology Co., Ltd.); Lei Song, Heshun Li, Chuang Wang (China Mobile Communications Group Shandong Co., Ltd.); Fanglong Hu (China Telecom Co., Ltd.); Hong Sun, Yanan Li (China United Network Communications Corporation Jiangsu Branch)
Abstract: Domain Name System (DNS) logs capture detailed records of the queries and responses exchanged between DNS servers and clients, playing a crucial role in applications such as cybersecurity monitoring and regulatory compliance, which often require long-term data retention. With the rapid growth of Internet traffic, the volume of DNS logs has surged, presenting significant storage challenges. Although many DNS operators use general-purpose compression algorithms to reduce storage costs, these solutions fail to fully exploit the unique characteristics of DNS data, leading to inefficiencies and rising storage demands.
In this work, we propose a novel solution, DNSLogzip, designed to achieve lossless, fast, and high-ratio compression for DNS logs. Built on in-depth empirical studies of real-world DNS log datasets, we identify four key inter-line and intra-line characteristics that enable effective reduction of redundancies without information loss. DNSLogzip leverages these insights through a modular compression architecture, which allows it to handle varying log formats and offers easy integration and customization based on specific requirements. We have deployed DNSLogzip on two tier-1 Internet Service Providers (ISPs) to compress production logs. The results show that DNSLogzip can reduce storage costs by approximately two-thirds, potentially saving up to $163,000 per DNS service node per month.
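As a flavor of how inter-line redundancy in DNS logs can be squeezed out before a general-purpose compressor (a toy delta-encoding pass, not DNSLogzip's pipeline; the log field layout is made up):

    import zlib

    def delta_encode(lines):
        """Replace any field that repeats the same position of the previous line with an
        empty marker; reversible because whitespace-split fields are never empty."""
        prev, out = [], []
        for line in lines:
            fields = line.split()
            out.append("\t".join(f if i >= len(prev) or f != prev[i] else ""
                                 for i, f in enumerate(fields)))
            prev = fields
        return "\n".join(out)

    logs = [
        "1717000000 10.0.0.1 example.com A NOERROR 93.184.216.34",
        "1717000000 10.0.0.1 example.com AAAA NOERROR 2606:2800:220:1:248:1893:25c8:1946",
        "1717000001 10.0.0.2 example.org A NXDOMAIN -",
    ]
    raw = "\n".join(logs).encode()
    print(len(zlib.compress(raw)), "bytes plain vs",
          len(zlib.compress(delta_encode(logs).encode())), "bytes delta-encoded")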
Hawkeye: Diagnosing RDMA Network Performance Anomalies with PFC Provenance
Shicheng Wang (Tsinghua University); Menghao Zhang (Beihang University); Xiao Li (Tsinghua University); Qiyang Peng (Beihang University); Haoyuan Yu, Zhiliang Wang, Mingwei Xu (Tsinghua University); Xiaohe Hu (Infrawaves); Jiahai Yang, Xingang Shi (Tsinghua University)
Abstract: RDMA is becoming increasingly prevalent from private data centers to public multi-tenant clouds, due to its remarkable performance improvements. However, its lossless traffic control, i.e., PFC, introduces new complexities in network performance anomalies (NPAs) due to its cascading congestion spreading property, which usually incurs complaints from customers/applications about certain flows' performance degradation. Existing studies fall short in fine-grained visibility of PFC impact and traceability of PFC causality, and are thus ineffective in diagnosing the root causes of RDMA NPAs. In this paper, we propose Hawkeye, an accurate and efficient RDMA NPA diagnosis system based on PFC provenance. Hawkeye comprises 1) a fine-grained PFC-aware telemetry mechanism to record the PFC impact on flows; 2) an in-network PFC causality analysis and tracing mechanism to quickly and efficiently collect causal telemetry for diagnosis; and 3) a provenance-based diagnosis algorithm to comprehensively present the anomaly breakdown, identifying the anomaly type and root causes accurately. Extensive evaluations on both NS-3 simulations and a Tofino testbed show that Hawkeye can quickly and accurately diagnose multiple RDMA NPAs with over 90% precision and 1-4 orders of magnitude lower overhead than baselines.
Towards LLM-Based Failure Localization in Production-Scale Networks
Chenxu Wang (Nanjing University and Alibaba Cloud); Xumiao Zhang (Alibaba Cloud); Runwei Lu (New York University Shanghai and Alibaba Cloud); Xianshang Lin, Xuan Zeng, Xinlei Zhang, Zhe An, Gongwei Wu, Jiaqi Gao (Alibaba Cloud); Chen Tian, Guihai Chen (Nanjing University); Guyue Liu (Peking University); Yuhong Liao, Tao Lin, Dennis Cai, Ennan Zhai (Alibaba Cloud)
Abstract: Root causing and failure localization are critical to maintaining reliability in cloud network operations. When an incident is reported, network operators must review massive volumes of monitoring data and identify the root cause (i.e., the error device) as fast as possible, a task that is extremely challenging even for experienced operators. Large language models (LLMs) have shown great potential in text understanding and reasoning. In this paper, we present BIAN, an LLM-based framework designed to assist operators in efficient incident investigation. BIAN processes monitoring data and generates error device rankings with detailed explanations. To date, BIAN has been deployed in our network infrastructure for 10 months, and it has successfully assisted operators in identifying error devices more quickly, reducing the time to root-cause incidents by 20.5% (55.2% for high-risk incidents). Extensive performance evaluations based on 17 months of real cases further demonstrate that BIAN achieves accurate and fast failure localization. It improves accuracy by 9.2% compared to the baseline approach.
SkyNet: Analyzing Alert Flooding from Severe Network Failures in Large Cloud Infrastructures
Bo Yang, Huanwu Hu, Yifan Li, Yunguang Li, Xiangyu Tang, Bingchuan Tian, Gongwei Wu, Jianfeng Xu, Xumiao Zhang, Feng Chen, Cheng Wang, Ennan Zhai, Yuhong Liao, Dennis Cai, Tao Lin (Alibaba Cloud)
Abstract: For companies like ours that operate large-scale global networks, the timeliness of network failure recovery significantly impacts the reliability of network services. Ideally, a network monitoring system should have enough coverage to detect even minor issues, but high coverage means alert floods during severe network failures. In practice, there is a gap between the flood of raw alerts collected by network monitoring tools and the readable information needed for failure diagnosis. Existing solutions, which rely on limited network monitoring data sources and heuristic diagnostic rules, lack comprehensive coverage and the capability to address severe failures, especially ones that network operators have never handled before. In this paper, we introduce SkyNet, a network analysis system that extracts scope and severity information from alert floods. SkyNet ensures comprehensive coverage by integrating multiple monitoring data sources through a uniform input format, enhancing extensibility for new network monitoring tools. During alert floods, SkyNet groups alerts, assesses their severity, and filters out insignificant ones to aid network operators in mitigating network failures. To date, SkyNet has been running stably on our network for 9 months without any false negatives and has reduced the time-to-mitigation for over 80% of network failures since its deployment in production.
SkeletonHunter: Diagnosing and Localizing Network Failures in Containerized Large Model Training
Wei Liu (Tsinghua University and Alibaba Cloud); Kun Qian (Alibaba Cloud); Zhenhua Li (Tsinghua University); Tianyin Xu (University of Illinois Urbana-Champaign); Yunhao Liu (Tsinghua University); Weicheng Wang, Yun Zhang, Jiakang Li, Shuhong Zhu, Xue Li, Hongfei Xu, Fei Feng, Ennan Zhai (Alibaba Cloud)
Abstract: The flexibility, portability, and isolation of containers have made them a popular environment for large model training in recent years. Unfortunately, these advantages render the network support for containerized large model training extremely challenging, due to the high dynamics of containers, the complex interplay between underlay and overlay networks, and the stringent requirements on failure detection and localization. Existing data center network debugging tools, which rely on comprehensive or opportunistic monitoring, are either inefficient or inaccurate in this setting.
This paper presents SkeletonHunter, a container network monitoring and diagnosis system that leverages the intrinsic and regular sparsity of the network traffic incurred by large model training. Its key idea is to reason about the traffic skeleton, a crucial set of network paths consistently traversed by the training traffic, so as to reliably detect and localize network failures in a short time. We deployed it in production for six months, uncovering 4,816 network failures with 98.2% precision and 99.3% recall, and localizing them with a high accuracy of 95.7%. After fixing 98% of the problematic network components, the monthly network failure rate dropped by 99.1%.
ByteTracker: An Agentless and Real-time Path-aware Network Probing System
Shixian Guo (ByteDance China); Kefei Liu (Beijing University of Posts and Telecommunications); Yulin Lai, Yangyang Bai, Ziwei Zhao, Songlin Liu, Jianghang Ning, Gen Li, Jianwei Hu, Yongbin Dong, Feng Luo, Sisi Wen, Qi Zhang, Yuan Chen, Jiale Feng, Yang Bai, Chengcai Yao, Zhe Liu, Xin Hu, Yang Lv, Zhuo Jiang (ByteDance China); Jiao Zhang, Tao Huang (Beijing University of Posts and Telecommunications)
Abstract: As the number of data center servers grows into the millions, and as demand rises for more accurate, rapid, and powerful network fault detection and localization, the existing Pingmesh-centric monitoring and diagnostic system is no longer efficient enough. In this paper, we propose ByteTracker, the first agentless probing and diagnostic system for large-scale data center networks. It requires no probe processes or configuration on end hosts; all probes are launched by a small number of centralized Probers. ByteTracker achieves accurate, real-time probe path tracking with packet mirroring on switches. By reducing end-host probe noise, precisely identifying network timeout probes, accurately tracking probe paths, and marking the failed switch with multiple network timeout probes, ByteTracker can locate network failures with nearly 100% accuracy. We have deployed ByteTracker in all of our data centers for over half a year. During deployment, ByteTracker has detected almost all network anomalies and located them within 5 seconds with 100% accuracy.
08:30 — 10:30 | NetAI
Session Chair: Costin Raiciu
MixNet: A Runtime Reconfigurable Optical-Electrical Fabric for Distributed Mixture-of-Experts Training
Xudong Liao, Yijun Sun (Hong Kong University of Science and Technology); Han Tian (The University of Science and Technology of China); Xinchen Wan (Hong Kong University of Science and Technology); Yilun Jin (The Hong Kong University of Science and Technology); Zilong Wang, Zhenghang Ren, Xinyang Huang, Wenxue Li (Hong Kong University of Science and Technology); Kin Fai Tse (ITSC, The Hong Kong University of Science and Technology); Zhizhen Zhong (Massachusetts Institute of Technology); Guyue Liu (Peking University); Ying Zhang (Meta); Xiaofeng Ye (EmbedWay); Yiming Zhang (Xiamen University); Kai Chen (Hong Kong University of Science and Technology)
Abstract: Mixture-of-Expert (MoE) models outperform conventional models by selectively activating different subnets, named experts, on a per-token basis. This gated computation generates dynamic communications that cannot be determined beforehand, challenging the existing GPU interconnects that remain static during distributed training. In this paper, we advocate for a first-of-its-kind system, called MixNet that unlocks topology reconfiguration during distributed MoE training. Towards this vision, we first perform a production measurement study and show that the MoE dynamic communication pattern has strong locality, alleviating the requirement of global reconfiguration. Based on this, we design and implement a regionally reconfigurable high-bandwidth domain that augments existing electrical interconnects using optical circuit switching (OCS), achieving scalability while maintaining rapid adaptability. We build a fully functional MixNet prototype with commodity hardware and a customized collective communication runtime. Our prototype trains state-of-the-art MoE models with in-training topology reconfiguration across 32 A100 GPUs. Large-scale packet-level simulations show that MixNet achieves performance comparable to a non-blocking fat-tree fabric while boosting the networking cost efficiency (e.g., performance per dollar) of four representative MoE models by 1.2×-1.5× and 1.9×-2.3× at 100 Gbps and 400 Gbps link bandwidths, respectively.
Orderlock: A New Type of Deadlock and its Implications on High Performance Network Protocol Design
Weihao Jiang, Wenli Xiao, Yuqing Yang (Shanghai Jiao Tong University); Peirui Cao (Nanjing University); Shizhen Zhao (Shanghai Jiao Tong University)
Abstract: In the pursuit of designing high-performance network (HPN) protocols, three critical features for effective transmission have been extensively studied: In-order Delivery, Lossless Transmission, and Out-of-order Capability. However, no practical implementation has successfully achieved all three simultaneously. We identify and prove that the simultaneous realization of these features constitutes a necessary and sufficient condition for a new type of deadlock, which we term Orderlock. We demonstrate that operating in an Orderlock-risky network is impractical and conduct a comprehensive exploration and comparison of Orderlock-free protocols, through a case study tuning AI workload performance. From an Orderlock-prevention perspective, our findings provide insights into the requirements for future HPN protocol and hardware designs.
MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism
Ruidong Zhu (School of Computer Science, Peking University); Ziheng Jiang (ByteDance Ltd); Chao Jin (School of Computer Science, Peking University); Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu (ByteDance Ltd); Xuanzhe Liu, Xin Jin (School of Computer Science, Peking University); Xin Liu (ByteDance Ltd)
Abstract: Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. However, its sparsely activated architecture shifts feed-forward networks (FFNs) from being compute-intensive to memory-intensive during inference, leading to substantially lower GPU utilization and increased operational costs.
We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models. MegaScale-Infer disaggregates attention and FFN modules within each model layer, enabling independent scaling, tailored parallelism strategies, and heterogeneous deployment for both modules. To fully exploit disaggregation in the presence of MoE's sparsity, MegaScale-Infer introduces ping-pong pipeline parallelism, which partitions a request batch into micro-batches and shuttles them between attention and FFNs for inference. Combined with distinct model parallelism for each module, MegaScale-Infer effectively hides communication overhead and maximizes GPU utilization. To adapt to disaggregated attention and FFN modules and minimize data transmission overhead (e.g., token dispatch), MegaScale-Infer provides a high-performance M2N communication library that eliminates unnecessary GPU-to-CPU data copies, group initialization overhead, and GPU synchronization. Experimental results indicate that MegaScale-Infer achieves up to 1.9× higher per-GPU throughput than state-of-the-art solutions.
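To visualize the ping-pong idea (a scheduling toy, not MegaScale-Infer's runtime; stage times are idealized as equal-length steps), the sketch below interleaves micro-batches so that the attention module works on micro-batch i+1 while the FFN/expert module works on micro-batch i:

    def ping_pong_schedule(num_micro_batches):
        """Return (time_step, stage, micro_batch) events for one layer."""
        events = []
        for t in range(num_micro_batches + 1):
            if t < num_micro_batches:
                events.append((t, "attention", t))        # attention on micro-batch t
            if t >= 1:
                events.append((t, "ffn/experts", t - 1))  # FFN on the previous micro-batch
        return events

    for step, stage, mb in ping_pong_schedule(4):
        print(f"t={step}: {stage:12s} micro-batch {mb}")

With four micro-batches, both modules are busy in every interior step, which is the overlap that hides the dispatch/combine communication between them.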
Astral: A Datacenter Infrastructure for Large Language Model Training at Scale
Qingkai Meng, Hao Zheng, Zhenhui Zhang (Nanjing University); ChonLam Lao (Harvard University); Chengyuan Huang (Nanjing University); Baojia Li, Ziyuan Zhu, Hao Lu, Weizhen Dang, Zitong Lin, Weifeng Zhang, Lingfeng Liu, Yuanyuan Gong, Chunzhi He, Xiaoyuan Hu, Yinben Xia, Xiang Li, Zekun He, Yachen Wang, Xianneng Zou (Tencent); Kun Yang (Nanjing University); Gianni Antichi (Politecnico di Milano and Queen Mary University of London); Guihai Chen, Chen Tian (Nanjing University)
Abstract: The flourishing of Large Language Models (LLMs) calls for increasingly ultra-scale training. In this paper, we share our experience in designing, deploying, and operating our novel Astral datacenter infrastructure, along with operational lessons and evolutionary insights gained from its production use. Astral has three important innovations: (i) a same-rail interconnection network architecture on tier-2, which enables the scaling of LLM training. To physically deploy this high-density infrastructure, we introduce a distributed high-voltage direct current power system and a new air-liquid integrated cooling system. (ii) a full-stack monitoring system featuring cross-host and hierarchical logging correlation, which diagnoses failures at scale and precisely localizes root causes. (iii) an operator-granular forecasting component Seer that efficiently generates operator execution timelines with acceptable accuracy, aiding in fault diagnosis, model tuning, and network architecture upgrading. Astral infrastructure has been gradually deployed over 18 months, supporting LLM training and inference for multiple customers.
SGLB: Scalable and Robust Global Load Balancing in Commodity AI Clusters
Chenchen Qi (ByteDance); Wenfei Wu (Peking University); Yongcan Wang (ByteDance); Keqiang He (Shanghai Jiao Tong University); Yu-Hsiang (Sean) Kao (ByteDance); Zongying He (Broadcom); Chen-Yu Yen, Zhuo Jiang, Feng Luo (ByteDance); Surendra Anubolu, Yanjin Gao (Broadcom); Bingfeng Lin, Wenda Ni, Yiming Yang, Donglin Wei, Boyang Zhou, Jian Wang, Shan Ding (ByteDance)
Abstract: Internet companies are constructing large-scale AI clusters with commodity Ethernet switches for AI model training to support their businesses. AI training workloads impose stringent network requirements, mandating that cluster networks deliver high peak throughput while maintaining robustness and resilience in the face of link failures. We present SGLB, a distributed, global congestion-aware load balancing system for AI clusters. SGLB operates a control-plane protocol, SyncMesh, to enable a new load balancing abstraction in modern commodity switches—the Global Load Balancing (GLB) engine—which utilizes global congestion information to distribute traffic across all available paths. We address three key challenges in designing SGLB: fast routing convergence to minimize downtime in the event of link failures, scalable maintenance of congestion profiles within the constraints of limited switch hardware resources, and preventing GLB throughput suppression in scenarios where path bandwidths are asymmetric. We prototype SGLB and conduct extensive experiments to evaluate it. SGLB ensures rapid routing convergence in the event of link failures, recovering in as little as 45 μs to guarantee network robustness for long-term, stable model training. Additionally, SGLB effectively load-balances traffic across paths, avoiding those with global congestion, which accelerates All-to-All collective communication by up to 60%.
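A minimal sketch of globally congestion-aware path choice (not the GLB engine itself; the link-utilization table and the bottleneck-based rule are illustrative assumptions):

    def pick_path(paths, global_utilization):
        """Prefer the path with the least-utilized bottleneck link; break ties deterministically."""
        def bottleneck(path):
            return max(global_utilization[link] for link in path)
        return min(paths, key=lambda p: (bottleneck(p), p))

    utilization = {"A-B": 0.9, "A-C": 0.2, "B-D": 0.1, "C-D": 0.3}
    paths = [("A-B", "B-D"), ("A-C", "C-D")]
    print(pick_path(paths, utilization))  # ('A-C', 'C-D'): avoids the congested A-B link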
SyCCL: Exploiting Symmetry for Efficient Collective Communication Scheduling
Jiamin Cao (Alibaba Cloud); Shangfeng Shi (Tsinghua University and Alibaba Cloud); Jiaqi Gao (Alibaba Cloud); Weisen Liu, Yifan Yang (Tsinghua University and Alibaba Cloud); Yichi Xu, Zhilong Zheng, Yu Guan, Kun Qian (Alibaba Cloud); Ying Liu, Mingwei Xu (Tsinghua University); Tianshu Wang, Ning Wang, Jianbo Dong, Binzhang Fu, Dennis Cai, Ennan Zhai (Alibaba Cloud)
Abstract: The performance of collective communication schedules is crucial for the efficiency of machine learning jobs and GPU cluster utilization. Existing open-source collective communication libraries (such as NCCL and RCCL) rely on fixed schedules and cannot adjust to varying topology and model requirements. State-of-the-art collective schedule synthesizers (such as TECCL and TACCL) utilize Mixed Integer Linear Program for modeling but encounter search space explosion and scalability challenges. In this paper, we propose SyCCL, a scalable collective schedule synthesizer that aims to synthesize near-optimal schedules in tens of minutes for production-scale machine-learning jobs. SyCCL leverages collective and topology symmetries to decompose the original collective communication demand into smaller sub-demands within smaller topology subsets. SyCCL proposes efficient search strategies to quickly explore potential sub-demands, synthesizes corresponding sub-schedules, and integrates these sub-schedules into complete schedules. Our 32-A100 testbed and production-scale simulation experiments show that SyCCL improves collective performance by up to 127% while reducing synthesis time by 2 to 4 orders of magnitude compared to state-of-the-art efforts.
13:30 — 15:30 | Network Architecture
Session Chair: James Hongyi Zeng
Fornax: A Hardware-Centric Session Management in Large Public Cloud Network
Heng Yu (Zhongguancun Laboratory); Jian Wang, Jian Zhao, Kai Ren (Tencent); Guozhi Lin (Zhongguancun Laboratory); Baozeng Zhang, Yunpeng Guan, Xin Li, Hao Yin, Jiajun Liang, Liang Wang, Chao Pei, Yachen Wang (Tencent); Xin Jin (Peking University); Jilong Wang (Zhongguancun Laboratory); Congcong Miao (Tencent)
Abstract: SmartNIC is increasingly utilized to accelerate cloud network components. The effectiveness and correctness of hardware acceleration heavily rely on its management mechanism. Unfortunately, traditional management mechanisms adopt software-centric architecture, which treats flow as the basic management unit and completely relies on one-way commands to manage the flow table, making it challenging to support various cloud network scenarios while managing extremely large tables. In this paper, we advocate for a radical new mechanism to shift the management paradigm from software-centric architecture to hardware-centric architecture, which adopts session as the basic management unit and designs two-way protocols to facilitate the management process. We propose and implement a first-of-its-kind system, called Fornax, a novel management architecture for large public cloud networks. At the core of Fornax is leveraging a session-empowered hardware engine to provide various management capabilities. Besides, Fornax utilizes a light-weight software manager to enhance system scalability, and hardware-driven management protocols to improve resource efficiency. Our testbed evaluations demonstrate that Fornax can reduce the software storage usage by 80% and CPU usage by 77% with little hardware resource overhead. Our large-scale production results show that Fornax can manage up to 16M session entries while significantly reducing the resource overhead by over 79%.
NPC: Rethinking Dataplane through Network-aware Packet Classification
Xinyi Zhang, Qianrui Qiu (Computer Network Information Center, CAS; University of Chinese Academy of Sciences); Zhiyuan Xu (University of Chinese Academy of Sciences); Peng He (Independent Researcher); Xilai Liu (Institute of Computing Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences); Kavé Salamatian (LISTIC; Université Savoie Mont Blanc); Changhua Pei, Gaogang Xie (Computer Network Information Center, CAS; University of Chinese Academy of Sciences)
Abstract: Packet classification is a critical component for accurately categorizing traffic in network systems. The efficiency of packet classification algorithms is primarily determined by two key factors: the classifier's data structure and the characteristics of the traffic being classified. While significant efforts have been made to optimize data structures, the potential of leveraging traffic characteristics remains underexplored. In this study, we revisit the network dataplane by integrating the network measurement module with the packet classification module. We propose an innovative Network-aware Packet Classification system (NPC) that utilizes sketch techniques to extract network traffic features. These features guide the construction of decision trees, enabling efficient and adaptable packet classification across diverse network environments. Experimental results demonstrate that the NPC achieves speedups ranging from 1.86× to 23.88× over state-of-the-art algorithms, while significantly reducing memory overhead and construction time, highlighting its practical value in real-world scenarios. Furthermore, integrating NPC into Open vSwitch (OVS) yields throughput improvements of 10.71× to 13.01× compared to the native OVS.
Pegasus: A Universal Framework for Scalable Deep Learning Inference on the Dataplane
Yinchao Zhang, Su Yao, Yong Feng, Kang Chen (Tsinghua University); Tong Li (Renmin University of China); Zhuotao Liu (Tsinghua University); Yi Zhao (Beijing Institute of Technology); Lexuan Zhang, Xiangyu Gao (Tsinghua University); Feng Xiong (Beihang University); Qi Li, Ke Xu (Tsinghua University)
Abstract: The paradigm of Intelligent DataPlane (IDP) embeds deep learning (DL) models on the network dataplane to enable intelligent traffic analysis at line-speed. However, the current use of the match-action table (MAT) abstraction on the dataplane is misaligned with DL inference, leading to several key limitations, including accuracy degradation, limited scale, and lack of generality. This paper proposes Pegasus to address these limitations. Pegasus translates DL operations into three dataplane-oriented primitives to achieve generality: Partition, Map, and SumReduce. Specifically, Partition “divides” high-dimensional features into multiple low-dimensional vectors, making them more suitable for the dataplane; Map "conquers" computations on the low-dimensional vectors in parallel with the technique of fuzzy matching, while SumReduce "combines" the computation results. Additionally, Pegasus employs Primitive Fusion to merge computations, improving scalability. Finally, Pegasus adopts full precision weights with fixed-point activations to improve accuracy. Our implementation on a P4 switch demonstrates that Pegasus can effectively support various types of DL models, including Multi-Layer Perceptron (MLP), Recurrent Neural Network (RNN), Convolutional Neural Network (CNN), and AutoEncoder models on the dataplane. Meanwhile, Pegasus outperforms state-of-the-art approaches with an average accuracy improvement of up to 22.8%, along with up to 248x larger model size and 212x larger input scale.
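The data flow of the three primitives can be mirrored in plain NumPy for one linear layer (this only illustrates the Partition/Map/SumReduce decomposition mathematically; realizing Map with match tables and fuzzy matching on a P4 switch is the paper's contribution and is not shown):

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(4, 12))   # weights of one fully connected layer
    x = rng.normal(size=12)        # high-dimensional input feature vector

    # Partition: split the 12-d input into three 4-d sub-vectors (and W into matching blocks).
    x_parts = np.split(x, 3)
    W_parts = np.split(W, 3, axis=1)

    # Map: compute each partition's partial result independently (on the dataplane this step
    # would be approximated with lookups over the low-dimensional sub-vectors).
    partials = [Wp @ xp for Wp, xp in zip(W_parts, x_parts)]

    # SumReduce: combine the partial results into the layer output.
    y = np.sum(partials, axis=0)
    print(np.allclose(y, W @ x))   # True: the decomposition is exact for a linear layer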
PtCAM: Scalable High-Speed Name Prefix Lookup using TCAM
Tian Song, Tianlong Li, Yating Yang (Beijing Institute of Technology)
Abstract: Name prefix lookup is a core operation of name-based packet forwarding in ICN and of URL redirection in CDNs. This operation is essentially a longest prefix match on URL-like names, which faces significant challenges in high-speed processing due to large name sets, length-unbounded names, and the need for per-packet lookups. In this paper, we propose a scheme to support generic hierarchical name lookups on millions of prefixes with fast, constant, and guaranteed performance using TCAM. Our work has three contributions. First, a scheme, namely PtCAM, is proposed to exploit TCAM for name lookup using a binary Patricia trie. Second, several approaches are proposed to enhance efficiency and scalability. Third, experiments on various name datasets are carried out. The results show that our scheme can achieve high-speed lookup performance nearly equal to the given TCAM throughput. Notably, PtCAM can be implemented on TCAM-based IP linecards, which means that industrial commercial off-the-shelf devices can be directly involved in the advancement of information-centric innovation and deployment.
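For readers less familiar with name-based forwarding, the semantics being accelerated is component-wise longest prefix match on hierarchical names; a plain-Python version (purely illustrative; PtCAM's contribution is encoding this lookup into a binary Patricia trie that fits TCAM) looks like:

    def longest_name_prefix(fib, name):
        """Match the longest registered name prefix, component by component."""
        components = name.strip("/").split("/")
        for i in range(len(components), 0, -1):
            prefix = "/" + "/".join(components[:i])
            if prefix in fib:
                return prefix, fib[prefix]
        return None, None

    fib = {"/cdn/example": "face1", "/cdn/example/video": "face2"}
    print(longest_name_prefix(fib, "/cdn/example/video/seg1.m4s"))  # ('/cdn/example/video', 'face2')
    print(longest_name_prefix(fib, "/cdn/example/index.html"))      # ('/cdn/example', 'face1')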
Scaling SCIERA: A Journey Through the Deployment of a Next-Generation Network
François Wirz (ETH Zurich); Marten Gartner (OVGU Magdeburg); Jelte van Bommel, Elham Ehsani Moghadam (ETH Zurich); Grace H. Cimaszewski (Princeton); Anxiao He, Yizhe Zhang (UVA); Henry Birge-Lee (Princeton); Felix Kottmann (ETH Zurich); Cyrill Krähenbühl (Princeton); Jonghoon Kwon (ETH Zurich); Kyveli Mavromati (Anapaya Systems); Liang Wang (Princeton); Daniel Bertolo (SWITCH); Marco Canini (KAUST); Buseung Cho (KISTI); Ronaldo A. Ferreira (UFMS); Simon Peter Green (SingAREN); David Hausheer (OVGU Magdeburg); Junbeom Hur (Korea University); Xiaohua Jia (CityUHK); Heejo Lee (Korea University); Prateek Mittal (Princeton); Omo Oaiya (WACREN); Chanjin Park (KISTI); Adrian Perrig (ETH Zurich); Jerry Sobieski (GMU / Sobieski Systems & Services); Yixin Sun (UVA); Cong Wang (CityUHK); Klaas Wierenga (GÉANT)
Abstract: The SCION Next-Generation Network (NGN) architecture has expanded steadily since 2017, with today 20+ ISPs offering SCION connectivity. In production, IP-to-SCION-to-IP translation by SCION-IP-Gateways (SIGs) is used, such that applications are unaware of the NGN communication. To accelerate innovation and deployments, our aim is to increase the number of native SCION use cases, where the application is fully SCION-aware and optimizes communication across all path choices offered by the network. We set out to achieve two core objectives: (1) facilitating simple native connectivity for applications, and (2) enhancing the scalability of SCION deployment at academic sites.
With these goals in mind, we built the SCION Education, Research, and Academic (SCIERA) network infrastructure. This paper presents key lessons learned from the SCIERA deployment, which we anticipate will offer actionable insights to researchers, network operators, and system builders seeking to overcome practical challenges in other NGN deployments as well. We report on establishing native SCION connectivity at research and education institutions that can reach 250,000 people across five continents, without relying on BGP. Our evaluation demonstrates that our core objectives were reached. Today, the SCIERA deployment offers tangible real-world benefits to users by providing rich global connectivity through a multitude of inter-domain paths.
RANBooster: Democratizing advanced cellular connectivity through fronthaul middleboxes
Abstract: The 5G Radio Access Network has shifted towards virtualization and disaggregation. This change aims to reduce costs and foster innovation by promoting vendor interoperability and by expanding the ecosystem. In this environment, smaller RAN vendors and open-source projects have emerged, focusing on low-cost, modular stacks. However, challenges such as achieving state-of-the-art performance and accessing data and control knobs hinder their widespread adoption. To address these issues, we propose a middlebox architecture, called RANBooster, that enhances the RAN capabilities without modifying existing network functions, by leveraging the open fronthaul interface. To demonstrate the benefits of the RANBooster framework, we build four reference applications (distributed antenna system, distributed MIMO, RU sharing, real-time physical resource block monitoring), and evaluate them on an enterprise-scale, commercial-grade 5G testbed.
13:30 — 15:30 | NetMon & Hyperscalers
Session Chair: Andrew Moore
Understanding and Profiling CXL.mem Using PathFinder
Xiao Li (University of Wisconsin-Madison/Beihang University); Zerui Guo (University of Wisconsin-Madison); Yuebin Bai (Beihang University); Mahesh Ketkar, Hugh Wilkinson (Intel); Ming Liu (University of Wisconsin-Madison)
Abstract: CXL.mem and the resulting memory pool are promising and becoming widely deployed. Unlike local DIMM, CXL DIMMs stay at the I/O subsystem (i.e., FlexBus) and can easily stall the processor pipeline and memory subsystem, jeopardizing co-located applications. However, our community lacks a tool to understand and profile CXL.mem execution end-to-end between CPU and remote DIMM.
This paper presents PathFinder, a systematic utility to capture the CXL.mem execution in a server system. Our key idea is to model the end-to-end CXL.mem data path as an execution graph based on architectural performance counters and apply graph analytical techniques. PathFinder associates CXL and competing memory flows with underlying hardware, creates a series of snapshots, and identifies the performance bottlenecks. We apply PathFinder to a number of real-world use cases and demonstrate its efficacy in application profiling, bottleneck analysis, and performance anomaly diagnosis.
PreTE: Traffic Engineering with Predictive Failures
Congcong Miao (Tencent); Zhizhen Zhong (Massachusetts Institute of Technology); Yiren Zhao (Tencent); Arpit Gupta (UCSB); Ying Zhang (Meta); Sirui Li, Zekun He, Xianneng Zou (Tencent); Jilong Wang (Tsinghua University)
Abstract: Fiber links in wide-area networks (WANs) are exposed to complicated environments and hence are vulnerable to failures like fiber cuts. The conventional approach of using static probabilistic failures falls short in fiber-cut scenarios because these fiber cuts are rare but disruptive, making it difficult for network operators to balance network utilization and availability in WAN traffic engineering. Our large-scale measurements of per-second optical-layer data reveal that a fiber's failure probability increases by several orders of magnitude when it experiences a rare and ephemeral degradation state. Therefore, we present a novel traffic engineering (TE) system called PreTE that factors dynamic fiber-cut probabilities directly into TE decisions. At the core of PreTE, fiber degradation signals enable failure prediction and proactive updates of traffic tunnels, followed by traffic allocation optimization among the updated tunnels. We evaluate PreTE using a production-level WAN testbed and large-scale simulations. The testbed evaluation quantifies PreTE's runtime to demonstrate the feasibility of implementing it in large-scale WANs. Our large-scale simulation results show that PreTE can support up to 2× more demand at the same level of availability as compared to existing TE schemes.
S2: A Distributed Configuration Verifier for Hyper-Scale Networks
Dan Wang, Peng Zhang, Wenbing Sun, Wenkai Li (Xi'an Jiaotong University); Xing Feng (NorthWest University); Hao Li (Xi'an Jiaotong University); Jiawei Chen, Weirong Jiang, Yongping Tang (ByteDance)
Abstract: Network configuration verifiers can proactively reason about a network's correctness to prevent network outages. However, even though recent efforts have proposed algorithms to "scale up" verification to several thousand switches, these algorithms still cannot handle networks with more than 10K switches or 1000M routes, which are common for large service providers. In this paper, instead of further scaling up verification on a single server, we study how to "scale out" the verification using the resources of multiple servers. To achieve this, we propose S2, a distributed verifier for network configurations. S2 partitions the network model and distributes the verification tasks, i.e., control plane simulation and data plane verification, to run on multiple servers in parallel. Additionally, S2 uses prefix sharding during control plane simulation to further reduce the memory footprint on each server. We implement a prototype of S2 based on Batfish, the state-of-the-art network verifier. Based on real datacenter topologies of a large service provider and synthetic FatTree topologies, we show that S2 can verify networks with 10K routers and 1000M routes within 2 hours.
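A minimal sketch of what prefix sharding can look like (illustrative only, not S2's partitioning logic): each destination prefix is hashed to one verification server, so the per-server route state shrinks roughly linearly with the number of servers.

    import hashlib
    import ipaddress

    def shard_of(prefix, num_servers):
        """Deterministically assign a destination prefix to one verification server."""
        net = ipaddress.ip_network(prefix)
        digest = hashlib.sha1(str(net).encode()).digest()
        return int.from_bytes(digest[:4], "big") % num_servers

    for p in ["10.0.0.0/8", "10.1.0.0/16", "192.168.0.0/24", "2001:db8::/32"]:
        print(p, "-> server", shard_of(p, 4))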
New Evolution of Hoyan: Enhancing Scalability, Usability, and Accuracy for Alibaba's Global WAN Verification
Yifei Yuan, Fangdan Ye, Yifan Li, Jingkai Zhang, Mengqi Liu, Yuyang Sang, Ruizhen Yang, Duncheng She, Zhiqing Ye, Tianchen Guo, Xiaobo Zhu, Xinji Tang, Li Jia, Zhongyu Guan, Lingpeng Su, Ci Wang, Ruiyang Feng, Shuo Wu, Zhonghui Xie, Cheng Jin, Peng Zhang (Alibaba Cloud); Qing Ma (Alibaba Group); Xianlong Zeng, Dennis Cai, Ennan Zhai (Alibaba Cloud)
Abstract: The network verification system Hoyan has been deployed for Alibaba Cloud’s wide-area network for years and achieved considerable success in preventing misconfiguration-caused network incidents. However, recent years have seen the emergence of new challenges in scalability, usability, and accuracy for Hoyan. This paper presents the new evolution of Hoyan to address these challenges. First, to support the large increase in the number of routers and prefixes on our WAN, Hoyan’s simulation has evolved from a centralized fashion to a distributed framework, which improves the efficiency by 5 times and can scale to O(10^4) routers, millions of prefixes, and billions of flows. Second, to improve Hoyan’s usability in checking route change intents, we developed a specification language RCL, which supports the expressive specification and automatic verification of route change intents. Third, to ensure high accuracy, we enhanced Hoyan’s accuracy diagnosis framework, which helped us identify and fix dozens of implementation and modeling issues. Hoyan is used on a daily basis for our WAN. It supports O(100) verification requests each week, prevents O(10) incidents each year, and helps reduce the percentage of misconfiguration-caused network incidents from 56% to 5%.
Lessons learned from operating a large network telescope
Alexander Männel, Jonas Mücke (TU Dresden); KC Claffy (CAIDA/UC San Diego); Max Gao (UC San Diego); Ricky K. P. Mok (CAIDA/UC San Diego); Marcin Nawrocki (NETSCOUT); Thomas C. Schmidt (HAW Hamburg); Matthias Wählisch (TU Dresden)
Abstract: Network telescopes (aka darknets) collect unsolicited Internet traffic (aka Internet background radiation or IBR), which includes benign and malicious scanning as well as artifacts of spoofed denial-of-service attacks and misconfigured software and hosts. Analysis of this traffic has revealed macroscopic insights into security-related events and global network dynamics such as outages. Operating a large-scale network telescope is challenging but often taken for granted, more so than in more mature scientific disciplines. We offer the first study documenting our experiences operating the UCSD Network Telescope, the largest and longest-operating network telescope supporting scientific research. We provide background on the history of the telescope, and focus on increasing operational challenges as the underlying network evolves. We develop and apply techniques to leverage third-party scanning activity to validate the integrity of the data, and to discover misconfigurations in the instrumentation. These insights are crucial for understanding measurement results, which we illustrate using concrete examples. We discuss how our findings generalize to support the expanding ecosystem of other passive techniques, such as honeypots, to track security phenomena.
Unlocking Superior Performance in Reconfigurable Data Center Networks with Credit-Based Transport
Federico De Marchi, Jialong Li (Max Planck Institute for Informatics); Ying Zhang (Meta); Wei Bai (NVIDIA); Yiting Xia (Max Planck Institute for Informatics)
Abstract: The large-scale, end-to-end implementation of microsecond-switched reconfigurable data center networks (RDCNs), coupled with innovative routing and topology designs that provide continuous routes abstracting away frequent topology changes, demonstrates promise as a viable alternative to Clos networks in the post-Moore’s Law era. However, the gap remains in transport performance, with current transport solutions falling short of unlocking their full performance. In this paper, we introduce Flare, a novel credit-based transport protocol that ensures reliable traffic delivery, low latency, and leverages the rapidly reconfiguring circuits of the RDCN to opportunistically route traffic over short paths, maximizing throughput. In simulations, Flare enables RDCNs to outperform Clos networks, achieving up to 1.15× higher throughput even under adversarial traffic. Additionally, it delivers up to 2× and 1.5× higher throughput than NDP and ExpressPass, and up to 10×, 15×, and 3.5× shorter flow completion time (FCT) than ExpressPass, TDTCP, and Bolt. Our testbed implementation further demonstrates the feasibility of Flare’s mechanisms with programmable switches and DPDK.
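To recap the generic receiver-driven idea that credit-based transports build on (a toy model, not Flare's protocol; credit pacing and tick granularity are assumptions): the receiver paces credits at the rate it can absorb, and the sender transmits at most one packet per credit it holds.

    from collections import deque

    class Receiver:
        def __init__(self, credits_per_tick=2):
            self.credits_per_tick = credits_per_tick  # paced to the receiver's capacity
        def issue_credits(self):
            return self.credits_per_tick

    class Sender:
        def __init__(self, packets):
            self.queue = deque(packets)
            self.credits = 0
        def on_credits(self, n):
            self.credits += n
        def tick(self):
            sent = []
            while self.credits and self.queue:
                self.credits -= 1
                sent.append(self.queue.popleft())
            return sent

    rx, tx = Receiver(), Sender([f"pkt{i}" for i in range(5)])
    for t in range(4):
        tx.on_credits(rx.issue_credits())
        print(f"tick {t}: sent {tx.tick()}")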
Day 4 - Thursday, September 11 2025
08:30 — 10:30 | Network Architecture & Satellites
Session Chair: Marwan Fayed
From ATOP to ZCube: Automated Topology Optimization Pipeline and A Highly Cost-Effective Network Topology for Large Model Training
Zihan Yan, Dan Li (Tsinghua University); Li Chen (Zhongguancun Laboratory); Dian Xiong (Harnets.AI); Kaihui Gao (Zhongguancun Laboratory); Yiwei Zhang, Rui Yan (Tsinghua University); Menglei Zhang, Bochun Zhang, Zhuo Jiang, Jianxi Ye, Haibin Lin (ByteDance)
Abstract: The development of large language models (LLMs) poses new challenges in data center network topology design. To assist in exploring topology design, we propose ATOP, an Automated Topology Optimization Pipeline, which models network topology as a set of hyperparameters, enabling the discovery of potential topologies. With various optimization algorithms and customizable optimization objectives, ATOP achieves automated topology optimization on a scale of tens of thousands of GPUs. We apply ATOP on network topologies for 256, 1024, 4096, and 16384 GPUs, optimizing performance under LLM training traffic patterns, collective communication performance, fault tolerance, and network cost. We also evaluate ATOP in different scenarios: building, optimizing, and expanding a data center. From ATOP's results, we discover a new topology --- ZCube, which reaches the highest cost-effectiveness across various GPU scales. Simulation results show that ZCube, compared to the previous state-of-the-art topologies, including Rail-optimized Fat-tree (ROFT), Rail-only, and HPN, improves end-to-end LLM training speed by 3% to 7% and reduces network hardware costs by 26% to 46%. We also construct ZCube on a real-world testbed. Results show that ZCube reduces hardware costs by 25% compared to Rail-Optimized Topology while maintaining the same all-reduce and all-to-all performance.
Reliable and Decentralized Certificate Revocation via DNS: The Case for RevDNS
Protick Bhowmick (Virginia Tech); Dave Levin (University of Maryland); Taejoong Chung (Virginia Tech)
Abstract: The Online Certificate Status Protocol’s long slide—after 25 years of soft-fail rules, privacy leakage, and shaky infrastructure—exposes a deeper failure in web-PKI revocation. Certificate Authorities increasingly route OCSP traffic through CDNs for speed, yet this recentralizes trust: our measurements show Akamai serves 62 percent of all revocation responses, creating single points of failure and betraying PKI’s decentralized ideals.
We present RevDNS, a DNS-based revocation scheme that drops CDN dependence while preserving real-time guarantees. Revoked serial numbers live in DNSSEC-signed TXT records; NSEC proofs allow aggressive negative caching, so recursive resolvers answer 99.8 percent of checks without bothering a CA. From 1.1 billion certificates and 5 million revocations, we find a large CA such as Let’s Encrypt can publish data for 612 million certificates in a 345MB zone, with resolvers shouldering nearly every lookup.
Because answers piggyback on ordinary DNS lookups, RevDNS adds no latency and discloses no more about users than standard DNS traffic. By keeping revocation authority with CAs and avoiding fragile hacks like short-lived certificates, RevDNS delivers a durable, decentralized path for TLS revocation—one that finally aligns operational practicality with the web’s security ambitions.
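The client-side decision logic can be sketched as follows (the zone layout, record naming, and stubbed resolver below are illustrative assumptions, not RevDNS's actual record format): a DNSSEC-validated TXT answer means the serial is revoked, while a validated negative answer (backed by NSEC) means it is not.

    def revocation_qname(serial_hex, issuer_zone):
        # Hypothetical naming scheme: <serial>.<issuer revocation zone>.
        return f"{serial_hex.lower()}.{issuer_zone}"

    def is_revoked(serial_hex, issuer_zone, dnssec_lookup_txt):
        """dnssec_lookup_txt returns TXT strings for a name, or None on a validated NXDOMAIN."""
        return dnssec_lookup_txt(revocation_qname(serial_hex, issuer_zone)) is not None

    # Example with a stubbed, already-validated resolver holding one revoked serial.
    zone_data = {"03a1f2.revocations.ca.example": ["revoked;reason=keyCompromise"]}
    lookup = zone_data.get
    print(is_revoked("03A1F2", "revocations.ca.example", lookup))    # True
    print(is_revoked("deadbeef", "revocations.ca.example", lookup))  # False (negative answer)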
SaTE: Low-Latency Traffic Engineering for Satellite Networks
Hao Wu, Yizhan Han (Department of Computer Science, National University of Singapore); Mohit Rajpal (National University of Singapore); Qizhen Zhang (University of Toronto); Jingxian Wang (National University of Singapore)
Abstract: This paper explores traffic engineering (TE) for large-scale Low-Earth-Orbit satellite constellations. While there is rich prior work on TE algorithms for global cloud wide-area networks (WANs), they are designed for static network topologies and often require significant computation time for large-scale networks. Such limitations make existing WAN TE algorithms unsuitable for large-scale satellite networks which rapidly change topology and require computing optimal traffic allocation under stringent latency constraints.
We present SaTE, a low-latency TE algorithm for large-scale satellite networks, computing traffic allocation at millisecond latency. SaTE formulates a heterogeneous graph to model the TE problem, adapting to dynamic satellite topologies. By removing redundant graph relations, SaTE reduces computational latency, allowing the graph to be efficiently learned by a graph neural network that leverages GPUs to rapidly infer traffic allocations. SaTE also exploits the similarity of satellite network topologies and the geospatial distribution of traffic demands to facilitate model training. We evaluate SaTE through extensive data-driven simulation on today’s largest satellite constellation, Starlink with 4236 satellites. Our results show over a 23.5% improvement in satisfied demand with an average TE runtime of 17 ms, achieving a 2738× speedup compared to commercial solvers.
Small-scale LEO Satellite Networking for Global-scale Demands
Abstract: Do we really need 10,000s of Low Earth Orbit (LEO) satellites to meet huge global Internet demands? While proven feasible and valuable, such LEO mega-constellation networks have raised concerns about their prohibitive capital expenditures, market monopoly, and unsustainable use of space. Instead, our analysis reveals that most of their satellites can be wasted due to their mismatch with physically uneven demands. We thus propose TinyLEO, a software-defined solution to shrink LEO network size for enormous global demands via dynamic spatiotemporal supply-demand matching. TinyLEO sparsifies satellite supplies on demand by combining diverse yet sparse orbits, hides the complexities of this sparse LEO network via orbital model predictive control, and shifts the responsibility for handling these complexities to its geographic segment anycast for higher network usability, lower resource wastes, faster failovers, simpler satellites, and more flexible network orchestration. We have prototyped TinyLEO as a community toolkit for open research. Our evaluation using this toolkit shows that TinyLEO can compress the existing LEO mega-constellation network size by 2.0-7.9×, cut control plane costs by 1–3 orders of magnitude, and maintain the same demands and comparable data plane performance.
Direct-to-Cell Satellite Network without Satellite Navigation
Wei Liu, Yuanjie Li, Jingyi Lan, Hewu Li, Yimei Chen, Lixin Liu, Jiabo Yang, Xi Long, Li Ouyang, Minghao Tang, Jianping Wu, Qian Wu, Jun Liu, Zeqi Lai (Tsinghua University)
Abstract: Direct-to-cell satellites enable global network services for our regular phones/IoTs via 4G, 5G, and beyond. To enforce highly available, trustworthy, and roaming policy-compliant network services, they heavily rely on user geolocation and timing information from external global navigation satellite systems (GNSS) to assist with their radio access, authentication, and authorization. Our analysis and field tests reveal that this cross-technology over-reliance propagates satellite navigation’s defects to direct-to-cell satellite networks, leading to diverse issues such as intermittent connectivity, over/under-billing, unauthorized services, and service denials even when direct-to-cell satellites are accessible. Our solution, SN2, adopts the "fate-sharing" principle to reuse direct-to-cell satellites themselves for self-navigating networks. By exploiting the flexible tradeoffs between satellite network availability and navigation accuracy, it enables "good enough" built-in navigation for highly available and functionally correct network services at a negligible cost of hardware or communication resources. Our evaluations with commodity satellite phones and the 3GPP NTN protocol stacks demonstrate SN2's 6.6-23.5× network availability boost and 2.8-12.3× access latency reduction over legacy solutions.
StarCDN: Moving Content Delivery Networks to Space
William X. Zheng, Aryan Taneja, Maleeha Masood (University of Illinois Urbana-Champaign); Anirudh Sabnis (Akamai Technologies); Ramesh K. Sitaraman (UMass Amherst & Akamai Technologies); Deepak Vasisht (University of Illinois Urbana-Champaign)
Abstract: Low Earth Orbit (LEO) satellite networks, such as Starlink, provide global internet access and currently serve content to millions of users. Recent work has shown that existing network infrastructures, such as Content Delivery Networks (CDNs), are not well-suited to satellite network architectures. Traditional terrestrial CDNs degrade performance for satellite network users and do not alleviate the congestion in the ground-satellite links. We design StarCDN, a new CDN architecture that caches content in space to improve user experience and reduce ground-satellite bandwidth usage. The fundamental challenge in designing StarCDN lies in the orbital motion of satellites, which causes each satellite’s coverage area to change rapidly, serving vastly different regions (e.g., US and Europe) within minutes. To address this, we introduce new consistent hashing and relayed fetching schemes tailored to LEO satellite networks. Our design enables cached content to flow in the opposite direction of the orbital motion to counter satellite motion. We evaluate StarCDN against multiple baselines using real-world traces from Akamai. Our evaluation demonstrates that StarCDN can reduce the ground-to-satellite bandwidth utilization by 80% and improve user-perceived latency by 2.5×. Further, we make available an open-source trace generator, SpaceGEN, for realistic simulations of satellite-based CDNs.
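StarCDN's lookup builds on consistent hashing, adapted so that content ownership can follow the orbital motion of the caches. As background for that adaptation, here is a plain consistent-hash ring sketch; the class, virtual-node count, and hash choice are illustrative, and the paper's orbit-aware extensions are not shown.

```python
import hashlib
from bisect import bisect_right, insort

def _h(key: str) -> int:
    """Map a key to a point on the hash ring."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Plain consistent hashing: a content key maps to the first cache
    clockwise on the ring, so adding or removing a cache only remaps the
    keys in its arc -- the property a satellite CDN can adapt as caches
    drift across regions."""

    def __init__(self, vnodes: int = 64):
        self.vnodes = vnodes
        self.points = []          # sorted hash points
        self.owner = {}           # hash point -> cache id

    def add_cache(self, cache_id: str):
        for v in range(self.vnodes):
            p = _h(f"{cache_id}#{v}")
            insort(self.points, p)
            self.owner[p] = cache_id

    def remove_cache(self, cache_id: str):
        for v in range(self.vnodes):
            p = _h(f"{cache_id}#{v}")
            self.points.remove(p)
            del self.owner[p]

    def lookup(self, content_key: str) -> str:
        p = _h(content_key)
        i = bisect_right(self.points, p) % len(self.points)
        return self.owner[self.points[i]]

ring = ConsistentHashRing()
for sat in ["sat-101", "sat-102", "sat-103"]:
    ring.add_cache(sat)
print(ring.lookup("/video/segment-42.ts"))
```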
08:30 — 10:30 | NetAI & Wireless
Session Chair: Andreas Kassler
ByteScale: Communication-Efficient Scaling of LLM Training with a 2048K Context Length on 16384 GPUs
Hao Ge (Peking University); Junda Feng, Qi Huang (ByteDance Inc.); Fangcheng Fu (Shanghai Jiao Tong University); Xiaonan Nie, Lei Zuo, Haibin Lin (ByteDance Inc.); Bin Cui (Peking University); Xin Liu (ByteDance Inc.)
Abstract: Scaling long-context ability is essential for Large Language Models (LLMs). To amortize the memory consumption across multiple devices in long-context training, inter-data partitioning (a.k.a. Data Parallelism) and intra-data partitioning (a.k.a. Context Parallelism) are commonly used. Current training frameworks predominantly treat the two techniques as orthogonal, and establish static communication groups to organize the devices as a static mesh (e.g., a 2D mesh). However, the sequences used for LLM training typically vary in length, whether for text, multimodal data, or reinforcement learning. The mismatch between data heterogeneity and the static mesh causes redundant communication and imbalanced computation, degrading training efficiency.
In this work, we introduce ByteScale, an efficient, flexible, and scalable LLM training framework for large-scale mixed training of long and short sequences. The core of ByteScale is a novel parallelism strategy, namely Hybrid Data Parallelism (HDP), which unifies the inter- and intra-data partitioning with a dynamic mesh design. In particular, we build a communication optimizer, which eliminates the redundant communication for short sequences by data-aware sharding and dynamic communication, and further compresses the communication cost for long sequences by selective offloading. In addition, we develop a balance scheduler to mitigate the imbalanced computation by parallelism-aware data assignment. We evaluate ByteScale with model sizes ranging from 7B to 141B and context lengths from 256K to 2048K, on a production cluster with 16384 GPUs. Experiment results show that ByteScale outperforms the state-of-the-art training system by up to 7.89×.
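ByteScale's balance scheduler performs parallelism-aware data assignment so that devices receive comparable work despite highly variable sequence lengths. The sketch below shows one generic way to do that kind of assignment, assuming a quadratic (attention-dominated) cost per sequence and a greedy longest-first rule; both choices are ours, not ByteScale's.

```python
import heapq

def assign_sequences(seq_lens, num_devices):
    """Greedy parallelism-aware assignment: place the costliest sequence
    on the currently least-loaded device. Cost ~ L^2 reflects the
    quadratic attention FLOPs that make long sequences dominate."""
    heap = [(0.0, d, []) for d in range(num_devices)]  # (load, device, seqs)
    heapq.heapify(heap)
    for length in sorted(seq_lens, reverse=True):
        load, dev, seqs = heapq.heappop(heap)
        seqs.append(length)
        heapq.heappush(heap, (load + length ** 2, dev, seqs))
    return sorted(heap, key=lambda entry: entry[1])

for load, dev, seqs in assign_sequences([256, 2048, 512, 1024, 128, 4096], 2):
    print(f"device {dev}: load={load:.0f}, seqs={seqs}")
```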
Making Cellular Networks More Efficient By Roaming-in-Place
Tenzin Samten Ukyab (UC Berkeley); Lisa Suzuki (Waseda University); Demetrius Davis (Virginia Tech); Zhihong Luo, Silvery Fu (UC Berkeley); Shaddi Hasan (Virginia Tech); Sylvia Ratnasamy (UC Berkeley); Scott Shenker (ICSI and UC Berkeley)
Abstract: We propose Roaming-in-Place (RinP), a technique for dynamically sharing capacity across mobile network operators. RinP is a new form of infrastructure sharing that expands the traditional notion of roaming in cellular networks such that users may roam between operators with overlapping coverage areas based on load and performance conditions. Using simulation and small-scale experiments, we show that deploying RinP would allow operators to run their networks at higher utilization and provide users with higher availability and performance, while achieving 30-40% infrastructure savings in our typical evaluation scenarios. We present a design for RinP that can be deployed incrementally, and build a prototype RinP testbed to show that the design can be realized with modest changes to existing cellular infrastructure, requires no change to current protocol standards, and adds minimal latency overhead.
Enabling Over-the-Air AI for Edge Computing via Metasurface-Driven Physical Neural Networks
Abstract: We present MetaAI, a novel wireless computing paradigm that integrates neural network computation directly into wireless signal propagation. Unlike traditional approaches that treat wireless channels as mere data conduits, MetaAI transforms them into active computing elements through programmable metasurfaces, enabling concurrent data transmission and neural network processing. By leveraging the inherent linearity of both wireless propagation and neural networks, our design resolves the fundamental mismatch between sequential wireless transmission and parallel neural computation, while supporting efficient multi-sensor late-stage data fusion. We implemented MetaAI using metasurfaces at both dual-band (2.4/5 GHz) and single-band (3.5 GHz) frequencies. Extensive experiments demonstrate robust performance across diverse classification tasks, achieving 82.8% average accuracy (up to 89.8%) even with a simple linear architecture. Multi-sensor fusion further improves accuracy by up to 27.06%. MetaAI represents a fundamental shift in Edge AI architecture, where wireless infrastructure becomes an integral part of the computing pipeline.
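The enabling observation is that a linear wireless channel and a linear neural-network layer compose into a single linear map, so the layer can, in principle, be applied by propagation itself. Below is a minimal linear-algebra illustration of that composition; the random matrices stand in for the metasurface and channel, and the inverse-based precoding is our simplification of the idea, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(size=4)        # sensor features to transmit
W = rng.normal(size=(3, 4))   # linear layer we want computed "over the air"
H = rng.normal(size=(3, 3))   # linear channel / metasurface response

# Conventional pipeline: transmit x, then compute W @ x at the receiver.
received_then_computed = W @ x

# Over-the-air idea: precode so that propagation itself applies W.
# Because both maps are linear, H @ (P @ x) == W @ x when P = inv(H) @ W.
P = np.linalg.inv(H) @ W
over_the_air = H @ (P @ x)

print(np.allclose(received_then_computed, over_the_air))  # True
```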
Towards Next-Generation Global IoT: Empowering Massive Connectivity with Harmonious Multi-Network Coexistence
Ziyue Zhang, Xianjin Xia, Ruonan Li, Yuanqing Zheng (The Hong Kong Polytechnic University)
Abstract: LoRaWAN offers a compelling solution for delivering cost-effective network access to millions of IoT devices worldwide. However, operators face challenges in scaling their services to meet the growing demands of IoT connections. Moreover, current LoRaWANs foster competition rather than cooperation among coexisting networks, resulting in substantial capacity degradation as network density increases. To identify the root causes limiting LoRaWAN scalability and to enable harmonious coexistence among network operators, this paper conducts an in-depth investigation of operational LoRaWANs. For the first time, our study reveals that the capacity degradation in LoRaWAN is not due to traditionally believed issues (such as wireless contention or interference) but rather a newly-identified decoder contention problem. This problem cannot be resolved using conventional approaches and hinders the scaled deployment of LoRaWANs as a global IoT infrastructure. Based on our new findings, we propose design principles that guide our exploration for effective strategies to address this emerging practical problem. We develop concrete deployable solutions to mitigate contention, optimize spectrum utilization, and promote spectrum sharing among network operators. Extensive evaluations demonstrate that our strategies effectively boost network capacity close to the theoretical bound, and support the coexistence of up to six networks with significant improvement in spectrum efficiency.
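The newly-identified bottleneck is decoder contention: a gateway has only a finite pool of decoder instances, so packets can be lost even when the wireless channel itself is collision-free. The back-of-the-envelope loss model below is in that spirit; the pool size, arrival rate, and decode time are invented numbers, not measurements from the paper.

```python
import random

def decoder_contention(arrival_rate, decode_time_s, pool_size, sim_s=1000, seed=0):
    """Poisson packet arrivals at a gateway with `pool_size` decoder
    instances, each busy for `decode_time_s` per packet; arrivals that
    find every decoder busy are dropped (an Erlang-loss-style model)."""
    rng = random.Random(seed)
    busy_until = [0.0] * pool_size
    t, arrived, dropped = 0.0, 0, 0
    while t < sim_s:
        t += rng.expovariate(arrival_rate)
        arrived += 1
        i = min(range(pool_size), key=lambda k: busy_until[k])
        if busy_until[i] > t:
            dropped += 1             # all decoders busy: packet lost
        else:
            busy_until[i] = t + decode_time_s
    return dropped / arrived

# Even a modest offered load can overwhelm a small decoder pool.
print(f"loss ratio: {decoder_contention(arrival_rate=8, decode_time_s=1.0, pool_size=4):.2%}")
```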
Abstract: Underwater backscatter is a promising technology for ultra-low-power underwater networking, but existing systems break down in mobile scenarios. This paper presents EchoRider, the first system to enable reliable underwater backscatter networking under mobility.
EchoRider introduces three key components. First, it incorporates a robust and energy-efficient downlink architecture that uses chirp-modulated transmissions at the reader and a sub-Nyquist chirp decoder on backscatter nodes - bringing the resilience of LoRa-style signaling to underwater backscatter while remaining ultra-low-power. Second, it introduces a NACK-based full-duplex retransmission protocol, enabling efficient, reliable packet delivery. Third, it implements a Doppler-resilient uplink decoding pipeline that includes adaptive equalization, polar coding, and dynamic retraining to combat channel variation.
We built a full EchoRider prototype and evaluated it across over 1,200 real-world mobile experiments. EchoRider improves bit error rate by over 125× compared to a state-of-the-art baseline and maintains underwater goodput of 0.8 kbps at speeds up to 2.91 knots. In contrast, the baseline fails at speeds as low as 0.17 knots. Finally, we demonstrate EchoRider in end-to-end deployments involving mobile drones and sensor nodes, showing its effectiveness in practical underwater networked applications.
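One of EchoRider's reliability building blocks is a NACK-based retransmission protocol. The snippet below sketches the generic selective-NACK pattern such a protocol relies on; the loss model, round limit, and framing are placeholders of ours, and the full-duplex aspect is not modeled.

```python
import random

def transmit(window, loss_prob=0.3):
    """Simulate sending a window of packets over a lossy link and return
    the sequence numbers that did not arrive."""
    return [seq for seq in window if random.random() < loss_prob]

def send_reliably(packets, max_rounds=10):
    """NACK-based recovery: the receiver only reports missing sequence
    numbers, and the sender retransmits exactly those packets."""
    outstanding = list(range(len(packets)))
    for _ in range(max_rounds):
        if not outstanding:
            return True                     # everything delivered
        nacked = transmit(outstanding)      # receiver's NACK list
        outstanding = nacked                # retransmit only the losses
    return False

random.seed(1)
print(send_reliably([b"pkt"] * 16))
```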
Acoustic Backscatter Network for Vehicle Body-in-white
Weiguo Wang (NIO, Tsinghua University); Yuan He, Yadong Xie, Chuyue Xie (Tsinghua University); Yi Kai, Chengchen Hu (NIO)
Abstract: This paper presents a novel approach to monitor the structure of the Body in White (BiW), the fundamental metallic structure of a vehicle. Existing monitoring methods, including both wired and wireless sensor systems, face significant challenges due to integration complexity, weight considerations, material costs, and signal blockage within the metallic environment. To overcome these limitations, we introduce ArachNet, an acoustic backscatter network that leverages the conductive properties of the BiW to propagate vibration signals for energy transfer and data communication. This system comprises battery-free tags that harvest energy from BiW vibrations and use a backscatter technique for efficient communication, thereby eliminating the need for external power sources and reducing power consumption. We address key challenges such as power sufficiency for tag activation and sustained operation, and collision reduction in network communication, by designing an ultra-low-power backscatter tag and a distributed slot allocation protocol. We implement a prototype of ArachNet and deploy 12 tags onto the BiW of an electric SUV. The evaluation results show that the power consumption of the tag is 51.0 μW for transmission and 24.8 μW for reception. With our network protocol, the slot utility can be up to 81.2%.
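ArachNet coordinates many battery-free tags through a distributed slot-allocation protocol. The sketch below shows a generic randomized slot-claiming scheme in that spirit; the claim rule and parameters are illustrative assumptions, not the paper's protocol.

```python
import random

def allocate_slots(num_tags, num_slots, rounds=20, seed=0):
    """Each unassigned tag picks a random slot per round; a slot is
    claimed only if exactly one tag picked it (otherwise the contenders
    retry), mimicking distributed, coordination-free slot allocation."""
    rng = random.Random(seed)
    owner = {}                                   # slot -> tag
    unassigned = set(range(num_tags))
    for _ in range(rounds):
        picks = {}
        for tag in unassigned:
            slot = rng.randrange(num_slots)
            picks.setdefault(slot, []).append(tag)
        for slot, tags in picks.items():
            if slot not in owner and len(tags) == 1:
                owner[slot] = tags[0]
        unassigned -= set(owner.values())
        if not unassigned:
            break
    return owner, unassigned

owner, leftover = allocate_slots(num_tags=12, num_slots=16)
print(f"assigned {len(owner)} tags, {len(leftover)} still contending")
```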
13:30 — 15:10 | Distributed Systems
Session Chair: Simone Ferlin-Reiter
Low-Overhead Distributed Application Observation with DeepTrace: Achieving Accurate Tracing in Production Systems
Yantao Geng, Han Zhang, Zhiheng Wu (Tsinghua University); Yahui Li (Tsinghua University, China); Jilong Wang (Tsinghua University); Xia Yin (Tsinghua University)
Abstract: As microservices grow in scale and complexity, their operation and debugging become increasingly challenging. Even a single user request can involve interactions across hundreds of components. In such intricate systems, distributed tracing, which tracks the end-to-end execution flow of requests, has become a critical monitoring tool. Non-intrusive tracing frameworks, which do not require code modification, are particularly valued for their convenience. However, existing non-intrusive solutions either have limited applicability or lack sufficient accuracy under high concurrency. To address these challenges, we propose DeepTrace, a transaction-based, non-intrusive distributed tracing framework designed for microservices. DeepTrace leverages API endpoints and transaction fields embedded within request content to categorize requests into distinct transactions, thereby reducing the likelihood of incorrectly merging traces from different transactions. Compared to state-of-the-art frameworks, DeepTrace maintains an accuracy rate of over 95% even under high concurrency. It has also been adopted by dozens of companies in their production systems for tasks such as failure diagnosis and resource optimization.
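DeepTrace's key step is categorizing observed requests into transactions using the API endpoint plus transaction fields found in the request content, and only stitching together spans that share the same transaction. Below is a stripped-down sketch of that grouping step; the span structure and field names (e.g., order_id) are hypothetical.

```python
from collections import defaultdict

def transaction_key(span):
    """Key a span on its API endpoint plus a business-level transaction
    field pulled from the request content (e.g., an order id)."""
    return (span["endpoint"], span["body"].get("order_id"))

def group_into_traces(spans):
    """Merge spans into candidate traces only when they share a
    transaction key, avoiding the cross-transaction mis-merges that
    purely timing-based correlation suffers under high concurrency."""
    traces = defaultdict(list)
    for span in sorted(spans, key=lambda s: s["ts"]):
        traces[transaction_key(span)].append(span)
    return dict(traces)

spans = [
    {"ts": 1, "service": "gateway",  "endpoint": "/checkout", "body": {"order_id": "A7"}},
    {"ts": 2, "service": "payments", "endpoint": "/checkout", "body": {"order_id": "A7"}},
    {"ts": 2, "service": "payments", "endpoint": "/checkout", "body": {"order_id": "B3"}},
]
for key, trace in group_into_traces(spans).items():
    print(key, [s["service"] for s in trace])
```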
Hummingbird: Fast, Flexible, and Fair Inter-Domain Bandwidth Reservations
Karl Wüst, Giacomo Giuliari, Markus Legner, Jean-Pierre Smith (Mysten Labs); Marc Wyss, Jules Bachmann, Juan A. Garcia-Pardo (ETH Zurich); Adrian Perrig (Mysten Labs)
Abstract: Over the past decade, Internet centralization and its implications for privacy, resilience, and innovation have become a topic of active debate. While the networking community informally agrees on the definition of centralization, we lack a formal metric for quantifying it, which has limited in-depth analysis. In this work, we introduce a rigorous statistical metric for Internet centralization. In doing so, we also uncover how regionalization—geopolitical dependence on the Internet—fundamentally affects centralization. We argue that centralization and regionalization are intertwined forms of dependence that both affect the lived experiences of users and should be jointly studied. We develop a suite of statistical tools, which we use to better understand dependence across three layers of web infrastructure—hosting providers, DNS infrastructure, certificate authorities—in 150 countries. We hope that this statistical toolkit can serve as the foundation for future analysis of Internet behavior.
Inter-domain Routing with Extensible Criteria
Seyedali Tabaeiaghdaei (Anapaya Systems AG); Jelte van Bommel, Marc Wyss (ETH Zurich); João Luis Sobrinho (Instituto de Telecomunicações, Instituto Superior Técnico); Giovanni Barbiero (UBS AG); Giacomo Giuliari (Mysten Labs); Ahad N. Zehmakan (Australian National University); Adrian Perrig (ETH Zurich)
Abstract: With the rapid evolution and diversification of Internet applications, their communication-quality criteria are continuously evolving. To globally optimize communication quality, the Internet's control plane thus needs to optimize inter-domain paths on diverse criteria, and should provide extensibility for adding new criteria or modifying existing ones. However, existing inter-domain routing protocols and proposals satisfy these requirements at best to a limited degree.
We argue that an inter-domain routing architecture with extensible routing criteria can be realized in path-aware networks, due to their stateless forwarding. We thus propose IREC, an inter-domain routing architecture for the SCION path-aware Internet architecture that enables path optimization with extensible criteria. IREC achieves this through parallel execution and real-time addition of independent route computations, together enabling end domains to express their desired criteria to the control plane. We show IREC's viability via implementation and emulation, and its negligible global cost compared to static routing protocols through large-scale simulations with realistic Internet topologies.
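The extensibility IREC argues for amounts to letting end domains register new optimization criteria and run independent route computations per criterion. The minimal sketch below shows what such an extensible-criterion interface could look like; the class names, registry, and example criteria are our illustration, not IREC's actual API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A candidate inter-domain path is just its per-hop metrics here.
Path = List[Dict[str, float]]   # each hop: {"latency_ms": ..., "co2_g": ...}

@dataclass
class Criterion:
    """An extensible routing criterion: a name plus a scoring function
    (lower is better). New criteria can be registered without touching
    the route-computation loop."""
    name: str
    score: Callable[[Path], float]

REGISTRY: Dict[str, Criterion] = {}

def register(c: Criterion):
    REGISTRY[c.name] = c

register(Criterion("latency", lambda p: sum(h["latency_ms"] for h in p)))
register(Criterion("carbon",  lambda p: sum(h["co2_g"] for h in p)))

def best_path(paths: List[Path], criterion_name: str) -> Path:
    """Independent route computation for whichever criterion the end
    domain asked for."""
    return min(paths, key=REGISTRY[criterion_name].score)

paths = [
    [{"latency_ms": 20, "co2_g": 5}, {"latency_ms": 30, "co2_g": 1}],
    [{"latency_ms": 15, "co2_g": 9}, {"latency_ms": 45, "co2_g": 9}],
]
print(REGISTRY["latency"].score(best_path(paths, "latency")))
print(REGISTRY["carbon"].score(best_path(paths, "carbon")))
```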
Network Support For Scalable And High Performance Cloud Exchanges
Muhammad Haseeb (New York University); Jinkun Geng (Clockwork Systems, Inc); Daniel Duclos-Cavalcanti (Technical University of Munich); Xiyu Hao, Ulysses Butler (New York University); Radhika Mittal (University of Illinois Urbana-Champaign); Srinivas Narayana (Rutgers University); Anirudh Sivaraman (New York University)
Abstract: We present Onyx, a system for meeting the networking requirements of financial exchanges on the public cloud. Onyx uses an overlay tree to multicast market data from an exchange to 1000 participants with ≤ 1 μs difference in data reception time between any two participants, crucial for maintaining fair competition. Onyx reuses the same tree for scalable inbound communication (participants to exchange), introducing a scheduling policy to enhance an exchange’s throughput during periods of bursty traffic. It also presents a message sequencing mechanism that achieves globally ordered delivery of messages from participants to the exchange, i.e., the exchange sees participants’ messages in the order they are generated, which maintains fairness. Onyx achieves better scalability and ≈50% lower latency than the AWS multicast service [1]. Onyx outperforms an existing system, CloudEx [2], in terms of the number of supported participants, the exchange’s order-matching rate, and multicast latency. Onyx’s techniques can be applied to other existing systems (e.g., DBO) to enhance their performance.
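A common way to reason about a bounded reception-time difference is a hold-and-release schedule: data is released toward each participant early or late enough that all deliveries land at the same target instant. The toy calculation below illustrates that framing; the framing and numbers are ours, not necessarily Onyx's actual mechanism.

```python
def release_schedule(path_latencies_us, slack_us=20.0):
    """Compute per-participant hold times so that every participant
    receives the market data at the same target instant: the slowest
    path's latency plus some slack."""
    deadline = max(path_latencies_us.values()) + slack_us
    return deadline, {p: deadline - lat for p, lat in path_latencies_us.items()}

latencies = {"alice": 180.0, "bob": 240.0, "carol": 205.0}
deadline, holds = release_schedule(latencies)
print(f"target delivery at t+{deadline:.0f}us")
for participant, hold in holds.items():
    arrival = hold + latencies[participant]
    print(f"  {participant}: hold {hold:.0f}us -> arrives at t+{arrival:.0f}us")
```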
13:30 — 15:10 | Video & QoE
Session Chair: Dongsu Han
Towards User-level QoE: Large-scale Practice in Personalized Optimization of Adaptive Video Streaming
Lianchen Jia (Tsinghua University); Chao Zhou (Kuaishou); Chaoyang Li (Tsinghua University); Jiangchuan Liu (Simon Fraser University); Lifeng Sun (Tsinghua University)
Abstract: Traditional optimization methods based on system-wide Quality of Service (QoS) metrics have approached their performance limitations in modern large-scale streaming systems. However, aligning user-level Quality of Experience (QoE) with algorithmic optimization objectives remains an unresolved challenge. Therefore, we propose LingXi, the first large-scale deployed system for personalized adaptive video streaming based on user-level experience. LingXi dynamically optimizes the objectives of adaptive video streaming algorithms by analyzing user engagement. Utilizing exit rate as a key metric, we investigate the correlation between QoS indicators and exit rates based on production environment logs, subsequently developing a personalized exit rate predictor. Through Monte Carlo sampling and online Bayesian optimization, we iteratively determine optimal parameters. Large-scale A/B testing utilizing 8% of traffic on Kuaishou, one of the largest short video platforms, demonstrates LingXi's superior performance. LingXi achieves a 0.15% increase in total viewing time, a 0.1% improvement in bitrate, and a 1.3% reduction in stall time across all users, with particularly significant improvements for low-bandwidth users who experience a 15% reduction in stall time.
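LingXi tunes per-user streaming parameters against a learned exit-rate predictor using Monte Carlo sampling and online Bayesian optimization. The loop below sketches only the Monte Carlo half against a made-up predictor; both the parameterization and the predictor are illustrative assumptions, not LingXi's models.

```python
import random

def predicted_exit_rate(params, user):
    """Stand-in for a learned per-user exit-rate predictor: a user is
    more likely to exit when the target bitrate exceeds their bandwidth
    (stall risk) or sits far below it (quality boredom)."""
    gap = params["target_mbps"] - user["bandwidth_mbps"]
    return 0.08 * max(gap, 0.0) + 0.02 * max(-gap, 0.0)

def tune_parameters(user, samples=200, seed=0):
    """Monte Carlo tuning: sample candidate parameters and keep whichever
    minimizes the predicted exit rate for this specific user."""
    rng = random.Random(seed)
    best, best_score = None, float("inf")
    for _ in range(samples):
        candidate = {"target_mbps": rng.uniform(0.3, 8.0)}
        score = predicted_exit_rate(candidate, user)
        if score < best_score:
            best, best_score = candidate, score
    return best, best_score

print(tune_parameters({"bandwidth_mbps": 1.2}))
```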
TLadder: QoE-Centric Video Ladder Optimization with Playback Feedback at Billion Scale
Abstract: We operate one of the largest short video streaming platforms in the world, serving billions of global users every day. Similar to other platforms, we employ Adaptive Bitrate (ABR) streaming, which adaptively selects an appropriate representation from a group of video representations forming a bitrate ladder. In this paper, instead of developing yet another ABR algorithm, we optimize the bitrate ladder configuration, which constitutes the decision space of ABR, by developing a system called TLadder.
TLadder explicitly maximizes the quality-of-experience (QoE) that a ladder is expected to incur, through a principled optimization framework with polynomial complexity and provable optimality. TLadder’s optimization jointly considers the video content dimension (i.e., the bitrate-quality tradeoff of candidate representations) and the playback feedback dimension (e.g., network condition, re-buffering time, and playback bitrate). TLadder also introduces a high-fidelity simulator that provides the optimizer with accurate QoE estimation. We evaluate TLadder via a large-scale A/B test across 0.24 billion users. TLadder achieves a 6.2–20% reduction in rebuffering time, a 2.7–3.5 unit increase in visual quality (in an 85-100 unit range), and a 1–4% reduction in CDN traffic cost, compared to our previous bitrate laddering approach. It also outperforms existing state-of-the-art approaches based on extensive trace-driven simulation.
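TLadder frames laddering as a principled optimization with polynomial complexity over candidate representations and playback feedback. To give a flavor of that style of optimizer, the sketch below selects at most k rungs with a small dynamic program under a toy QoE model; the model, numbers, and additive structure are our assumptions, not TLadder's formulation.

```python
from functools import lru_cache

def optimize_ladder(candidates, bw_samples, k):
    """Choose at most k ladder rungs (bitrate_kbps, quality) maximizing a
    toy expected-QoE model: each viewer plays the highest chosen rung at
    or below their bandwidth, and gets 0 if even the lowest rung is too
    high. Runs as an O(n^2 * k) dynamic program over sorted candidates."""
    cands = sorted(candidates)
    n, total = len(cands), len(bw_samples)

    def frac(lo, hi):
        """Fraction of viewers whose bandwidth lies in [lo, hi)."""
        return sum(lo <= bw < hi for bw in bw_samples) / total

    @lru_cache(maxsize=None)
    def best(i, m):
        """(score, rungs) for bandwidths >= cands[i][0], with rung i
        chosen and up to m-1 further rungs allowed above it."""
        rate_i, qual_i = cands[i]
        top = (qual_i * frac(rate_i, float("inf")), (cands[i],))
        if m == 1:
            return top
        options = [top]
        for j in range(i + 1, n):
            score_up, rungs_up = best(j, m - 1)
            score = qual_i * frac(rate_i, cands[j][0]) + score_up
            options.append((score, (cands[i],) + rungs_up))
        return max(options)

    return max(best(i, k) for i in range(n))

bandwidths = [450, 700, 1200, 2500, 2500, 4000, 6000, 9000]   # kbps samples
candidates = [(400, 55), (800, 68), (1500, 78), (3000, 88), (6000, 95)]
score, ladder = optimize_ladder(candidates, bandwidths, k=3)
print(f"expected QoE {score:.1f} with ladder {ladder}")
```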
ACE: Sending Burstiness Control for High-Quality Real-time Communication
Xiangjie Huang, Jiayang Xu (Hong Kong University of Science and Technology); Haiping Wang, Hebin Yu (ByteDance); Sandesh Dhawaskar Sathyanarayana, Shu Shi (ByteDance); Zili Meng (Hong Kong University of Science and Technology)
Abstract: Real-time communications (RTC) require ultra-low latency while consistently preserving high quality. However, we observe that as content variability increases and RTT decreases, long-tail queuing latency emerges in the sender’s pacing queue between the video encoder and the network. This issue stems from a mismatch between the traffic pattern produced by the encoder and the pattern sent to the network. Specifically, the encoder produces bursty frames with highly fluctuating sizes. Existing approaches mitigate this by smoothing the bitrate via pacing, but this often necessitates a trade-off between video quality and latency. To address this, we propose a coordinated approach that manages both encoding complexity and sending pace. On the transmission side, we dynamically adjust the bucket size of a token-based pacer to control burstiness at the frame level. On the encoder side, we introduce an adaptive complexity mechanism that smooths frame sizes without sacrificing quality. Through trace-driven emulation and real-world experiments, our solution, ACE, reduces end-to-end 95th-percentile latency by up to 43% while maintaining superior video quality compared to state-of-the-art alternatives.
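On the transmission side, ACE controls burstiness by dynamically resizing the bucket of a token-based pacer. Below is a minimal token-bucket pacer with an adjustable burst size as a sketch of that knob; the rates, sizes, and API are illustrative, not ACE's implementation.

```python
import time

class TokenBucketPacer:
    """Token-bucket pacer: `rate_bps` smooths the outgoing bitrate while
    `bucket_bytes` (the burst size) can be adjusted per frame to cap how
    bursty a large encoded frame is allowed to be on the wire."""

    def __init__(self, rate_bps: float, bucket_bytes: float):
        self.rate = rate_bps / 8.0          # bytes per second
        self.bucket = bucket_bytes
        self.tokens = bucket_bytes
        self.last = time.monotonic()

    def set_burst(self, bucket_bytes: float):
        """Dynamic bucket sizing, e.g., shrink it before an oversized frame."""
        self.bucket = bucket_bytes
        self.tokens = min(self.tokens, bucket_bytes)

    def try_send(self, packet_bytes: int) -> bool:
        now = time.monotonic()
        self.tokens = min(self.bucket, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if packet_bytes <= self.tokens:
            self.tokens -= packet_bytes
            return True
        return False                        # caller queues and retries later

pacer = TokenBucketPacer(rate_bps=4_000_000, bucket_bytes=15_000)
print(pacer.try_send(1200), pacer.try_send(30_000))
```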
Harnessing WebRTC for Large-Scale Live Streaming
Wei Zhang (ByteDance); Tong Meng, Xianhua Zeng, Wei Yang, Changqing Yan, Chao Li, Chenguang Li (ByteDance Inc.); Feng Qian (ByteDance); Junfeng Yang (Hunan University of Technology and Business); Lei Zhang (Shenzhen University); Zhi Wang (Shenzhen International Graduate School, Tsinghua University)
Abstract: Live streaming that supports real-time interaction has become increasingly popular. To support the ensuing requirements on low end-to-end latency, RTM, the state-of-the-art live streaming system at Douyin, replaces the HTTP-FLV streaming protocol with WebRTC. To tailor the WebRTC stack to the live streaming scenario, we focus on optimizing first-frame delay, startup video rebuffering, audio-to-video drift, and per-session CPU usage. Those are the top-priority metrics identified from an importance analysis with respect to two user engagement metrics, i.e., viewer penetration and viewing time. To date, WebRTC-based streaming in RTM has been in operation for 4 years, and serves billions of viewer sessions every day. It dramatically optimizes QoE metrics (e.g., end-to-end latency reduced by 54.5%), and delivers statistically significant user engagement gains (e.g., number of paid orders increased by 0.8%). In this paper, we report our deployment experiences comprehensively.
Scalable Video Conferencing Using SDN Principles
Oliver Michel, Satadal Sengupta (Princeton University); Hyojoon Kim (University of Virginia); Ravi Netravali, Jennifer Rexford (Princeton University)
Abstract: Video-conferencing applications face an unwavering surge in traffic, stressing their underlying infrastructure in unprecedented ways. This paper rethinks the key building block for conferencing infrastructures: selective forwarding units (SFUs). SFUs relay and adapt media streams between participants and, today, run in software on general-purpose servers. Our main insight, discerned from dissecting the operation of production SFU servers, is that SFUs largely mimic traditional packet-processing operations such as dropping and forwarding. Guided by this, we present Scallop, an SDN-inspired SFU that decouples video-conferencing applications into a hardware-based data plane for latency-sensitive and frequent media operations, and a software control plane for the (infrequent) remaining tasks, such as analyzing feedback signals and session management. Scallop is a general design that is suitable for a variety of hardware platforms, including programmable switches and SmartNICs. Our Tofino-based implementation fully supports WebRTC and delivers 7-422× improved scaling over a 32-core commodity server, while reaping performance improvements by cutting forwarding-induced latency by 26×. We also present an implementation of Scallop on the BlueField-3 SmartNIC.
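Scallop's split rests on the observation that the per-packet work of an SFU is match-and-forward, which maps naturally to a switch data plane, while rule installation stays in software. The table-driven software sketch below captures that decomposition; the field names and simulcast policy are simplified assumptions of ours, and the paper's implementation targets Tofino and BlueField-3 rather than Python.

```python
from dataclasses import dataclass

@dataclass
class RtpPacket:
    stream_id: str        # which publisher's media stream
    spatial_layer: int    # simulcast layer carried by this packet
    payload: bytes

# "Control plane": infrequent decisions install per-subscriber rules,
# e.g. based on receiver bandwidth feedback.
forwarding_table = {
    # (stream_id, subscriber) -> highest simulcast layer to forward
    ("alice-cam", "bob"):   2,   # bob has headroom: full quality
    ("alice-cam", "carol"): 0,   # carol is congested: lowest layer only
}

def data_plane_forward(pkt: RtpPacket, subscribers):
    """'Data plane': a pure match-action decision per packet -- forward
    to subscribers whose installed rule admits this layer, drop otherwise."""
    return [
        sub for sub in subscribers
        if pkt.spatial_layer <= forwarding_table.get((pkt.stream_id, sub), -1)
    ]

pkt = RtpPacket("alice-cam", spatial_layer=1, payload=b"\x00" * 1200)
print(data_plane_forward(pkt, ["bob", "carol"]))   # ['bob']
```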