ACM SIGCOMM 2023, New York City, US

ACM SIGCOMM 2023 Program

The conference proceedings are available in the ACM DL.


  • Sunday, September 10, 2023

  • 8:00am-9:30am     Breakfast

    Location: Carleton Lounge

  • 9:00am-5:30pm     Workshops and Tutorials

    Please check the individual page for the detailed program of each workshop or tutorial. The workshops and tutorials will happen in the Seeley W. Mudd building on the Columbia University Morningside Heights campus. Directions to the building can be found here. Only the south entrance (4th floor, campus-side, shown in the map) will be open to workshop and tutorial participants. The entrance from 120th Street (1st floor) requires swipe access. There will be a registration desk directly behind the entrance. The building will be open between 8:00am and 9:00pm.

  • Workshops:
    5G-MeMU (Location: 545 Mudd), eBPF (Location: 233 Mudd), EMS (Location: 524 Mudd), FIRA (Location: 833 Mudd), IIoT-Nets (Location: 1024 Mudd), QuNets (Location: 1127 Mudd), SNIP2+ (Location: 825 Mudd), N2Women (Location: 451 CSB)

    ML4Nets (Location: 227 Mudd), FABRIC (Location: 633 Mudd), Design4Test (Location: 627 Mudd)

  • 6:00pm-9:00pm     Reception

    Location: Carleton Lounge

  • End of the day

  • Monday, September 11, 2023

  • 8:00am–10:00am     Breakfast

    Location: Lerner Hall

  • 8:30am–9:00am     Welcome Address

    Location: Lerner Hall

  • 9:00am–10:30am     Introduction and Keynote by Dina Katabi

    Location: Lerner Hall

  • 10:30am–10:45am     Break

  • 10:45am–11:55am     Technical Session 1: Water, Air, Blood

    Location: Lerner Hall

  • Enabling Long-Range Underwater Backscatter via Van Atta Acoustic Networks

    Aline Eid (Massachusetts Institute of Technology / University of Michigan), Jack Rademacher, Waleed Akbar, Purui Wang, Ahmed Allam, Fadel Adib (Massachusetts Institute of Technology)

    • Abstract: We present the design, implementation, and evaluation of Van Atta Acoustic Backscatter (VAB), a technology that enables long-range, ultra-low-power networking in underwater environments. At the core of VAB is a novel, scalable underwater backscatter architecture that bridges recent advances in RF backscatter (Van Atta architectures) with ultra-low-power underwater acoustic networks. Our design introduces multiple innovations across the networking stack, which enable it to overcome unique challenges that arise from the electro-mechanical properties of underwater backscatter and the challenging nature of low-power underwater acoustic channels. We implemented our design in an end-to-end system, and evaluated it in over 1,500 real-world experimental trials in a river and the ocean. Our evaluation demonstrates that VAB achieves a communication range that exceeds 300 m in round-trip backscatter across orientations (at a BER of 10^-3). We compared our design head-to-head with past state-of-the-art systems, demonstrating a 15× improvement in communication range at the same throughput and power. By realizing hundreds of meters of range in underwater backscatter, this paper presents the first practical system capable of coastal monitoring applications. Finally, our evaluation represents the first experimental validation of underwater backscatter in the ocean.


  • Enabling Ubiquitous WiFi Sensing with Beamforming Reports

    Artifacts Available

    Chenhao Wu, Xuan Huang (The Chinese University of Hong Kong), Jun Huang (City University of Hong Kong), Guoliang Xing (The Chinese University of Hong Kong)

    • Abstract: Wi-Fi sensing systems leverage wireless signals from widely deployed Wi-Fi devices to realize sensing for a broad range of applications. However, current Wi-Fi sensing systems heavily rely on channel state information (CSI) to learn the signal propagation characteristics, while the availability of CSI is highly dependent on specific Wi-Fi chipsets. Through a city-scale measurement, we discover that the availability of CSI is extremely limited in operational Wi-Fi devices. In this work, we propose a new wireless sensing system called BeamSense that exploits compressed beamforming reports (CBR). Due to the extensive support of transmit beamforming in operational Wi-Fi devices, CBR is commonly accessible and hence enables a ubiquitous sensing capability. BeamSense adopts a novel multi-path estimation algorithm that can efficiently and accurately map bidirectional CBR to a multi-path channel based on intrinsic fingerprints. We implement BeamSense on several prevalent models of Wi-Fi devices and evaluate its performance with microbenchmarks and three representative Wi-Fi sensing applications. The results show that BeamSense is capable of enabling existing CSI-based sensing algorithms to work with CBR with high sensing accuracy and improved generalizability.


  • Underwater 3D positioning on smart devices

    Artifacts Available

    Tuochao Chen, Justin Chan, Shyamnath Gollakota (University of Washington)

    • Abstract: The emergence of water-proof mobile and wearable devices (e.g., Garmin Descent and Apple Watch Ultra) designed for underwater activities like professional scuba diving, opens up opportunities for underwater networking and localization capabilities on these devices. Here, we present the first underwater acoustic positioning system for smart devices. Unlike conventional systems that use floating buoys as anchors at known locations, we design a system where a dive leader can compute the relative positions of all other divers, without any external infrastructure. Our intuition is that in a well-connected network of devices, if we compute the pairwise distances, we can determine the shape of the network topology. By incorporating orientation information about a single diver who is in the visual range of the leader device, we can then estimate the positions of all the remaining divers, even if they are not within sight. We address various practical problems including detecting erroneous distance estimates, addressing rotational and flipping ambiguities as well as designing a distributed timestamp protocol that scales linearly with the number of devices. Our evaluations show that our distributed system running on underwater deployments of 4-5 commodity smart devices can perform pairwise ranging and localization with median errors of 0.5-0.9 m and 0.9-1.6 m. Project page with code:


  • A Millimeter Wave Backscatter Network for Two-Way Communication and Localization

    Haofan Lu, Mohammad Hossein Mazaheri, Reza Rezvani, Omid Abari (UCLA)

    • Abstract: Millimeter wave (mmWave) technology enables wireless devices to communicate using very high-frequency signals. Operating at those frequencies provides larger bandwidth, which can be used to enable high-data-rate links and very accurate localization of devices. However, radios operating at high frequencies consume a significant amount of power, making them unsuitable for applications with limited energy sources. This paper presents MilBack, a backscatter network operating at mmWave bands. Backscattering is the most energy-efficient wireless communication technique, where nodes piggyback their data on an access point's signal instead of generating their own signals. Eliminating the need for signal generation significantly reduces the energy consumption of the nodes. In contrast to past mmWave backscatter work, which supports only uplink, MilBack is the first mmWave backscatter network that supports uplink, downlink, and accurate localization. MilBack addresses the key challenges that prevent existing backscatter networks from enabling both uplink and downlink at mmWave bands. We implemented MilBack and evaluated its performance empirically. Our results show that MilBack is capable of achieving accurate localization, uplink and downlink communication at up to 10 m while consuming only 32 mW and 18 mW, respectively.


  • Towards Practical and Scalable Molecular Networks

    Artifacts Available

    Jiaming Wang (UIUC), Sevda Öğüt, Haitham Al Hassanieh (EPFL), Bhuvana Krishnaswamy (University of Wisconsin-Madison)

    • Abstract: Molecular networks have the potential to enable bio-implants and biological nano-machines to communicate inside the human body. Molecular networks send and receive data between nodes by releasing molecules into the bloodstream. In this work, we explore how we can scale molecular networks from a single transmitter single receiver paradigm to multiple transmitters that can concurrently send data to a receiver. We identify unique challenges in enabling multiple access in molecular networks that prevent us from using standard multiple access protocols. These challenges include the lack of synchronization and feedback, the non-negativity of molecular signals, the extremely long tail of the molecular channel leading to high ISI (Inter-Symbol-Interference), and the limited types of molecules that can be used for communication. We present MoMA (Molecular Multiple Access), a protocol that enables a molecular network with multiple transmitters. We introduce packet detection, channel estimation, and encoding/decoding schemes that leverage the unique properties of molecular networks to address the above challenges. We evaluate MoMA on a synthetic experimental testbed and demonstrate that it can scale up to four transmitters while significantly outperforming the state-of-the-art.


  • 12:00pm–1:30pm     Lunch

  • 1:30pm–2:30pm     Technical Session 2: BGP Configuration

    Location: Lerner Hall

  • Taming the transient while reconfiguring BGP

    Artifacts Available

    Tibor Schneider, Roland Schmid (ETH Zürich), Stefano Vissicchio (University College London), Laurent Vanbever (ETH Zürich)

    • Abstract: BGP reconfigurations are a daily occurrence for most network operators, especially in large networks. Yet, performing safe and robust BGP reconfiguration changes is still an open problem. Few BGP reconfiguration techniques exist, and they are either (i) unsafe, because they ignore transient states, which can easily lead to invariant violations; or (ii) impractical, as they duplicate the entire routing and forwarding states, and require special hardware.

      In this paper, we introduce Chameleon, the first BGP reconfiguration framework capable of maintaining correctness throughout a reconfiguration campaign while relying on standard BGP functionalities and minimizing state duplication. Akin to concurrency coordination in distributed systems, Chameleon models the reconfiguration process with happens-before relations. This modeling allows us to capture the safety properties of transient BGP states. We then use this knowledge to precisely control the BGP route propagation and convergence, so that input invariants are provably preserved at any time during the reconfiguration.

      We fully implement Chameleon and evaluate it in both testbeds and simulations, on real-world topologies and large-scale reconfiguration scenarios. In most experiments, our system computes reconfiguration plans within a minute, and performs them from start to finish in a few minutes, with minimal overhead.


  • Lightyear: Using Modularity to Scale BGP Control Plane Verification

    Alan Tang (UCLA), Ryan Beckett (Microsoft Research), Steven Benaloh, Karthick Jayaraman, Tejas Patil (Microsoft), Todd Millstein, George Varghese (UCLA)

    • Abstract: Current network control plane verification tools cannot scale to large networks because of the complexity of jointly reasoning about the behaviors of all network nodes. We present a modular approach to control plane verification, where end-to-end network properties are verified via a set of purely local checks on individual nodes and edges. The approach targets verification of reachability properties for BGP configurations, and provides guarantees in the face of arbitrary external route announcements and, for some properties, arbitrary node/link failures. We have proven the approach correct and implemented it in a tool Lightyear. Experimentally we show Lightyear scales dramatically better than prior control plane verifiers. Further, Lightyear has been used for six months to verify properties of a major cloud provider network containing hundreds of routers and tens of thousands of edges, finding and fixing bugs in the process. To our knowledge no prior control-plane verification tool has been shown to scale to that size and complexity. Our modular approach also makes it easy to localize configuration errors and enables incremental re-verification.


  • TENSOR: Lightweight BGP Non-Stop Routing

    Congcong Miao (Tencent), Yunming Xiao (Northwestern University), Marco Canini (KAUST), Ruiqiang Dai, Shengli Zheng (Tencent), Jilong Wang (Tsinghua University, Quancheng Laboratory), Jiwu Bu (Tencent), Aleksandar Kuzmanovic (Northwestern University), Yachen Wang (Tencent)

    • Abstract: As the solitary inter-domain protocol, BGP plays an important role in today’s Internet. Its failures threaten network stability and will usually result in large-scale packet losses. Thus, the non-stop routing (NSR) capability that protects inter-domain connectivity from being disrupted by various failures, is critical to any Autonomous System (AS) operator. Replicating the BGP and underlying TCP connection status is key to realizing NSR. But existing NSR solutions, which heavily rely on OS kernel modifications, have become impractical due to providers’ adoption of virtualized network gateways for better scalability and manageability.

      In this paper, we tackle this problem by proposing TENSOR, which incorporates a novel kernel-modification-free replication design and lightweight architecture. More concretely, the kernel-modification-free replication design mitigates the reliance on OS kernel modification and hence allows the virtualization of the network gateway. Meanwhile, lightweight virtualization provides strong performance guarantees and improves system reliability. Moreover, TENSOR provides a solution to the split-brain problem that affects NSR solutions. Through extensive experiments, we show that TENSOR realizes NSR while bearing little overhead compared to open-source BGP implementations. Further, our two-year operational experience on a fleet of 400 servers controlling over 31,000 BGP peering connections demonstrates that TENSOR reduces the development, deployment, and maintenance costs significantly – at least by factors of 20, 5, and 10, respectively, while retaining the same SLA with the NSR-enabled routers.


  • Lessons from the Evolution of the Batfish Configuration Analysis Tool

    Matt Brown, Ari Fogel, Daniel Halperin, Victor Heorhiadi (Intentionet), Ratul Mahajan (Intentionet, University of Washington), Todd Millstein (Intentionet, UCLA)

    • Abstract: Batfish is a tool to analyze network configurations and forwarding. It has evolved from a research prototype to an industrial-strength product, guided by scalability, fidelity, and usability challenges encountered when analyzing complex, real-world networks. We share key lessons from this evolution, including how Datalog had significant limitations when generating and analyzing forwarding state and how binary decision diagrams (BDDs) proved highly versatile. We also describe our new techniques for addressing real-world challenges, which increase Batfish performance by three orders of magnitude and enable high-fidelity analysis of networks with thousands of nodes within minutes.


  • 2:30pm–2:45pm     Break

  • 2:45pm–3:45pm     Technical Session 3: Well Tested

    Location: Lerner Hall

  • P4Testgen: An Extensible Test Oracle For P4-16

    Artifacts Available

    Fabian Ruffy (New York University), Jed Liu (Postman), Prathima Kotikalapudi, Vojtech Havel, Hanneli Tavante (Intel), Rob Sherwood (, Vladyslav Dubina, Volodymyr Peschanenko (Litsoft), Anirudh Sivaraman (New York University), Nate Foster (Cornell University)

    • Abstract: We present P4Testgen, a test oracle for the P4-16 language. P4Testgen supports automatic test generation for any P4 target and is designed to be extensible to many P4 targets. It models the complete semantics of the target's packet-processing pipeline including the P4 language, architectures and externs, and target-specific extensions. To handle non-deterministic behaviors and complex externs (e.g., checksums and hash functions), P4Testgen uses taint tracking and concolic execution. It also provides path selection strategies that reduce the number of tests required to achieve full coverage.

      We have instantiated P4Testgen for the V1model, eBPF, PNA, and Tofino P4 architectures. Each extension required effort commensurate with the complexity of the target. We validated the tests generated by P4Testgen by running them across the entire P4C test suite as well as the programs supplied with the Tofino P4 Studio. Using the tool, we have also confirmed 25 bugs in mature, production toolchains for BMv2 and Tofino.


  • Beyond a Centralized Verifier: Scaling Data Plane Checking via Distributed, On-Device Verification

    Artifacts Available

    Qiao Xiang, Chenyang Huang, Ridi Wen, Yuxin Wang, Xiwen Fan (Xiamen University), Zaoxing Liu (University of Maryland), Linghe Kong (Shanghai Jiao Tong University), Dennis Duan (Yale University), Franck Le (IBM Research), Wei Sun (University of Texas at Austin)

    • Abstract: Centralized data plane verification (DPV) faces significant scalability issues in large networks (i.e., the verifier being a performance bottleneck and single point of failure and requiring a reliable management network). In this paper, we tackle the scalability challenge of DPV by introducing Tulkun, a distributed, on-device DPV framework. Our key insight is that DPV can be transformed into a counting problem on a directed acyclic graph, which can be naturally decomposed into lightweight tasks executed at network devices, enabling fast data plane checking in networks of various scales and types. With this insight, Tulkun consists of (1) a declarative invariant specification language, (2) a planner that employs a novel data structure DVNet to systematically decompose global verification into on-device counting tasks, (3) a distributed verification messaging (DVM) protocol that specifies how on-device verifiers efficiently communicate task results to jointly verify the invariants, and (4) a mechanism to verify invariant fault-tolerance with minimal involvement of the planner. Extensive experiments with real-world datasets (WAN/LAN/DC) show that Tulkun verifies a real, large DC in less than 41 seconds while other tools need several minutes or up to tens of hours, and shows an up to 2355× speed up on 80% quantile of incremental verification with small overheads on commodity network devices.
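
      The counting view mentioned in the abstract can be illustrated with a toy example. The sketch below is not Tulkun's DVNet or DVM protocol; it only shows, under the assumption of a loop-free forwarding graph, that the number of paths from a node to a destination is the sum of the counts of its next hops, so each node needs only its successors' results. The function name `count_paths` and the `next_hops` mapping are hypothetical.

```python
from functools import lru_cache


def count_paths(next_hops, src, dst):
    """Count distinct forwarding paths from src to dst in a DAG.

    next_hops maps each node to the list of nodes it forwards to.
    The graph is assumed to be acyclic; a forwarding loop would make
    this recursion run forever.
    """

    @lru_cache(maxsize=None)
    def count(node):
        # The destination contributes exactly one (empty) path.
        if node == dst:
            return 1
        # Every other node sums the counts reported by its next hops.
        return sum(count(n) for n in next_hops.get(node, ()))

    return count(src)


# Hypothetical four-node topology: S reaches D via A, via B, or via B then A.
topology = {"S": ["A", "B"], "A": ["D"], "B": ["A", "D"], "D": []}
paths = count_paths(topology, "S", "D")
```

      In this toy model, a reachability invariant holds when the count at the source is at least one, and a count of zero signals a black hole.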


  • DONS: Fast and Affordable Discrete Event Network Simulation with Automatic Parallelization

    Artifacts Available

    Kaihui Gao (Tsinghua University), Li Chen (Zhongguancun Laboratory), Dan Li (Tsinghua University), Vincent Liu (University of Pennsylvania), Xizheng Wang (Tsinghua University), Ran Zhang (Zhongguancun Laboratory), Lu Lu (China Mobile Research Institute)

    • Abstract: Discrete Event Simulation (DES) is an essential tool for network practitioners. Unfortunately, existing DES simulators cannot achieve satisfactory performance at the scale of modern networks. Recent work has attempted to address these challenges by reducing the traffic processed via novel approximation techniques; however, we argue in this paper that much of the slowdown of existing DES simulators is due to their underlying software architecture. Using ideas from high-throughput simulation of virtual worlds in gaming, this paper presents a fundamental redesign of a DES network simulator, DONS, that marries domain-specific aspects of packet-level network simulation with recent advances in data-oriented design. DONS can automatically parallelize simulation within and across servers to achieve high core utilization, low cache miss rate, and high memory efficiency. On a relatively weak ARM-based laptop (MacBook Air (M1, 2020)), DONS can simulate one second of a 100 Gbps, 1024-server data center in 22 minutes (a speedup of 21× compared to OMNeT++). On a cluster of CPU-based servers, DONS can achieve a speedup of 65×, matching the order of magnitude of recent GPU-accelerated deep learning performance estimators, but without any loss of accuracy.


  • Hydra: Effective Runtime Network Verification

    Artifacts Available

    Sundararajan Renganathan (Stanford University), Benny Rubin (Cornell University), Hyojoon Kim (Princeton University), Pier Luigi Ventre, Carmelo Cascone, Daniele Moro, Charles Chan (Intel), Nick McKeown (Stanford University), Nate Foster (Cornell University)

    • Abstract: It is notoriously difficult to verify that a network is behaving as intended, especially at scale. This paper presents Hydra, a system that uses ideas from runtime verification to check that every packet is correctly processed with respect to a specification in real time. We propose a domain-specific language for writing properties, called Indus, and we develop a compiler that turns properties thus specified into executable P4 code that runs alongside the forwarding code at line rate. To evaluate our approach, we used Indus to model a range of properties, showing that it is expressive enough to capture examples studied in prior work. We also deployed Hydra checkers for validating paths in source routing and for enforcing slice isolation in Aether, an open-source cellular platform. We confirmed a subtle bug in Aether’s 5G mobile core that would have been hard to detect using static techniques. We also evaluated the overheads of Hydra on hardware, finding that it does not significantly increase latency and often does not require additional pipeline stages.


  • 3:45pm–4:15pm     Break

  • 4:15pm–5:15pm     Technical Session 4: Well Optimized

    Location: Lerner Hall

  • NetClone: Fast, Scalable, and Dynamic Request Cloning for Microsecond-Scale RPCs

    Artifacts Available

    Gyuyeong Kim (Sungshin Women's University)

    • Abstract: Spawning duplicate requests, called cloning, is a powerful technique to reduce tail latency by masking service-time variability. However, traditional client-based cloning is static and harmful to performance under high load, while a recent coordinator-based approach is slow and not scalable. Both approaches are insufficient to serve modern microsecond-scale Remote Procedure Calls (RPCs). To this end, we present NetClone, a request cloning system that performs cloning decisions dynamically within nanoseconds at scale. Rather than the client or the coordinator, NetClone performs request cloning in the network switch by leveraging the capability of programmable switch ASICs. Specifically, NetClone replicates requests based on server states and blocks redundant responses using request fingerprints in the switch data plane. To realize the idea while satisfying the strict hardware constraints, we address several technical challenges when designing a custom switch data plane. NetClone can be integrated with emerging in-network request schedulers like RackSched. We implement a NetClone prototype with an Intel Tofino switch and a cluster of commodity servers. Our experimental results show that NetClone can improve the tail latency of microsecond-scale RPCs for synthetic and real-world application workloads and is robust to various system conditions.


  • BMW Tree: Large-scale, High-throughput and Modular PIFO Implementation using Balanced Multi-Way Sorting Tree

    Ruyi Yao, Zhiyu Zhang, Gaojian Fang (Fudan University), Peixuan Gao (New York University), Sen Liu (Fudan University), Yibo Fan (State Key Laboratory of ASIC and System, Fudan University), Yang Xu (Fudan University, Peng Cheng Laboratory), H. Jonathan Chao (New York University)

    • Abstract: Push-In-First-Out (PIFO) queue has been extensively studied as a programmable scheduler. To achieve accurate, large-scale, and high-throughput PIFO implementation, we propose the Balanced Multi-way (BMW) Sorting Tree for real-time packet sorting. The tree is highly modularized, insertion-balanced and pipeline-friendly with autonomous nodes.

      Based on it, we design two simple and efficient hardware designs. The first one is a register-based (R-BMW) scheme. With a pipeline, it features an impressively high and stable throughput without any frequency reduction theoretically even under more levels. We then propose Ranking Processing Units to drive the BMW-Tree (RPU-BMW) to improve the scalability, where nodes are stored in SRAMs and dynamically loaded into/off from RPUs. As the capacity of BMW-Tree grows exponentially, only a few RPUs are needed for a large scale.

      The evaluation shows that when deployed on the Xilinx Alveo U200 card, R-BMW improves the throughput by 4.8× compared to the original PIFO implementation, while exhibiting a similar capacity. RPU-BMW is synthesized in the GlobalFoundries 28 nm process, costing a modest 0.522% (1.043 mm^2) chip area and 0.57 MB off-chip memory to support 87k flows at 200 Mpps. To the best of our knowledge, RPU-BMW is the first accurate PIFO implementation supporting over 80k flows at as fast as 200 Mpps.
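
      For readers unfamiliar with the abstraction, the following minimal sketch illustrates PIFO semantics only: an element is pushed in at a position determined by its rank and is always popped from the head. It uses a binary heap from the Python standard library as a software stand-in and does not model the BMW-Tree hardware design described above. The class name `PifoQueue`, its methods, and the choice to reject pushes when full are assumptions made for illustration.

```python
import heapq
import itertools


class PifoQueue:
    """Software model of a PIFO: push by rank, pop the smallest rank first."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._heap = []
        # A monotonically increasing sequence number keeps arrival (FIFO)
        # order among packets that share the same rank.
        self._seq = itertools.count()

    def push(self, rank, packet):
        """Insert a packet; return False if the queue is already full."""
        if len(self._heap) >= self.capacity:
            return False
        heapq.heappush(self._heap, (rank, next(self._seq), packet))
        return True

    def pop(self):
        """Remove and return (rank, packet) at the head, or None if empty."""
        if not self._heap:
            return None
        rank, _, packet = heapq.heappop(self._heap)
        return rank, packet


def drain_order(arrivals, capacity=8):
    """Push all (rank, packet) arrivals, then return packets in pop order."""
    queue = PifoQueue(capacity)
    for rank, packet in arrivals:
        queue.push(rank, packet)
    order = []
    while (item := queue.pop()) is not None:
        order.append(item[1])
    return order


order = drain_order([(5, "a"), (1, "b"), (5, "c"), (3, "d")])
```

      Here the packets leave in rank order, with "a" ahead of "c" because both have rank 5 and "a" arrived first.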


  • BitSense: Universal and Nearly Zero-Error Optimization for Sketch Counters with Compressive Sensing

    Artifacts Available

    Rui Ding, Shibo Yang, Xiang Chen, Qun Huang (Peking University)

    • Abstract: Sketch algorithms have been widely deployed for network measurement as they achieve high accuracy with restricted resource usage. They store measurement results compactly in fixed-size counters. However, as sketch counters are skewed towards low values, higher bits in most counters remain zero. Such massive unused bits impair the space efficiency valued by sketch algorithms. Unfortunately, efforts to mitigate the issue either apply to specific algorithms or compromise accuracy. In this paper, we design BitSense, a novel optimization framework that integrates with existing sketch algorithms. The key idea is to regard higher bits in sketch counters as a sparse vector and leverage compressive sensing techniques to compress and restore counters. Further, BitSense provides a programming model to help developers easily realize sketch algorithms without dealing with the details of compression and recovery. BitSense proposes an automatic approach for parameter configuration. It theoretically guarantees nearly zero error under the configuration. We have built a BitSense prototype in P4 and a software platform and integrated it with fourteen sketch solutions. Extensive experiments show that BitSense significantly reduces the memory usage of existing sketch solutions by 25%-80% while incurring little overhead and almost zero accuracy drop, outperforming five state-of-the-art optimization frameworks.
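
      The observation that higher bits in most counters remain zero can be made concrete with a small example. The sketch below only splits counters into low and high parts and measures how sparse the high parts are; it does not implement compressive sensing or any part of BitSense itself. The function names and the sample counter values are hypothetical.

```python
def split_counters(counters, low_bits):
    """Split each counter into its low `low_bits` bits and the remaining high bits."""
    mask = (1 << low_bits) - 1
    low = [c & mask for c in counters]
    high = [c >> low_bits for c in counters]
    return low, high


def zero_fraction(values):
    """Fraction of entries that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


# Hypothetical skewed counters: most flows are small, one is large.
counters = [3, 7, 300, 1]
low, high = split_counters(counters, low_bits=8)
sparsity = zero_fraction(high)
```

      With these values, three of the four high parts are zero, so the high-bit vector is sparse, which is the property a compression scheme could exploit.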


  • NeoBFT: Accelerating Byzantine Fault Tolerance Using Authenticated In-Network Ordering

    Artifacts Available

    Guangda Sun, Mingliang Jiang, Xin Zhe Khooi, Yunfan Li, Jialin Li (National University of Singapore)

    • Abstract: Mission critical systems deployed in data centers today are facing more sophisticated failures. Byzantine fault tolerant (BFT) protocols are capable of masking these types of failures, but are rarely deployed due to their performance cost and complexity. In this work, we propose a new approach to designing high performance BFT protocols in data centers. By re-examining the ordering responsibility between the network and the BFT protocol, we advocate a new abstraction offered by the data center network infrastructure. Concretely, we design a new authenticated ordered multicast primitive (AOM) that provides transferable authentication and non-equivocation guarantees. Feasibility of the design is demonstrated by two hardware implementations of AOM – one using HMAC and the other using public key cryptography for authentication – on new-generation programmable switches. We then co-design a new BFT protocol, NeoBFT, that leverages the guarantees of AOM to eliminate cross-replica coordination and authentication in the common case. Evaluation results show that NeoBFT outperforms state-of-the-art protocols on both latency and throughput metrics by a wide margin, demonstrating the benefit of our new network ordering abstraction for BFT systems.


  • 5:30pm–6:00pm     Best of CCR


  • Who squats IPv4 Addresses?

    Loqman Salamatian (Columbia University), Todd Arnold (Army Cyber Institute, West Point), Italo Cunha (Universidade Federal de Minas Gerais), Jiangchen Zhu, Yunfan Zhang, Ethan Katz-Bassett, Matt Calder (Columbia University)

    • Abstract: To mitigate IPv4 exhaustion, IPv6 provides expanded address space, and NAT allows a single public IPv4 address to suffice for many devices assigned private IPv4 address space. Even though NAT has greatly extended the shelf-life of IPv4, some networks need more private IPv4 space than what is officially allocated by IANA due to their size and/or network management practices. Some of these networks resort to using squat space, a term the network operations community uses for large public IPv4 address blocks allocated to organizations but historically never announced to the Internet. While squatting of IP addresses is an open secret, it introduces ethical, legal, and technical problems. In this work we examine billions of traceroutes to identify thousands of organizations squatting. We examine how they are using it and what happened when the US Department of Defense suddenly started announcing what had traditionally been squat space. In addition to shining light on a dirty secret of operational practices, our paper shows that squatting distorts common Internet measurement methodologies, which we argue have to be re-examined to account for squat space.


  • The Packet Number Space Debate in Multipath QUIC

    Quentin De Coninck (UCLouvain)

    • Abstract: With a standardization process that attracted much interest, QUIC can be seen as the next general-purpose transport protocol. Still, it does not provide true multipath support yet, missing some use cases that Multipath TCP addresses. To fill that gap, the IETF recently adopted a Multipath proposal merging several proposed designs. While it focuses on its core components, there still remains one major design issue: the number of packet number spaces that should be used. This paper provides experimental results with two different Multipath QUIC implementations based on NS3 simulations to understand the impact of using one packet number space per path or a single packet number space for the whole connection. Our results show that using one packet number space per path makes Multipath QUIC more resilient to the receiver's heuristics to acknowledge packets and detect duplicates.


  • 6:30pm–9:00pm     Student Dinner

    Location: Low Rotunda

  • End of the day

  • Tuesday, September 12, 2023

  • 8:00am–10:00am     Breakfast

    Location: Lerner Hall

  • 8:30am–9:30am     Technical Session 5: Congestion Control

    Location: Lerner Hall

  • Computers Can Learn from the Heuristic Designs and Master Internet Congestion Control

    Artifacts Available

    Chen-Yu Yen (New York University), Soheil Abbasloo (University of Toronto), H. Jonathan Chao (New York University)

    • Abstract: In this work, for the first time, we demonstrate that computers can automatically learn from observing the heuristic efforts of the last four decades, stand on the shoulders of the existing Internet congestion control (CC) schemes, and discover a better-performing one. To that end, we address many different practical challenges, from how to generalize representation of various existing CC schemes to serious challenges regarding learning from a vast pool of policies in the complex CC domain and introduce Sage. Sage is the first purely data-driven Internet CC design that learns a better scheme by harnessing the existing solutions. We compare Sage's performance with the state-of-the-art CC schemes through extensive evaluations on the Internet and in controlled environments. The results indicate that Sage has learned a better-performing policy. While there are still many unanswered questions, we hope our data-driven framework can pave the way for a more sustainable design strategy.


  • Host Congestion Control

    Saksham Agarwal (Cornell University), Arvind Krishnamurthy (Google/University of Washington), Rachit Agarwal (Cornell University)

    • Abstract: The conventional wisdom in systems and networking communities is that congestion happens primarily within the network fabric. However, adoption of high-bandwidth access links and relatively stagnant technology trends for resources within hosts have led to emergence of host congestion—that is, congestion within the host network that enables data exchange between NIC and CPU/memory. Such host congestion alters the many assumptions entrenched within decades of research and practice of congestion control.

      We present hostCC, a congestion control architecture to handle both host and network fabric congestion. hostCC embodies three key ideas. First, in addition to congestion signals that originate within the network fabric, hostCC collects host congestion signals that capture the precise time, location, and reason for host congestion. Second, hostCC introduces a sub-RTT granularity host-local congestion response that uses congestion signals to allocate host resources between network traffic and host-local traffic. Finally, hostCC uses both host and network congestion signals to allocate network resources at an RTT granularity.

      We realize hostCC within the Linux network stack. Our hostCC implementation requires no modifications in applications, host hardware, and/or network hardware; moreover, it can be integrated with existing congestion control protocols to handle both host and network fabric congestion. Evaluation of Linux DCTCP with and without hostCC suggests that, in the presence of host congestion, hostCC significantly reduces queueing and packet drops at the host, resulting in improved performance of networked applications in terms of throughput and tail latency.


  • Masking Corruption Packet Losses in Datacenter Networks with Link-local Retransmission

    Artifacts Available

    Raj Joshi (National University of Singapore), Cha Hwan Song (School of Computing, National University of Singapore), Xin Zhe Khooi (National University of Singapore), Nishant Budhdev (Nokia Bell Labs), Ayush Mishra (National University of Singapore), Mun Choon Chan (School of Computing, National University of Singapore), Ben Leong (National University of Singapore)

    • Abstract: Packet loss due to link corruption is a major problem in large warehouse-scale datacenters. The current state-of-the-art approach of disabling corrupting links is not adequate because, in practice, not all corrupting links can be disabled due to capacity constraints. In this paper, we show that it is feasible to implement link-local retransmission at sub-RTT timescales to completely mask corruption packet losses from the transport endpoints. Our system, LinkGuardian, employs a range of techniques to (i) keep the packet buffer requirement low, (ii) recover from tail packet losses without employing timeouts, and (iii) preserve packet ordering. We implement LinkGuardian on the Intel Tofino switch and show that for a 100G link with a loss rate of 10⁻³, LinkGuardian can reduce the loss rate by up to 6 orders of magnitude while incurring only an 8% reduction in effective link speed. By eliminating tail packet losses, LinkGuardian improves the 99.9th percentile flow completion time (FCT) for TCP and RDMA by 51x and 66x respectively. Finally, we also show that in the context of datacenter networks, simple out-of-order retransmission is often sufficient to significantly mitigate the impact of corruption packet loss for short TCP flows.


  • Augmented Queue: A Scalable In-Network Abstraction for Data Center Network Sharing

    Xinyu Crystal Wu, Zhuang Wang, Weitao Wang, T. S. Eugene Ng (Rice University)

    • Abstract: Traffic aggregates in cloud data center networks are by and large buffered and transmitted by simple physical FIFO queues. Despite the crucial role they play, a well-known problem of physical FIFO queues is that they are unable to provide precise bandwidth guarantees. This leads to a range of negative impacts spanning the application layer, the transport layer, and the data link layer.

      In this paper, we address this problem with Augmented Queue (AQ), a scalable in-network abstraction that provides precise bandwidth guarantees for traffic constituents. AQ serves multiple valuable use cases in data center networks. For example, AQ facilitates the isolation of traffic from different applications; ensures that different congestion control algorithms can properly co-exist; and enforces inbound and outbound bandwidth for virtual machines. We demonstrate via testbed and simulation experiments that AQ can provide precise bandwidth guarantees and scale to millions of traffic constituents.


  • 9:30am–9:45am     Break

  • 9:45am–10:55am     Technical Session 6: Traffic Engineering

    Slack channel     Location: Lerner Hall

  • FlexWAN: Software Hardware Co-design for Cost-Effective and Resilient Optical Backbones

    Artifacts Available

    Congcong Miao (Tencent), Zhizhen Zhong (Massachusetts Institute of Technology), Ying Zhang (Meta), Kunling He, Fangchao Li, Minggang Chen, Yiren Zhao, Xiang Li, Zekun He, Xianneng Zou (Tencent), Jilong Wang (Tsinghua University, Quancheng Laboratory)

    • Abstract: The rising demand for WAN capacity driven by the rapid growth of inter-data center traffic poses new challenges for costly optical networks. Today, cloud providers rely on fixed optical backbones, where all hardware devices operate on a rigid spectrum grid, leading to the waste of expensive optical resources and subpar performance in handling failures. In this paper, we introduce FlexWAN, a novel flexible WAN infrastructure designed to provision cost-effective WAN capacity while ensuring resilience to optical failures. FlexWAN achieves this by incorporating spacing-variable hardware at the optical layer, enabling the generated wavelength to optimize the utilization of limited spectrum resources for WAN capacity. The configuration of spacing-variable hardware in a multi-vendor optical backbone presents challenges related to spectrum management. To address this, FlexWAN leverages a centralized controller to achieve the coordinated control of network-wide optical devices in a vendor-agnostic manner. Moreover, the flexibility at the optical layer introduces new algorithmic problems. FlexWAN formulates the problem of provisioning WAN capacity with the goal of minimizing hardware costs. We evaluate the system performance in production and share insights from years of production experience. Compared to the existing optical backbone, FlexWAN can save at least 57% of transponders and reduce spectrum usage by 36% while continuing to meet up to 8× the present-day demands using existing hardware and fiber deployments. FlexWAN further incorporates failure resilience that revives 15% more bandwidth capacity in the overloaded optical backbone.


  • Hose-based cross-layer backbone network design with Benders decomposition

    John P. Eason, Xueqi He, Richard Cziva, Max Noormohammadpour, Srivatsan Balasubramanian, Satyajeet Singh Ahuja, Biao Lu (Meta Platforms, Inc)

    • Abstract: Network design is the process of dimensioning IP capacity over an optical network infrastructure to satisfy a given set of demands and reliability constraints. Specifically, we consider the problem of hose-based cross-layer network design, which seeks to find a minimum cost design that is able to route demand for all hose traffic matrices under all specified failure states. While most network design problems are solved as Mixed Integer Programs, a commercial solver can become intractable due to the scale of today's networks. We demonstrate how the classic Benders decomposition algorithm can be applied and improved for this problem and discuss practical implementation aspects. We showcase a horizontally scalable distributed framework to leverage the decomposable problem structure and solve millions of linear programs in a distributed manner, thereby making the network design problem tractable. In contrast to the conventional approach where failure states and traffic matrices are planned sequentially, the Benders algorithm finds global optimal designs across all traffic matrices and failure states. This leads to network designs with improved solution quality and reliability, with 20-30% less IP capacity and spectrum consumption, 50% less link augments and up to 20x faster runtime that enables design for hyper scale networks in a matter of hours.


  • EBB: Reliable and Evolvable Express Backbone Network in Meta

    Marek Denis, Yuanjun Yao, Ashley Hatch, Qin Zhang, Chiun Lin Lim, Shuqiang Zhang, Kyle Sugrue, Henry Kwok, Mikel Jimenez Fernandez, Petr Lapukhov, Sandeep Hebbani, Gaya Nagarajan, Omar Baldonado (Meta), Lixin Gao (UMass Amherst/Meta), Ying Zhang (Meta)

    • Abstract: We present the design, implementation, evaluation, deployment and production experiences of EBB (Express BackBone), a private WAN (Wide Area Network) connecting Meta's global data centers (DCs). Initiated in 2015, EBB now carries 100% of DC-DC traffic, witnessing remarkable growth over the years. A key design aspect of EBB is its multi-plane architecture, facilitating seamless deployment of a new control plane while ensuring operational simplicity. This architecture allows for efficient failure mitigation, standard maintenance, and capacity expansion by draining one or two planes without impacting service level objectives (SLOs). Another critical design decision is the hybrid model, combining distributed control agents and a central controller. EBB's centralized traffic engineering utilizes an MPLS-TE based solution to allocate paths periodically for different traffic classes based on service requirements, while its distributed control agents enable fast local failure recovery by pre-installing pre-computed backup paths in the data plane. We delve into our eight-year production experience, highlighting the successful deployment of multiple generations of EBB.


  • PAINTER: Ingress Traffic Engineering and Routing for Enterprise Cloud Networks

    Thomas Koch, Shuyue Yu (Columbia University), Sharad Agarwal (Microsoft), Ethan Katz-Bassett (Columbia University), Ryan Beckett (Microsoft Research)

    • Abstract: Enterprises increasingly use public cloud services for critical business needs. However, Internet protocols force clouds to choose between high availability and performance, reducing the speed at which clouds can respond to network problems, the range of solutions they can provide, and deployment resilience. To overcome this limitation, we present PAINTER, a system that takes control over which routes are available and which are chosen to the cloud by leveraging edge proxies. PAINTER efficiently advertises BGP prefixes, exposing more concurrent routes than existing solutions to improve latency and resilience. Compared to existing solutions, PAINTER reduces path inflation by 75% while using a third of the prefixes of other solutions, avoids 20% more path failures, and chooses ingresses from the edge at finer time (RTT) and traffic (per-flow) granularities, enhancing our agility.


  • Teal: Learning-Accelerated Optimization of WAN Traffic Engineering

    Artifacts Available

    Zhiying Xu (Harvard University), Francis Y. Yan (Microsoft Research), Rachee Singh, Justin T. Chiu, Alexander M. Rush (Cornell University), Minlan Yu (Harvard University)

    • Abstract: The rapid expansion of global cloud wide-area networks (WANs) has posed a challenge for commercial optimization engines to efficiently solve network traffic engineering (TE) problems at scale. Existing acceleration strategies decompose TE optimization into concurrent subproblems but realize limited parallelism due to an inherent tradeoff between run time and allocation performance.

      We present Teal, a learning-based TE algorithm that leverages the parallel processing power of GPUs to accelerate TE control. First, Teal designs a flow-centric graph neural network (GNN) to capture WAN connectivity and network flows, learning flow features as inputs to downstream allocation. Second, to reduce the problem scale and make learning tractable, Teal employs a multi-agent reinforcement learning (RL) algorithm to independently allocate each traffic demand while optimizing a central TE objective. Finally, Teal fine-tunes allocations with ADMM (Alternating Direction Method of Multipliers), a highly parallelizable optimization algorithm for reducing constraint violations such as overutilized links.

      We evaluate Teal using traffic matrices from Microsoft's WAN. On a large WAN topology with >1,700 nodes, Teal generates near-optimal flow allocations while running several orders of magnitude faster than the production optimization engine. Compared with other TE acceleration schemes, Teal satisfies 6–32% more traffic demand and yields 197–625× speedups.


  • 11:00am–12:00pm     Poster Session

    Slack channel     Location: Lerner Hall

  • 12:00pm–1:30pm     Lunch

  • 1:30pm–2:30pm     Technical Session 7: Application Analytics

    Slack channel     Location: Lerner Hall

  • Fathom: Understanding Datacenter Application Network Performance

    Mubashir Adnan Qureshi, Junhua Yan, Yuchung Cheng, Soheil Hassas Yeganeh, Yousuk Seung, Neal Cardwell, Willem de Bruijn, Van Jacobson (Google), Jasleen Kaur (University of North Carolina at Chapel Hill), David Wetherall, Amin Vahdat (Google)

    • Abstract: We describe our experience with Fathom, a system for identifying the network performance bottlenecks of any service running in the Google fleet. Fathom passively samples RPCs, the principal unit of work for services. It segments the overall latency into host and network components with kernel and RPC stack instrumentation. It records these detailed latency metrics, along with detailed transport connection state, for every sampled RPC. This lets us determine if the completion is constrained by the client, network or server. To scale while enabling analysis, we also aggregate samples into distributions that retain multi-dimensional breakdowns. This provides us with a macroscopic view of individual services. Fathom runs globally in our datacenters for all production traffic, where it monitors billions of TCP connections 24x7. For several years Fathom has been our primary tool for troubleshooting service network issues and assessing network infrastructure changes. We present case studies to show how it has helped us improve our production services.


  • Ditto: Efficient Serverless Analytics with Elastic Parallelism

    Artifacts Available

    Chao Jin, Zili Zhang, Xingyu Xiang, Songyun Zou, Gang Huang, Xuanzhe Liu, Xin Jin (Peking University)

    • Abstract: Serverless computing provides fine-grained resource elasticity for data analytics—a job can flexibly scale its resources for each stage, instead of sticking to a fixed pool of resources throughout its lifetime. Due to different data dependencies and different shuffling overheads caused by intra- and inter-server communication, the best degree of parallelism (DoP) for each stage varies based on runtime conditions.

      We present Ditto, a job scheduler for serverless analytics that leverages fine-grained resource elasticity to optimize for job completion time (JCT) and cost. The key idea of Ditto is to use a new scheduling granularity—stage group—to decouple parallelism configuration from function placement. Ditto bundles stages into stage groups based on their data dependencies and IO characteristics. It exploits the parallelized time characteristics of the stages to determine the parallelism configuration, and prioritizes the placement of stage groups with large shuffling traffic, so that the stages in these groups can leverage zero-copy intra-server communication for efficient shuffling. We build a system prototype of Ditto and evaluate it with a variety of benchmarking workloads. Experimental results show that Ditto outperforms existing solutions by up to 2.5× on JCT and up to 1.8× on cost.


  • Network-Centric Distributed Tracing with DeepFlow: Troubleshooting Your Microservices in Zero Code

    Artifacts Available

    Junxian Shen, Han Zhang (Tsinghua University), Yang Xiang (Yunshan Networks), Xingang Shi, Xinrui Li, Yunxi Shen, Zijian Zhang, Yongxiang Wu, Xia Yin (Tsinghua University), Jilong Wang (Tsinghua University), Mingwei Xu (Tsinghua University), Yahui Li (Tsinghua University, China), Jiping Yin, Jianchang Song, Zhuofeng Li, Runjie Nie (Yunshan Networks)

    • Abstract: Microservices are becoming more complicated, posing new challenges for traditional performance monitoring solutions. On the one hand, the rapid evolution of microservices places a significant burden on the utilization and maintenance of existing distributed tracing frameworks. On the other hand, complex infrastructure increases the probability of network performance problems and creates more blind spots on the network side. In this paper, we present DeepFlow, a network-centric distributed tracing framework for troubleshooting microservices. DeepFlow provides out-of-the-box tracing via a network-centric tracing plane and implicit context propagation. In addition, it eliminates blind spots in network infrastructure, captures network metrics in a low-cost way, and enhances correlation between different components and layers. We demonstrate analytically and empirically that DeepFlow is capable of locating microservice performance anomalies with negligible overhead. DeepFlow has already identified over 71 critical performance anomalies for more than 26 companies and has been utilized by hundreds of individual developers. Our production evaluations demonstrate that DeepFlow is able to save users hours of instrumentation efforts and reduce troubleshooting time from several hours to just a few minutes.


  • Murphy: Performance Diagnosis of Distributed Cloud Applications

    Vipul Harsh (University of Illinois Urbana-Champaign), Wenxuan Zhou (VMware), Sachin Ashok (University of Illinois Urbana-Champaign), Radhika Niranjan Mysore (VMware Research), Brighten Godfrey (UIUC and VMware), Sujata Banerjee (VMware Research)

    • Abstract: Modern cloud-based applications have complex inter-dependencies on both distributed application components and network infrastructure, making it difficult to reason about their performance. As a result, a rich body of work seeks to automate performance diagnosis of enterprise networks and such cloud applications. However, existing methods either ignore inter-dependencies, which results in poor accuracy, or require causal acyclic dependencies, which cannot model common enterprise environments.

      We describe the design and implementation of Murphy, an automated performance diagnosis system, that can work with commonly available telemetry in practical enterprise environments, while achieving high accuracy. Murphy utilizes loosely-defined associations between entities obtained from commonly available monitoring data. Its learning algorithm is based on a Markov Random Field (MRF) that can take advantage of such loose associations to reason about how entities affect each other in the context of a specific incident. We evaluate Murphy in an emulated microservice environment and in real incidents from a large enterprise. Compared to past work, Murphy is able to reduce diagnosis error by ≈ 1.35× in restrictive environments supported by past work, and by ≥ 4.7× in more general environments.


  • 2:30pm–2:45pm     Break

  • 2:45pm–3:45pm     Technical Session 8: On Inference

    Slack channel     Location: Lerner Hall

  • Lightning: A Reconfigurable Photonic-Electronic SmartNIC for Fast and Energy-Efficient Inference

    Artifacts Available

    Zhizhen Zhong, Mingran Yang, Jay Lang, Christian Williams, Liam Kronman, Alexander Sludds, Homa Esfahanizadeh, Dirk Englund, Manya Ghobadi (Massachusetts Institute of Technology)

    • Abstract: The massive growth of machine learning-based applications and the end of Moore's law have created a pressing need to redesign computing platforms. We propose Lightning, the first reconfigurable photonic-electronic smartNIC to serve real-time deep neural network inference requests. Lightning uses a fast datapath to feed traffic from the NIC into the photonic domain without creating digital packet processing and data movement bottlenecks. To do so, Lightning leverages a novel reconfigurable count-action abstraction that keeps track of the required computation operations of each inference packet. Our count-action abstraction decouples the compute control plane from the data plane by counting the number of operations in each task and triggers the execution of the next task(s) without interrupting the dataflow. We evaluate Lightning's performance using four platforms: a prototype, chip synthesis, emulations, and simulations. Our prototype demonstrates the feasibility of performing 8-bit photonic multiply-accumulate operations with 99.25% accuracy. To the best of our knowledge, our prototype is the highest-frequency photonic computing system, capable of serving real-time inference queries at 4.055 GHz end-to-end. Our simulations with large DNN models show that compared to Nvidia A100 GPU, A100X DPU, and Brainwave smartNIC, Lightning accelerates the average inference serve time by 337x, 329x, and 42x, while consuming 352x, 419x, and 54x less energy, respectively.


  • AdaInf: Data Drift Adaptive Scheduling for Accurate and SLO-guaranteed Multiple-Model Inference Serving at Edge Servers

    Sudipta Saha Shubha, Haiying Shen (University of Virginia)

    • Abstract: Various audio and video applications rely on deep neural network (DNN) models deployed on edge servers to conduct inference with ms-level latency service-level objectives (SLOs). The scale of these applications keeps growing, with multiple DNN models incorporated into one application. Accuracy drop from data drift requires conducting both continual retraining and inference serving for multi-model applications in an edge server, which makes it challenging to allocate GPU resources so as to satisfy the tight SLOs while achieving high accuracy. However, no prior research has tackled this challenge. In this paper, we first conduct a trace-based experimental analysis, which shows, among other observations, that data drift affects different models to different degrees (their impact degrees) and that incremental retraining (our proposal, which retrains on certain samples before inference) helps increase accuracy. Leveraging these observations, we propose a data drift Adaptive scheduler for accurate and SLO-guaranteed Inference serving at edge servers (AdaInf). AdaInf uses incremental retraining. It allocates GPU resources among applications based on their SLOs, and for each application, further splits GPU time between retraining and inference to satisfy its SLO, and then allocates GPU time among retraining tasks based on their impact degrees. Besides, AdaInf has strategies to reduce the influence of CPU-GPU memory communications on latency. Our real trace-driven experimental evaluation shows that AdaInf increases accuracy by up to 21% and reduces SLO violations by up to 54% compared to the existing methods, and takes 2ms scheduling time. Achieving similar accuracy as AdaInf requires 4× more GPU resources on the edge server for the existing method.


  • Janus: A Unified Distributed Training Framework for Sparse Mixture-of-Experts Models

    Juncai Liu (Tsinghua University), Jessie Hui Wang (Tsinghua University and Zhongguancun Laboratory), Yimin Jiang (ByteDance)

    • Abstract: Scaling models to large sizes to improve performance has become a trend in deep learning, and the sparsely activated Mixture-of-Experts (MoE) is a promising architecture for scaling models. However, training MoE models in existing systems is expensive, mainly due to the All-to-All communication between layers.

      All-to-All communication originates from the expert-centric paradigm: keeping experts in place and exchanging intermediate data to feed experts. We propose a novel data-centric paradigm: keeping data in place and moving experts between GPUs. Since the size of the experts can be smaller than the size of the data, the data-centric paradigm can reduce the communication workload. Based on this insight, we develop Janus. First, Janus supports fine-grained asynchronous communication, which can overlap computation and communication. Janus implements hierarchical communication to further reduce cross-node traffic by sharing the fetched experts within the same machine. Second, when scheduling the "fetching expert" requests, Janus implements a topology-aware priority strategy to utilize intra-node and inter-node links efficiently. Finally, Janus allows experts to be prefetched, which allows the downstream computation to start immediately once the previous step completes.

      Evaluated on a 32-A100 cluster, Janus can reduce the traffic by up to 16× and achieves up to 2.06× speedup compared with a current MoE training system.


  • Lightwave Fabrics: At-Scale Optical Circuit Switching for Datacenter and Machine Learning Systems

    Hong Liu, Ryohei Urata, Kevin Yasumura, Xiang Zhou, Roy Bannon, Jill Berger, Pedram Dashti, Norm Jouppi, Cedric Lam, Sheng Li, Erji Mao, Daniel Nelson (Google), George Papen (Google/UC San Diego), Mukarram Tariq, Amin Vahdat (Google)

    • Abstract: We describe our experience developing what we believe to be the world’s first large-scale production deployments of lightwave fabrics used for both datacenter networking and machine-learning (ML) applications. Using optical circuit switches (OCSes) and optical transceivers developed in-house, we employ hardware and software codesign to integrate the fabrics into our network and computing infrastructure. Key to our design is a high degree of multiplexing enabled by new kinds of wavelength-division-multiplexing (WDM) and optical circulators that support high-bandwidth bidirectional traffic on a single strand of optical fiber. The development of the requisite OCS and optical transceiver technologies leads to a synchronous lightwave fabric that is reconfigurable, low latency, rate agnostic, and highly available. These fabrics have provided substantial benefits for long-lived traffic patterns in our datacenter networks and predictable traffic patterns in tightly-coupled machine learning clusters. We report results for a large-scale ML superpod with 4096 tensor processing unit (TPU) V4 chips that has more than one ExaFLOP of computing power. For this use case, the deployment of a lightwave fabric provides up to 3× better system availability and model-dependent performance improvements of up to 3.3× compared to a static fabric, despite constituting less than 6% of the total system cost.


  • 3:45pm–4:15pm     Break

  • 4:15pm–5:15pm     Technical Session 9: Quality Experiences

    Slack channel     Location: Lerner Hall

  • Dragonfly: Higher Perceptual Quality For Continuous 360° Video Playback

    Artifacts Available

    Ehab Ghabashneh, Chandan Bothra (Purdue University), Ramesh Govindan, Antonio Ortega (University of Southern California), Sanjay Rao (Purdue University)

    • Abstract: When streaming 360° video, it is possible to reduce bandwidth by 5× with approaches that spatially segment video into tiles and only stream the user’s viewport. Unfortunately, it is difficult to accurately predict a user’s viewport even 2-3 seconds before playback. This results in rebuffering events owing to misprediction of a user’s viewport or network bandwidth dips, which hurts interactive experience. However, avoiding rebuffering by naively skipping tiles that do not arrive by the playback deadline may lead to incomplete viewports and degraded experience.

      In this paper, we describe Dragonfly, a new 360° system that preserves interactive experience by avoiding playback stalls while maintaining high perceptual quality. Dragonfly prudently skips tiles using a model that defines an overall utility function to decide which tiles to fetch, and at which qualities they should be fetched, with the goal of optimizing user experience. To minimize incomplete viewports, it also fetches a low quality masking stream. Using a user study with 26 users and emulation-based experiments we show that Dragonfly has higher quality, and lower overheads, than state-of-the-art 360° streaming approaches. For instance, in our study, 65% of sessions have a rating of 4 or higher (Good/Excellent) with Dragonfly, while only 16% of sessions with Pano, and 13% of sessions with Flare achieve this rating.


  • Ekho: Synchronizing Cloud Gaming Media Across Multiple Endpoints

    Pouya Hamadanian (MIT), Doug Gallatin (Microsoft), Mohammad Alizadeh (MIT), Krishna Chintalapudi (Microsoft Research)

    • Abstract: Online cloud gaming platforms stream game media to multiple endpoints (e.g., a television display and a controller-connected headset) via possibly different networks with considerably different latencies. This leads to the media being played out of sync with one another, and severely degrades user experience. Typical approaches that rely on network and software timing measurements fail to reach synchronization goals. In this work, we propose Ekho, a robust and efficient end-to-end approach for synchronizing streams transmitted to two devices. Ekho adds faint, human-inaudible pseudo-noise (PN) markers to the game audio, and listens for these markers in the chat audio captured by the player's microphone to measure inter-stream delay (ISD). The game server then compensates for the ISD to synchronize the streams. We evaluate Ekho in depth, with a corpus of audio samples from popular online games, and demonstrate that it calculates ISD with sub-millisecond accuracy, has low computational overhead, and is resilient to background chatter, compression and microphone quality. In end-to-end tests over WiFi and cellular links with frequent packet loss and playback disruption, Ekho maintains human-imperceptible ISD (< 10 ms) 86.8% of the time. Without Ekho, the ISD exceeds 50 ms at all times.
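      As an illustration only, and not the authors' algorithm, the marker-based delay measurement described above can be sketched with a plain cross-correlation. The marker format (±1 chips), the gain, the noise levels, and all function names below are assumptions made for this example; the abstract does not specify them.

```python
import random

def make_pn_marker(length, seed):
    """Return a pseudo-noise sequence of +1/-1 chips (hypothetical marker format)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(length)]

def add_marker(audio, marker, offset, gain):
    """Return a copy of `audio` with the marker mixed in at `offset`, scaled by `gain`."""
    out = list(audio)
    for i, chip in enumerate(marker):
        out[offset + i] += gain * chip
    return out

def locate_marker(recording, marker):
    """Return the sample offset at which the correlation with the marker peaks."""
    n = len(marker)
    best_offset, best_score = 0, float("-inf")
    for offset in range(len(recording) - n + 1):
        window = recording[offset:offset + n]
        score = sum(r * c for r, c in zip(window, marker))
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset

def simulate(delay_samples):
    """Estimate the delay (in samples) between the sent and the recorded audio."""
    rng = random.Random(1)
    marker = make_pn_marker(2047, seed=2)
    # Stand-in for game audio: white noise (an assumption for the sketch).
    game_audio = [rng.gauss(0.0, 1.0) for _ in range(2600)]
    sent_offset = 100
    sent = add_marker(game_audio, marker, sent_offset, gain=0.3)
    # The microphone hears the same audio later, plus background noise.
    heard = [0.0] * delay_samples + sent
    heard = [s + rng.gauss(0.0, 0.5) for s in heard]
    return locate_marker(heard, marker) - sent_offset

if __name__ == "__main__":
    print("estimated delay in samples:", simulate(37))
```

      The long marker is what lets a low-gain signal stand out: the correlation peak grows linearly with marker length, while the contribution of uncorrelated audio grows only with its square root.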


  • DBO: Fairness for Cloud-Hosted Financial Exchanges

    Eashan Gupta (UIUC, Microsoft Research), Prateesh Goyal, Ilias Marinos (Microsoft Research), Chenxingyu Zhao (University of Washington, Microsoft Research), Radhika Mittal (UIUC), Ranveer Chandra (Microsoft Research)

    • Abstract: We consider the problem of hosting financial exchanges in the cloud. Exchanges necessitate strong fairness guarantees for competing participants, particularly for use cases such as "high frequency trading". Today, exchanges achieve such guarantees by providing equal latency across all market participants in their on-premise deployments. However, ensuring equal latency for fairness is notably challenging in current multi-tenant cloud deployments, mainly due to factors such as network congestion and non-equidistant network paths.

      In this paper, we address the problem of unfairness stemming from unpredictable and unbounded network latency in cloud networks. Taking inspiration from the use of logical clocks in distributed systems, we present Delivery Based Ordering (DBO), a novel mechanism that guarantees fairness by post-hoc offsetting the latency differences among market participants in the cloud. We thoroughly evaluate DBO in simulation, a bare-metal testbed and a public cloud deployment, and we demonstrate that it is feasible to guarantee fairness while operating at high transaction rates with a sub-100μs end-to-end latency.
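      One way to read "post-hoc offsetting the latency differences" is to rank orders by how long each participant took to respond after market data was delivered to it, rather than by when the order reached the exchange. The sketch below illustrates that reading only; the `Order` fields and both ranking functions are hypothetical names for this example and are not taken from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    trader: str
    data_id: int        # market data update the order responds to (hypothetical field)
    response_us: float  # time from local delivery of that update to order submission
    arrival_us: float   # time at which the order reached the exchange

def rank_by_arrival(orders):
    """Baseline: rank by arrival time, which favors participants on faster paths."""
    return [o.trader for o in sorted(orders, key=lambda o: o.arrival_us)]

def rank_by_delivery(orders):
    """Rank by the update responded to, then by locally measured response time."""
    return [o.trader for o in sorted(orders, key=lambda o: (o.data_id, o.response_us))]

def example_orders():
    """Two traders react to update 1, which the exchange sends at t = 0.

    Trader A: 500 us one-way latency, reacts in 10 us, so the order arrives at 1010 us.
    Trader B:  50 us one-way latency, reacts in 40 us, so the order arrives at  140 us.
    """
    return [
        Order("A", data_id=1, response_us=10.0, arrival_us=1010.0),
        Order("B", data_id=1, response_us=40.0, arrival_us=140.0),
    ]

if __name__ == "__main__":
    orders = example_orders()
    print("by arrival: ", rank_by_arrival(orders))
    print("by delivery:", rank_by_delivery(orders))
```

      In this example trader A reacts faster but sits on a slower path, so arrival order puts B first while the delivery-relative order puts A first.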


  • Exoshuffle: An Extensible Shuffle Architecture

    Artifacts Available

    Frank Sifei Luan, Stephanie Wang, Samyukta Yagati, Sean Kim, Kenneth Lien, Isaac Ong, Tony Hong (UC Berkeley), SangBin Cho, Eric Liang (Anyscale), Ion Stoica (UC Berkeley)

    • Abstract: Shuffle is one of the most expensive communication primitives in distributed data processing and is difficult to scale. Prior work addresses the scalability challenges of shuffle by building monolithic shuffle systems. These systems are costly to develop, and they are tightly integrated with batch processing frameworks that offer only high-level APIs such as SQL. New applications, such as ML training, require more flexibility and finer-grained interoperability with shuffle. They are often unable to leverage existing shuffle optimizations.

      We propose an extensible shuffle architecture. We present LibShuffle, a library for distributed shuffle that offers competitive performance and scalability as well as greater flexibility than monolithic shuffle systems. We design an architecture that decouples the shuffle control plane from the data plane without sacrificing performance. We build LibShuffle on Ray, a distributed futures system for data and ML applications, and demonstrate that we can: (1) rewrite previous shuffle optimizations as application-level libraries with an order of magnitude less code, (2) achieve shuffle performance and scalability competitive with monolithic shuffle systems, and break the CloudSort record as the world's most cost-efficient sorting system, and (3) enable new applications such as ML training to easily leverage scalable shuffle.


  • 7:00pm–9:30pm     Banquet

    Slack channel     Location: cruise (departs from Pier 40; boarding starts at 6:30pm)

  • End of the day

  • Wednesday, September 13, 2023

  • 8:00am–10:00am     Breakfast

    Location: Lerner Hall

  • 9:30am–10:30am     Technical Session 10: Equity

    Slack channel     Location: Lerner Hall

  • Decoding the Divide: Analyzing Disparities in Broadband Plans Offered by Major US ISPs

    Udit Paul, Vinothini Gunasekaran, Jiamo Liu (University of California, Santa Barbara), Tejas N. Narechania (University of California, Berkeley), Arpit Gupta (University of California, Santa Barbara), Elizabeth Belding (University of California, Santa Barbara)

    • Abstract: Digital equity in Internet access is often measured along three axes: availability, affordability, and adoption. Most prior work focuses on availability; the other two aspects have received little attention. In this paper, we study broadband affordability in the US. Specifically, we focus on the nature of broadband plans offered by major ISPs across the US. To this end, we develop a broadband plan querying tool (BQT) that obtains broadband plans (upload/download speed and price) offered by seven major ISPs for any street address in the US. We then use this tool to curate a dataset, querying broadband plans for over 837 k street addresses in thirty cities for seven major wireline broadband ISPs. We use a plan's carriage value, the Mbps of a user's traffic that an ISP carries for one dollar, to compare plans. Our analysis provides us with the following new insights: (1) ISP plans vary inter-city. Specifically, the fraction of census block groups that receive high and low carriage value plans varies widely by city; (2) ISP plans intra-city are spatially clustered, and the carriage value can vary as much as 600% within a city; (3) Cable-based ISPs offer up to 30% more carriage value to users when competing with fiber-based ISPs in a block group; and (4) Average income in a block group plays a critical role in dictating who gets a fiber deployment (i.e., a better carriage value) in the US. While we hope our tool, dataset, and analysis in their current form are helpful for policymakers at different levels (city, county, state), they are only a small step toward understanding digital equity. Based on our learnings, we conclude with recommendations to continue to advance our understanding of broadband affordability.
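      The carriage value metric defined above (Mbps carried per dollar) can be sketched as follows. The plans are invented for illustration, and both the choice of which speed to use and the way the spread is expressed as a percentage are assumptions, not the paper's definitions.

```python
def carriage_value(speed_mbps, monthly_price_usd):
    """Mbps of traffic carried per dollar of monthly price."""
    if monthly_price_usd <= 0:
        raise ValueError("price must be positive")
    return speed_mbps / monthly_price_usd

def relative_spread_percent(values):
    """Spread between best and worst value, as a percentage of the worst (assumed definition)."""
    lo, hi = min(values), max(values)
    return 100.0 * (hi - lo) / lo

# Hypothetical plans: (speed in Mbps, monthly price in USD).
plans = [(100, 50), (300, 60), (1000, 80)]
values = [carriage_value(speed, price) for speed, price in plans]
print(values)                           # carriage value of each plan
print(relative_spread_percent(values))  # spread across these plans
```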


  • A Framework for Improving Web Affordability and Inclusiveness

    Artifacts Available

    Rumaisa Habib, Sarah Tanveer, Aimen Inam, Haseeb Ahmed, Ayesha Ali, Zartash Afzal Uzmi, Zafar Ayyub Qazi, Ihsan Ayyub Qazi (LUMS)

    • Abstract: Today’s Web remains too expensive for many Internet users, especially in developing regions. Unfortunately, the rising complexity of the Web makes affordability an even bigger concern as it stands to limit users’ access to Internet services. We propose a novel framework and a fairness metric for rethinking Web architecture for affordability and inclusion. Our proposed framework systematically adapts Web complexity based on geographic variations in mobile broadband prices and income levels. We conduct a cross-country analysis of 99 countries, showing that our framework can better balance affordability and webpage quality while preserving user privacy. To adapt Web complexity, our framework solves an optimization problem to produce webpages that maximize page quality while reducing the webpage to a given target size.


  • Destination Unreachable: Characterizing Internet Outages and Shutdowns

    Artifacts Available

    Zachary S. Bischof (Georgia Tech), Kennedy Pitcher (UC San Diego), Esteban Carisimo (Northwestern University), Amanda Meng (Georgia Tech), Rafael Bezerra Nunes (Yale University), Ramakrishna Padmanabhan (Amazon), Margaret E. Roberts (University of California at San Diego), Alex C. Snoeren (UC San Diego), Alberto Dainotti (Georgia Tech)

    • Abstract: In this paper, we provide the first comprehensive longitudinal analysis of government-ordered Internet shutdowns and spontaneous outages (i.e., disruptions not ordered by the government). We describe the available tools, data sources and methods to identify and analyze Internet shutdowns. We then merge manually curated datasets on known government-ordered shutdowns and large-scale Internet outages, further augmenting them with data on real-world events, macroeconomic and sociopolitical indicators, and network operator statistics. Our analysis confirms previous findings on the economic and political profiles of countries with government-ordered shutdowns. Extending this analysis, we find that countries with national-scale spontaneous outages often have profiles similar to countries with shutdowns, differing from countries that experience neither. However, we find that government-ordered shutdowns are many times more likely to occur on days of mobilization, coinciding with elections, protests, and coups. Our study also characterizes the temporal characteristics of Internet shutdowns and finds that they differ significantly in terms of duration, recurrence interval, and start times when compared to spontaneous outages.


  • Global, Passive Detection of Connection Tampering

    Ram Sundara Raman (University of Michigan, Cloudflare Inc.), Louis-Henri Merino (EPFL), Kevin Bock (University of Maryland), Marwan Fayed (Cloudflare Inc.), Dave Levin (University of Maryland), Nick Sullivan, Luke Valenta (Cloudflare Inc.)

    • Abstract: In-network devices around the world monitor and tamper with connections for many reasons, including intrusion prevention, combating spam or phishing, and country-level censorship. Connection tampering seeks to block access to specific domain names or keywords, and it affects billions of users worldwide with little-to-no transparency. To detect, diagnose, and measure connection-level blocking, "active" measurement techniques originate queries with domains or keywords believed to be blocked and send them from vantage points within networks of interest. Active measurement efforts have been critical to understanding how traffic tampering occurs, but they inherently are unable to capture critical parts of the picture. For instance, knowing the set of domains in a block-list (i.e., what *could* get blocked) is not the same as knowing what real users are actively experiencing (i.e., what is *actively* getting blocked).

      We present the first global study of connection tampering through a *passive* analysis of traffic received at a global CDN, Cloudflare. We analyze a sample of traffic to all of Cloudflare's servers to construct the first comprehensive list of *tampering signatures*: sequences of packet headers that are indicative of connection tampering. We then apply these tampering signatures to analyze our global dataset of real user traffic, yielding a more comprehensive view of connection tampering than has been possible with active measurements alone. In particular, our passive analysis allows us to report on how connection tampering is actively affecting users and clients from virtually every network, without active probes, vantage points in difficult-to-reach networks and regions, or test lists (which we analyze for completeness against our results). Our study shows that passive measurement can be a powerful complement to active measurement in understanding connection tampering and improving transparency.


  • 10:30am-10:45am     Break

  • 10:45am–11:55am     Technical Session 11: Multiple Paths

    Slack channel     Location: Lerner Hall

  • Converge: QoE-driven Multipath Video Conferencing over WebRTC

    Sandesh Dhawaskar Sathyanarayana (University of Colorado Boulder), Kyunghan Lee (Seoul National University), Dirk Grunwald, Sangtae Ha (University of Colorado Boulder)

    • Abstract: Video conferencing has become a daily necessity, but protocols to support video conferencing have yet to keep pace despite the innovation in next-generation networks. As video resolutions increase and mobile applications using multiple cameras for photos and videos become popular, the need to meet the Quality of Experience (QoE) requirements is growing. Multipath protocols could be a possible solution.

      In this paper, we show that a straightforward extension of WebRTC for supporting multipath can perform worse than legacy WebRTC and propose CONVERGE, a WebRTC-compliant multipath video conferencing platform. CONVERGE improves QoE through three main components: a video-aware scheduler, video QoE feedback, and video-aware and path-specific packet protection. The video-aware scheduler uses the real-time video structure to schedule packets, and the video QoE feedback from the receiver helps the scheduler adjust the number of packets on each path. Additionally, the video-aware and path-specific packet protection mechanism improves on the existing FEC mechanism in WebRTC by considering the trade-off between FEC and QoE. CONVERGE is built as part of the Chromium browser, making it compatible with any device or network path. CONVERGE improves overall media throughput by 1.2×, reduces end-to-end latency by 20%, and enhances image quality by 55% compared to WebRTC.


  • Resilient Baseband Processing in Virtualized RANs with Slingshot

    Nikita Lazarev (MIT), Tao Ji (The University of Texas at Austin), Anuj Kalia (Microsoft), Daehyeok Kim (The University of Texas at Austin and Microsoft), Ilias Marinos, Francis Y. Yan (Microsoft), Christina Delimitrou (MIT), Zhiru Zhang (Cornell University), Aditya Akella (The University of Texas at Austin)

    • Abstract: In cellular networks, there is a growing adoption of virtualized radio access networks (vRANs), where operators are replacing the traditional specialized hardware for RAN processing with software running on commodity servers. Today's vRAN deployments lack resilience, since there is no support for vRAN failover or upgrades without long service interruptions. Enabling these features for vRANs is challenging because of their strict real-time latency requirements and black-box nature. Slingshot is a new system that transparently provides resilience for the vRAN's most performance-critical layer: the physical layer (PHY). We design new techniques for realtime workload migration with fast RAN protocol middleboxes, and realtime RAN failure detection. A key insight in our design is to view the transient disruptions from resilience events to RAN computation state and I/O similarly to regular wireless signal impairments, and leverage the inherent resilience of cellular networks to these events. Experiments with a state-of-the-art 5G vRAN testbed show that Slingshot handles PHY failover with no disruption to video conferencing, and under 110ms of disruption to a TCP connection, and it also enables zero-downtime upgrades.


  • CellFusion: Multipath Vehicle-to-Cloud Video Streaming with Network Coding in the Wild

    Yunzhe Ni (Alibaba Cloud & Peking Univ.), Zhilong Zheng, Xianshang Lin, Fengyu Gao, Xuan Zeng, Yirui Liu, Tao Xu, Hua Wang, Zhidong Zhang, Senlang Du, Guang Yang, Yuanchao Su, Dennis Cai (Alibaba Cloud), Hongqiang Harry Liu (Uber Technology), Chenren Xu (Peking Univ.), Ennan Zhai, Yunfei Ma (Alibaba Cloud)

    • Abstract: This paper presents CellFusion, a system designed for high-quality, real-time video streaming from vehicles to the cloud. It leverages an innovative blend of multipath QUIC transport and network coding. Surpassing the limitations of individual cellular carriers, CellFusion uses a unique last-mile overlay that integrates multiple cellular networks into a single, unified cloud connection. This integration is made possible through the use of in-vehicle Customer Premises Equipment (CPEs) and edge-cloud proxy servers.

      In order to effectively handle unstable cellular connections prone to intense burst losses and unexpected latency spikes as a vehicle moves, CellFusion introduces XNC. This innovative network coding-based transport solution enables efficient and resilient multipath transport. XNC aims to accomplish low latency, minimal traffic redundancy, and reduced computational complexity all at once. CellFusion is secure and transparent by nature and does not require modifications for vehicular apps connecting to it.

      We tested CellFusion on 100 self-driving vehicles for over six months with our cloud-native back-end running on 50 CDN PoPs. Through extensive road tests, we show that XNC reduced video packet delay by 71.53% at the 99th percentile versus 5G. At 30Mbps, CellFusion achieved 66.11% ∼ 80.62% reduction in video stall ratio versus state-of-the-art multipath transport solutions with less than 10% traffic redundancy.


  • Improving Network Availability with Protective ReRoute

    David Wetherall (Google), Abdul Kabbani (Microsoft), Van Jacobson, Jim Winget, Yuchung Cheng, Charles B. Morrey III, Uma Moravapalle, Phillipa Gill, Steven Knight, Amin Vahdat (Google)

    • Abstract: We present PRR (Protective ReRoute), a transport technique for shortening user-visible outages that complements routing repair. It can be added to any transport to provide benefits in multipath networks. PRR responds to flow connectivity failure signals, e.g., retransmission timeouts, by changing the FlowLabel on packets of the flow, which causes switches and hosts to choose a different path that may avoid the outage. To enable it, we shifted our IPv6 network architecture to use the FlowLabel, so that hosts can select paths for their flows transparently to applications. PRR is deployed fleetwide at Google for TCP and Pony Express, where it has been protecting all production traffic for several years. It is also available to our Cloud customers. We find it highly effective for real outages. In a measurement study on our network backbones, adding PRR reduced the cumulative region-pair outage time for RPC traffic by 63-84%. This is the equivalent of adding 0.4–0.8 "nines" of availability.
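      The conversion from outage-time reduction to "nines" quoted above follows from the usual definition of nines as -log10(unavailability): cutting outage time by a fraction r multiplies unavailability by (1 - r), which adds -log10(1 - r) nines. A minimal sketch of that arithmetic (the function names are illustrative only):

```python
import math

def nines(availability):
    """Number of nines for an availability in [0, 1), e.g. 0.999 is about 3 nines."""
    return -math.log10(1.0 - availability)

def nines_gained(outage_reduction):
    """Extra nines from cutting outage time by the given fraction (0 < r < 1)."""
    return -math.log10(1.0 - outage_reduction)

for reduction in (0.63, 0.84):
    print(f"{reduction:.0%} less outage time adds {nines_gained(reduction):.2f} nines")
```

      Rounded to one decimal place, the two endpoints of the reported 63-84% range give the 0.4 and 0.8 figures in the abstract.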


  • XRON: A Hybrid Elastic Cloud Overlay Network for Video Conferencing at Planetary Scale

    Bingyang Wu (Peking University and Alibaba Cloud), Kun Qian, Bo Li, Yunfei Ma, Qi Zhang, Zhigang Jiang, Jiayu Zhao, Dennis Cai, Ennan Zhai (Alibaba Cloud), Xuanzhe Liu, Xin Jin (Peking University)

    • Abstract: Quality and cost are two key considerations for video conferencing services. Video conferencing providers face a dilemma when selecting network tiers to build their infrastructure: relying on Internet links yields poor video conferencing quality, while using premium links incurs excessive cost.

      We present XRON, a hybrid elastic cloud overlay network for our planetary-scale video conferencing service. XRON differs from prior overlay networks in two distinct features. First, XRON is a hybrid overlay that leverages both Internet links and premium links to achieve both high quality and low cost. Second, XRON exploits elastic cloud resources to adaptively scale its capacity based on realtime demand. The data plane of XRON combines active probing and passive tracking for scalable link state monitoring, uses asymmetric forwarding based on heterogeneous bidirectional link qualities, and quickly reacts to sudden link degradations without control plane involvement. The control plane of XRON predicts video traffic based on application knowledge, and computes global forwarding paths and reaction plans with scalable algorithms. Large-scale production deployment shows that XRON reduces video stall ratio and bad audio fluency by 77% and 65.2%, respectively, compared to using Internet links only, and reduces cost by 4.73×, compared to using premium links only.


  • 12:00pm–1:30pm     Lunch

  • 12:00pm–1:30pm     Student Research Competition

    Location: Lerner Hall

  • 1:30pm–2:30pm     Technical Session 12: Video Analysis

    Slack channel     Location: Lerner Hall

  • ZGaming: Zero-Latency 3D Cloud Gaming by Image Prediction

    Jiangkai Wu, Yu Guan (Peking University), Qi Mao (Communication University of China), Yong Cui (Tsinghua University), Zongming Guo, Xinggong Zhang (Peking University)

    • Abstract: In cloud gaming, interactive latency is one of the most important factors in users' experience. Although the interactive latency can be reduced through typical network infrastructures like edge caching and congestion control, the interactive latency of current cloud-gaming platforms is still far from satisfying users. This paper presents ZGaming, a novel 3D cloud gaming system based on image prediction, in order to eliminate the interactive latency in traditional cloud gaming systems. To improve the quality of the predicted images, we propose (1) a quality-driven 3D-block cache to reduce the "hole" artifacts, (2) a server-assisted LSTM-predicting algorithm to improve the prediction accuracy of dynamic foreground objects, and (3) a prediction-performance-driven adaptive bitrate strategy which optimizes the quality of predicted images. Experiments under real-world cloud gaming network conditions show that, compared with existing methods, ZGaming reduces the interactive latency from 23 ms to 0 ms when providing the same video quality, or improves the video quality by 5.4 dB when keeping the interactive latency at 0 ms.


  • PacketGame: Multi-Stream Packet Gating for Concurrent Video Inference at Scale

    Mu Yuan, Lan Zhang, Xuanke You, Xiang-Yang Li (University of Science and Technology of China)

    • Abstract: The resource efficiency of video analytics workloads is critical for large-scale deployments on edge nodes and cloud clusters. Recent advanced systems have benefited from techniques including video compression, frame filtering, and deep model acceleration. However, based on our year-long experience of operating a real-time video analytics system on more than 1000 cameras, we identified a previously overlooked bottleneck of end-to-end concurrency: video decoding. To support concurrent video inference at scale, in this work, we investigate a new task, named video packet gating, which selectively filters packets before running a decoder. We propose a novel multi-view embedding approach for video packets and present PacketGame, which has both a theoretical performance guarantee and a practical system design. Experiments on both public datasets and a real system show that PacketGame saves 52.0-79.3% of decoding costs and achieves 2.1-4.8× concurrency compared to original workloads. Comparisons with four state-of-the-art complementary methods show the superiority of PacketGame in end-to-end concurrency.


  • Veritas: Answering Causal Queries from Video Streaming Traces

    Artifacts Available

    Chandan Bothra, Jianfei Gao, Sanjay Rao, Bruno Ribeiro (Purdue University)

    • Abstract: In this paper, we consider the task of answering what-if questions in the context of adaptive bit rate (ABR) video streaming without access to randomized control trials (RCTs) (e.g., no A/B testing) – i.e., given recorded data of an existing deployed system, what would be the performance impact if we changed its design. Our work makes three contributions. First, we show the problem is challenging since data may only be available for a single ABR algorithm without RCTs, and since it is necessary to deal with the cascading effects that past ABR decisions have on future decisions. Second, we present Veritas, the first framework that tackles causal reasoning for video streaming without requiring data collected through RCTs. Integral to Veritas is an easy-to-interpret domain-specific ML model that relates the latent stochastic process (intrinsic bandwidth that the video session can achieve) to actual observations (download times), while exploiting counterfactual queries via abduction using the observed TCP states (e.g., congestion window) for blocking the cascading dependencies. Third, we evaluate Veritas’s ability to accurately answer a wide range of what-if questions using emulation experiments, and data of real video sessions from Puffer. The results show that (i) Veritas accurately tackles a wider range of what-if questions (e.g., change of buffer size or video quality) that existing approaches cannot; (ii) Veritas without RCT training data achieves performance comparable to or better than a recent parallel approach that requires RCT data; and (iii) in many scenarios Veritas achieves accuracy close to an ideal oracle.


  • Sammy: smoothing video traffic to be a friendly internet neighbor

    Artifacts Available

    Bruce Spang (Stanford University), Shravya Kunamalla, Renata Teixeira, Te-Yuan Huang, Grenville Armitage (Netflix), Ramesh Johari, Nick McKeown (Stanford University)

    • Abstract: On-demand streaming video traffic is managed by an adaptive bitrate (ABR) algorithm whose job is to optimize quality of experience (QoE) for a single video session. ABR algorithms leave the question of sharing network resources up to transport-layer algorithms. We observe that as the internet gets faster relative to video streaming rates, this delegation of responsibility gives video traffic a burstier on-off traffic pattern. In this paper, we show we can substantially smooth video traffic to improve its interactions with the rest of the internet, while maintaining the same or better QoE for streaming video. We smooth video traffic with two design principles: application-informed pacing, which allows ABR algorithms to set an upper limit on packet-by-packet throughput, and by designing ABR algorithms that work with pacing. We propose a joint ABR and rate-control scheme, called Sammy, which selects both video quality and pacing rates. We implement our scheme and evaluate it at a large video streaming service. Our approach smooths video, making it a more friendly neighbor to other internet applications. One surprising result is that it requires no compromise for the video traffic: in large scale, production experiments, Sammy improves video QoE over an existing, finely-tuned production ABR algorithm.


  • 2:30pm-2:45pm     Break

  • 2:45pm–3:45pm     Technical Session 13: Data Center Programming

    Slack channel     Location: Lerner Hall

  • Achelous: Enabling Programmability, Elasticity, and Reliability in Hyperscale Cloud Networks

    Chengkun Wei (Zhejiang University), Xing Li (Zhejiang University and Alibaba Cloud), Ye Yang (Alibaba Cloud and Zhejiang University), Xiaochong Jiang, Tianyu Xu (Zhejiang University), Bowen Yang, Taotao Wu, Chao Xu, Yilong Lv, Haifeng Gao, Zhentao Zhang, Zikang Chen (Alibaba Cloud), Zeke Wang, Zihui Zhang (Zhejiang University), Shunmin Zhu (Tsinghua University and Alibaba Cloud), Wenzhi Chen (Zhejiang University)

    • Abstract: Cloud computing has witnessed tremendous growth, prompting enterprises to migrate to the cloud for reliable and on-demand computing. Within a single Virtual Private Cloud (VPC), the number of instances (such as VMs, bare metals, and containers) has reached millions, posing challenges related to supporting millions of instances with network location decoupling from the underlying hardware, high elastic performance, and high reliability. However, academic studies have primarily focused on specific issues like high-speed data plane and virtualized routing infrastructure, while existing industrial network technologies fail to adequately address these challenges.

      In this paper, we report on the design and experience of Achelous, Alibaba Cloud’s network virtualization platform. Achelous consists of three key designs to enhance hyperscale VPC: (i) a novel hierarchical programming architecture based on the collaborative design of both data plane and control plane; (ii) an elastic performance strategy and distributed ECMP schemes for seamless scale-up and scale-out, respectively; (iii) a health check scheme and transparent VM live migration mechanisms that ensure stateful flow continuity during failover. The evaluation results demonstrate that Achelous scales to over 1,500,000 VMs with elastic network capacity in a single VPC and reduces programming time by 25×, with 99% of updates completed within 1 second. For failover, it reduces downtime during VM live migration by 22.5× and ensures that 99.99% of applications do not experience stalls. More importantly, the experience from three years of operation demonstrates Achelous's serviceability and its versatility independent of any specific hardware platform.


  • Klotski: Efficient and Safe Network Migration of Large Production Datacenters

    Yihao Zhao (Peking University), Xiaoxiang Zhang (Meta), Hang Zhu (Johns Hopkins University), Ying Zhang, Zhaodong Wang (Meta), Yuandong Tian (Meta AI), Alex Nikulkov, Joao Ferreira (Meta), Xuanzhe Liu, Xin Jin (Peking University)

    • Abstract: This paper presents the design, implementation, evaluation, and deployment of Meta's production network migration system. We first introduce the network migration problem for large-scale production datacenter networks (DCNs). A network migration task at Meta touches as many as hundreds of switches and tens of thousands of circuits per datacenter, and involves physical deployment work on site that can last months. We describe real-world migration challenges, covering complex and evolving DCN architectures and operational constraints. We mathematically formalize the problem of generating efficient and safe migration plans, and exploit the inherent symmetry and locality of DCN topologies to prune the search space. We design an ordering-agnostic compact topology representation to eliminate redundant satisfiability checking, and apply the A* algorithm with a domain-specific priority function to find the optimal plan. Evaluation results on a range of production migration cases show that Klotski reduces the time to find optimal migration plans by up to 381× compared to prior solutions. We hope by introducing the problem and sharing our deployment experience, this work can provide a useful context for network migration in the real world and inspire future research.


  • ClickINC: In-network Computing as a Service in Heterogeneous Programmable Data-center Networks

    Wenquan Xu, Zijian Zhang, Yong Feng (Tsinghua University), Haoyu Song (Futurewei Technologies), Zhikang Chen (Tsinghua University), Wenfei Wu (Peking University), Guyue Liu (New York University Shanghai), Yinchao Zhang, Shuxin Liu, Zerui Tian, Bin Liu (Tsinghua University)

    • Abstract: In-Network Computing (INC) has found many applications for performance boosts or cost reduction. However, given heterogeneous devices, diverse applications, and multi-path network topologies, it is cumbersome and error-prone for application developers to effectively utilize the available network resources and gain predictable benefits without impeding normal network functions. Previous work is oriented to network operators more than application developers. We develop ClickINC to streamline INC programming and deployment using a unified and automated workflow. ClickINC provides INC developers with modular programming abstractions, without requiring them to be concerned with the states of the devices or the network topology. We describe the ClickINC framework, model, language, workflow, and corresponding algorithms. Experiments on both an emulator and a prototype system demonstrate its feasibility and benefits.


  • Network Load Balancing with In-network Reordering Support for RDMA

    Artifacts Available

    Cha Hwan Song (National University of Singapore), Xin Zhe Khooi, Raj Joshi, Inho Choi, Jialin Li, Mun Choon Chan (National University of Singapore)

    • Abstract: Remote Direct Memory Access (RDMA) is widely used in high-performance computing (HPC) and data center networks. In this paper, we first show that RDMA does not work well with existing load balancing algorithms because of its traffic flow characteristics and assumption of in-order packet delivery. We then propose ConWeave, a load balancing framework designed for RDMA. The key idea of ConWeave is that with the right design, it is possible to perform fine-grained rerouting and mask the effect of out-of-order packet arrivals transparently in the network datapath using a programmable switch. We have implemented ConWeave on a Tofino2 switch. Evaluations show that ConWeave can achieve up to 42.3% and 66.8% improvement for average and 99th-percentile FCT, respectively, compared to state-of-the-art load balancing algorithms.


  • 3:45pm-4:15pm     Break

  • 4:15pm–5:15pm     Technical Session 14: Telemetry

    Slack channel     Location: Lerner Hall

  • Direct Telemetry Access

    Artifacts Available

    Jonatan Langlet (Queen Mary University of London), Ran Ben Basat (University College London), Gabriele Oliaro (Carnegie Mellon University), Michael Mitzenmacher, Minlan Yu (Harvard University), Gianni Antichi (Politecnico di Milano)

    • Abstract: Fine-grained network telemetry is becoming a modern datacenter standard and is the basis of essential applications such as congestion control, load balancing, and advanced troubleshooting. As network size increases and telemetry gets more fine-grained, there is tremendous growth in the amount of data that must be reported from switches to collectors to enable a network-wide view. As a consequence, it is increasingly hard to scale data collection systems.

      We introduce Direct Telemetry Access (DTA), a solution optimized for aggregating and moving hundreds of millions of reports per second from switches into queryable data structures in collectors’ memory. DTA is lightweight and it is able to greatly reduce overheads at collectors. DTA is built on top of RDMA, and we propose novel and expressive reporting primitives to allow easy integration with existing state-of-the-art telemetry mechanisms such as INT or Marple.

      We show that DTA significantly improves telemetry collection rates. For example, when used with INT, it can collect and aggregate over 400M reports per second with a single server, improving over the Atomic MultiLog by up to 16×.


  • GGFAST: Automating Generation of Flexible Network Traffic Classifiers

    Julien Piet (Corelight and University of California Berkeley), Dubem Nwoji (Corelight), Vern Paxson (Corelight and University of California Berkeley)

    • Abstract: When employing supervised machine learning to analyze network traffic, the heart of the task often lies in developing effective features for the ML to leverage. We develop GGFAST, a unified, automated framework that can build powerful classifiers for specific network traffic analysis tasks, built on interpretable features. The framework uses only packet sizes, directionality, and sequencing, facilitating analysis in a payload-agnostic fashion that remains applicable in the presence of encryption. GGFAST analyzes labeled network data to identify n-grams (“snippets”) in a network flow’s sequence-of-message-lengths that are strongly indicative of given categories of activity. The framework then produces a classifier that, given new (unlabeled) network data, identifies the activity to associate with each flow by assessing the presence (or absence) of snippets relevant to the different categories. We demonstrate the power of our framework by building—without any case-specific tuning—highly accurate analyzers for multiple types of network analysis problems. These span traffic classification (L7 protocol identification), finding DNS-over-HTTPS in TLS flows, and identifying specific RDP and SSH authentication methods. Finally, we demonstrate how, given ciphersuite specifics, we can transform a GGFAST analyzer developed for a given type of traffic to automatically detect instances of that activity when tunneled within SSH or TLS.
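
      The snippet idea described in the abstract can be illustrated with a minimal sketch. It assumes a flow is given as a list of signed message lengths (positive for one direction, negative for the other) and keeps n-grams that occur almost exclusively under one label. The function names, the sign convention, and the `min_share` threshold are hypothetical choices for illustration, not GGFAST's actual algorithm or API.

```python
from collections import Counter

def snippets(lengths, n=3):
    """Return the set of n-grams over a flow's signed message lengths."""
    return {tuple(lengths[i:i + n]) for i in range(len(lengths) - n + 1)}

def indicative_snippets(labeled_flows, n=3, min_share=0.9):
    """Keep snippets whose occurrences fall mostly under a single label."""
    per_snippet = {}  # snippet -> Counter of labels it was seen with
    for lengths, label in labeled_flows:
        for s in snippets(lengths, n):
            per_snippet.setdefault(s, Counter())[label] += 1
    chosen = {}
    for s, counts in per_snippet.items():
        label, hits = counts.most_common(1)[0]
        if hits / sum(counts.values()) >= min_share:
            chosen[s] = label
    return chosen

def classify(lengths, chosen, n=3):
    """Vote over the known snippets present in an unlabeled flow."""
    votes = Counter(chosen[s] for s in snippets(lengths, n) if s in chosen)
    return votes.most_common(1)[0][0] if votes else None

# Toy labeled flows (made-up lengths, for illustration only).
train = [
    ([120, -80, 40, -1500], "ssh"),
    ([120, -80, 40, -900], "ssh"),
    ([517, -1400, 64, -300], "tls"),
    ([517, -1400, 64, -31], "tls"),
]
model = indicative_snippets(train)
guess = classify([120, -80, 40, -77], model)
unknown = classify([1, 2, 3], model)
```

      Because only lengths and direction are used, the sketch never inspects payload bytes, which mirrors the payload-agnostic property the abstract emphasizes.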


  • OmniWindow: A General and Efficient Window Mechanism Framework for Network Telemetry

    Artifacts Available

    Haifeng Sun, Jiaheng Li, Jintao He, Jie Gui, Qun Huang (Peking University)

    • Abstract: Recent network telemetry solutions typically target programmable switches to achieve high performance and in-network visibility. They partition the packet stream into windows and then apply various stream processing techniques to summarize flow-level statistics. However, existing studies focus on the measurement within each window. Window management is still a missing piece due to the resource limitations of programmable switches. In this paper, we propose OmniWindow, a general and efficient window mechanism framework. OmniWindow splits the original window into fine-grained sub-windows such that the sub-windows can be merged into various window types. To deal with the resource restrictions, OmniWindow carefully designs its data plane memory layout and proposes a window synchronization method. It also employs a collaborative architecture that can collect and reset stateful data in sub-windows within a limited time. We prototype OmniWindow on Tofino. We incorporate OmniWindow into a SOTA query-driven telemetry system and eight sketch-based telemetry algorithms. Our experiments demonstrate that OmniWindow enables these telemetry solutions to achieve higher accuracy than conventional window mechanisms.
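
      The sub-window idea can be sketched on the host side as follows. This is a minimal illustration that assumes each sub-window is a per-flow packet counter and that windows are built by summing consecutive sub-windows; the `windows` helper and the toy data are hypothetical, and the sketch ignores the switch memory layout and synchronization that the paper addresses.

```python
from collections import Counter

def merge(sub_windows):
    """Sum per-flow counters from several consecutive sub-windows."""
    total = Counter()
    for sw in sub_windows:
        total.update(sw)  # Counter.update adds counts
    return total

def windows(sub_windows, size, step):
    """Build windows of `size` sub-windows, advancing by `step` each time.

    step == size gives tumbling windows; step < size gives sliding windows.
    """
    last_start = len(sub_windows) - size
    return [merge(sub_windows[i:i + size]) for i in range(0, last_start + 1, step)]

# Four toy sub-windows of per-flow packet counts.
subs = [Counter(a=1), Counter(a=2, b=1), Counter(b=3), Counter(a=4)]
tumbling = windows(subs, size=2, step=2)
sliding = windows(subs, size=2, step=1)
```

      The same sub-window data serves both window types, so the window type can be chosen after measurement rather than fixed in advance.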


  • ChameleMon: Shifting Measurement Attention as Network State Changes

    Kaicheng Yang, Yuhan Wu, Ruijie Miao, Tong Yang, Zirui Liu, Zicang Xu, Rui Qiu, Yikai Zhao, Hanglong Lv (Peking University), Zhigang Ji (Huawei Technologies Co., Ltd.), Gaogang Xie (CNIC CAS)

    • Abstract: Network measurement is critical to many network applications. There are mainly two kinds of flow-level measurement tasks: 1) packet accumulation tasks and 2) packet loss tasks. In practice, the two kinds of tasks are often required at the same time, but existing works seldom handle both. In this paper, we design ChameleMon to support the two kinds of tasks simultaneously. The key design of ChameleMon is to shift measurement attention as network state changes, through two dimensions of dynamics: 1) dynamically allocating memory between the two kinds of tasks; 2) dynamically monitoring the flows of importance. To realize the key design, we propose a key technique, leveraging Fermat’s little theorem to devise a flexible data structure, namely FermatSketch. FermatSketch is dividable, additive, and subtractive, supporting the two kinds of tasks. We have implemented a ChameleMon prototype on a testbed with a Fat-tree topology. We conduct extensive experiments and the results show ChameleMon supports the two kinds of tasks with low memory/bandwidth overhead, and more importantly, it can automatically shift measurement attention as network state changes.
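
      As a rough illustration of how Fermat's little theorem can support an additive and subtractive structure, the sketch below keeps a packet count and a modular sum of flow IDs per bucket. When a bucket holds a single flow, the ID is recovered by multiplying the sum by the modular inverse of the count, computed as count^(P-2) mod P. The `Bucket` class, the prime `P`, and the single-bucket layout are assumptions for illustration, not the FermatSketch design from the paper.

```python
P = (1 << 61) - 1  # a Mersenne prime; flow IDs are assumed to be smaller than P

class Bucket:
    def __init__(self, count=0, id_sum=0):
        self.count = count    # number of packets recorded
        self.id_sum = id_sum  # sum of flow IDs over those packets, mod P

    def insert(self, flow_id, packets=1):
        self.count += packets
        self.id_sum = (self.id_sum + flow_id * packets) % P

    def __sub__(self, other):
        """Subtract another bucket, e.g. downstream from upstream."""
        return Bucket(self.count - other.count, (self.id_sum - other.id_sum) % P)

    def decode(self):
        """Recover the flow ID, valid only if one flow remains in the bucket."""
        if self.count % P == 0:
            return None
        # Fermat's little theorem: a^(P-2) is the inverse of a modulo prime P.
        inverse = pow(self.count % P, P - 2, P)
        return self.id_sum * inverse % P

# Upstream sees 10 packets of flow 3232235521 and 5 of flow 42;
# downstream sees all 10 of the first flow but only 3 of flow 42.
upstream, downstream = Bucket(), Bucket()
upstream.insert(3232235521, 10)
upstream.insert(42, 5)
downstream.insert(3232235521, 10)
downstream.insert(42, 3)

lost = upstream - downstream
lost_flow = lost.decode()
```

      Here the difference bucket holds two packets of a single flow, so decoding yields that flow's ID. A practical structure would also need a way to check that a bucket really contains one flow before trusting the decoded ID.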


  • 6:00pm–8:00pm     Community Meeting

    Slack channel     Location: Lerner Hall

  • End of the day

  • Thursday, September 14, 2023

  • 8:00am–10:00am     Breakfast

    Location: Lerner Hall

  • 9:00am–10:00am     Technical Session 15: All Layers Considered

    Slack channel     Location: Lerner Hall

  • IPv6 Hitlists at Scale: Be Careful What You Wish For

    Artifacts Available

    Erik Rye, Dave Levin (University of Maryland)

    • Abstract: Today's network measurements rely heavily on Internet-wide scanning, using tools like ZMap that are capable of quickly iterating over the entire IPv4 address space. Unfortunately, IPv6's vast address space poses an existential threat for Internet-wide scans and traditional network measurement techniques. To address this reality, efforts are underway to develop "hitlists" of known-active IPv6 addresses to reduce the search space for would-be scanners. As a result, there is an inexorable push for constructing as large and complete a hitlist as possible.

      This paper asks: what are the potential benefits and harms when IPv6 hitlists grow larger? To answer this question, we obtain the largest IPv6 active-address list to date: 7.9 billion addresses, 898 times larger than the current state-of-the-art hitlist. Although our list is not comprehensive, it is a significant step forward and provides a glimpse into the type of analyses possible with more complete hitlists.

      We compare our dataset to prior IPv6 hitlists and show both benefits and dangers. The benefits include improved insight into client devices (prior datasets consist primarily of routers), outage detection, IPv6 roll-out, previously unknown aliased networks, and address assignment strategies. The dangers, unfortunately, are severe: we expose widespread instances of addresses that permit user tracking and device geolocation, and a dearth of firewalls in home networks. We discuss ethics and security guidelines to ensure a safe path towards more complete hitlists.


  • Regional IP Anycast: Deployments, Performance, and Potentials

    Minyuan Zhou (Nanjing University), Xiao Zhang (Duke University), Shuai Hao (Old Dominion University), Xiaowei Yang (Duke University), Jiaqi Zheng, Guihai Chen, Wanchun Dou (Nanjing University)

    • Abstract: Recent studies show that an end system’s traffic may reach a faraway anycast site in a global IP anycast system, resulting in high latency. To address this issue, some private and public CDNs have implemented regional IP anycast. This approach involves dividing content-hosting sites into geographic regions, announcing a unique IP anycast prefix for each region, and utilizing DNS and IP-geolocation to direct clients to a CDN site in their corresponding geographic area. In this work, we aim to understand how a regional anycast CDN partitions its sites and maps its customers’ clients, and how a regional anycast CDN performs compared to its global anycast counterpart. We study the deployment strategies and the performance of two CDNs (Edgio and Imperva) that currently deploy regional IP anycast. We find that both Edgio and Imperva partition their sites and clients following continent or country borders. Furthermore, we compare the client latency distribution in Imperva’s regional anycast CDN with its similar-scale DNS global anycast network, while accounting for and mitigating the relevant deployment differences between the two networks. We find that regional anycast can effectively alleviate the pathology in global IP anycast where BGP routes a client’s traffic to a distant CDN site. However, DNS mapping inefficiencies, where DNS returns a sub-optimal regional IP anycast address that does not cover a client’s low-latency CDN sites, can harm regional anycast’s performance. Finally, using the Tangled testbed, we demonstrate the performance benefits of regional IP anycast using a latency-based region partition method. The results show that the 90th percentile client latency is reduced by 58.7% to 78.6% for clients in various geographic areas when compared to a global anycast configuration.


  • A Formal Framework for End-to-End DNS Resolution

    Artifacts Available

    Si Liu, Huayi Duan, Lukas Heimes, Marco Bearzi, Jodok Vieli, David Basin, Adrian Perrig (ETH Zurich)

    • Abstract: Despite the preeminent importance of DNS, numerous new attacks and vulnerabilities are regularly discovered. The root of the problem is the ambiguity and tremendous complexity of DNS protocol specifications, amid a rapidly evolving Internet infrastructure. To counteract the vicious break-and-fix cycle for improving DNS infrastructure, we instigate a foundational approach: we construct the first formal semantics of end-to-end name resolution, a collection of components for the formal analyses of both qualitative and quantitative properties, and an automated tool for discovering new DoS attacks. Our formal framework represents an important step towards a substantially more secure and reliable DNS infrastructure.


  • Beyond Limits: How to Disable Validators in Secure Networks

    Tomas Hlavacek, Philipp Jeitner (Fraunhofer SIT), Donika Mirdita (Fraunhofer SIT, TU Darmstadt), Haya Shulman (Goethe-University Frankfurt and Fraunhofer SIT), Michael Waidner (Fraunhofer SIT and TU Darmstadt)

    • Abstract: The relying party validator is a critical component of RPKI: it fetches and validates signed authorizations mapping prefixes to their owners. Routers use this information to block bogus BGP routes.

      Since the processing time of validators is not limited, malicious repositories could stall them. To limit the time that RPKI validators spend on downloading RPKI objects, thresholds were introduced into all popular implementations.

      We perform the first analysis of the thresholds. On the one hand, we show that the current thresholds are too permissive and hence do not prevent attacks. On the other hand, we show that even those permissive thresholds cause an 11.78% failure rate in validators. We find experimentally that although stricter thresholds would make attacks more difficult, they would significantly increase the failure rates. Our analysis shows that no matter what balance between permissive and strict thresholds is struck, one of the problems, either failures or exposure to attacks, will always persist.

      As a solution against attacks and failures, we develop a sort-and-limit algorithm for validators. We demonstrate through extensive evaluations on a simulated platform that our algorithm prevents the attacks and failures not only in the current but also in a full RPKI deployment.


  • 10:00am-10:15am     Break

  • 10:15am–11:15am     Technical Session 16: Caching and Provisioning

    Slack channel     Location: Lerner Hall

  • P4LRU: Towards An LRU Cache Entirely in Programmable Data Plane

    Artifacts Available

    Yikai Zhao, Wenrui Liu, Fenghao Dong, Tong Yang, Yuanpeng Li, Kaicheng Yang, Zirui Liu (Peking University), Zhengyi Jia, Yongqiang Yang (Huawei Cloud Computing Technologies Co Ltd)

    • Abstract: The data plane cache, a critical functionality found in numerous network devices, such as programmable switches, intelligent NICs, and DPUs, is often subject to limitations in its programmability and memory access capacity. As a result, the majority of existing data plane caches rely on simple and inefficient replacement policies. This paper introduces LRU, a near-optimal replacement policy, into the programmable data plane. We first explore the reasons why the traditional implementation of LRU is not suitable for deployment on the data plane. We then propose P4LRU, a pipeline-optimized version of the LRU implementation. Building on P4LRU, we conceive three distinct in-network systems – LruTable, LruIndex, and LruMon, and successfully bring them to life on Tofino switches. Our thorough experimental trials establish that P4LRU provides a significant performance boost over existing data plane caches in these three systems. We have open-sourced the source code for the three systems on GitHub.


  • Darwin: Flexible Learning-based CDN Caching

    Artifacts Available

    Jiayi Chen, Nihal Sharma, Tarannum Khan (The University of Texas at Austin), Shu Liu (UC Berkeley), Brian Chang, Aditya Akella, Sanjay Shakkottai (The University of Texas at Austin), Ramesh K. Sitaraman (UMass Amherst & Akamai Tech)

    • Abstract: Cache management is critical for Content Delivery Networks (CDNs), impacting their performance and operational costs. Most production CDNs apply static, hand-tuned caching policy parameters at cache servers, such as admission frequency or size thresholds for the Hot Object Caches (HOC) of their system. However, these static policies fall short when a server is faced with unpredictable traffic pattern changes, even when policies employ multiple control parameters/knobs. Recent approaches have proposed learning-based solutions to dynamically adjust policy parameters, but they are limited in action space, caching objectives, or impose high overhead. We propose Darwin, a CDN cache management system that is robust to traffic pattern changes and can flexibly optimize different caching objectives with unrestricted action spaces. Darwin employs a three-stage pipeline involving traffic pattern feature collection, unsupervised clustering for classification, and neural bandit expert selection to choose the optimal caching policy. Through extensive simulations, experiments using an Apache Traffic Server (ATS)-based prototype, and theoretical analysis, we show that Darwin achieves significant performance gain w.r.t. different objectives such as maximizing object hit rates and minimizing disk writes, while simultaneously adapting to traffic pattern shifts. Darwin imposes negligible overhead and achieves high throughput compared to the state-of-the-art.


  • Switchboard: Efficient Resource Management for Conferencing Services

    Rahul Bothra, Rohan Gandhi, Ranjita Bhagwan, Venkata N. Padmanabhan (Microsoft Research India), Rui Liang, Steve Carlson (Microsoft), Vinayaka Kamath, Sreangsu Acharyya (Microsoft Research India), Ken Sueda, Somesh Chaturmohta (Microsoft), Harsha Sharma (MIT)

    • Abstract: Resource management is important for conferencing services (such as Microsoft Teams and Zoom) to ensure good user experience while keeping costs low. Key to this is the efficient provisioning and assignment of media processing (MP) servers, which do the heavy lifting of mixing and redistributing the media streams from and to the call participants. We introduce Switchboard – a controller for efficient resource management for conferencing services. Switchboard is peak-aware, recognizing that cost depends on the peak resource usage and that there is a temporal shift in peak demand across time zones. This allows a server in a region to serve calls at peak time, and double up as backup for other regions during non-peak times. Furthermore, it improves efficiency by performing joint network and compute provisioning and application-aware provisioning. We evaluate Switchboard using 1+ year of records from Microsoft Teams. Switchboard achieves up to 51% lower provisioning cost while achieving similar or better latency than state-of-the-art baselines.


  • LEED: A Low-Power, Fast Persistent Key-Value Store on SmartNIC JBOFs

    Artifacts Available

    Zerui Guo (University of Wisconsin-Madison), Hua Zhang (Beihang University), Chenxingyu Zhao (University of Washington), Yuebin Bai (Beihang University), Michael Swift, Ming Liu (University of Wisconsin-Madison)

    • Abstract: The recent emergence of low-power high-throughput programmable storage platforms—SmartNIC JBOF (just-a-bunch-of-flash)—motivates us to rethink the cluster architecture and system stack for energy-efficient large-scale data-intensive workloads. Unlike conventional systems that use an array of server JBOFs or embedded storage nodes, the introduction of SmartNIC JBOFs has drastically changed the cluster compute, memory, and I/O configurations. Such an extremely imbalanced architecture makes prior system design philosophies and techniques either ineffective or invalid.

      This paper presents LEED, a distributed, replicated, and persistent key-value store over an array of SmartNIC JBOFs. Our key ideas to tackle the unique challenges induced by a SmartNIC JBOF are: trading excessive I/O bandwidth for scarce SmartNIC core computing cycles and memory capacity; making scheduling decisions as early as possible to streamline the request execution flow. LEED systematically revamps the software stack and proposes techniques across per-SSD, intra-JBOF, and inter-JBOF levels. Our prototyped system based on Broadcom Stingray outperforms existing solutions that use beefy server JBOFs and wimpy embedded storage nodes by 4.2×/3.8× and 17.5×/19.1× in terms of requests per Joule for 256B/1KB key-value objects.


  • 11:15am-11:30am     Break

  • 11:30am–12:30pm     Technical Session 17: Offloading

    Slack channel     Location: Lerner Hall

  • Unleashing SmartNIC Packet Processing Performance in P4

    Artifacts Available

    Jiarong Xing, Yiming Qiu (Rice University), Kuo-Feng Hsu (Meta), Songyuan Sui (Rice University), Khalid Manaa, Omer Shabtai, Yonatan Piasetzky, Matty Kadosh (Nvidia), Arvind Krishnamurthy (University of Washington), T. S. Eugene Ng, Ang Chen (Rice University)

    • Abstract: SmartNICs are on the rise as a packet processing platform, with the trend towards a uniform P4 programming model. However, unleashing SmartNIC packet processing performance in P4 is a formidable task. Traditional SmartNIC optimizations rely on low-level program tuning, but P4 abstractions operate at one level above. At the same time, today’s P4 optimizations primarily focus on resource packing rather than performance tuning. We develop Pipeleon, an automated performance optimization framework for P4 programmable SmartNICs. We introduce techniques that are tailored to the performance characteristics of SmartNICs, and further leverage dynamic workload patterns for profile-guided optimization. Pipeleon pinpoints program hotspots at the P4 level and computes runtime optimization plans to specialize the program layout based on the latest profile. We have prototyped Pipeleon and applied it to optimize two popular P4 SmartNICs—Nvidia BlueField2 and Netronome Agilio CX—as well as a software SmartNIC emulator extended based on BMv2. Our results show that Pipeleon significantly improves SmartNIC packet processing performance in realistic scenarios.


  • Memory Management in ActiveRMT: Towards Runtime-programmable Switches

    Artifacts Available

    Rajdeep Das, Alex C. Snoeren (UC San Diego)

    • Abstract: A wide variety of in-network services have been developed for RMT-based switching hardware, almost exclusively through the P4 language and ecosystem. Many of these applications maintain state in switch memory, a scarce shared resource. As with any other network resource, varying traffic demands necessitate reallocations, yet the P4 ecosystem is not well suited for dynamic resource management: Modifying the set of services deployed on a switch using P4 requires the network operator to prepare a new binary image and re-provision the switch, disrupting all existing traffic. We present an alternate approach—using techniques from capsule-based active networking—to programming RMT devices that enables non-disruptive (re)allocation of switch memory at time scales that are much faster than P4 compilation without operator intervention. We use P4 to implement a single, shared runtime on commodity RMT hardware that interprets instructions received via the switch data plane to deliver a variety of exemplar services including caching, load balancing, and network telemetry. Our prototype implementation is able to dynamically provision dozens-to-hundreds of instances of simultaneous stateful services at the timescale of seconds.


  • Cowbird: Freeing CPUs to Compute by Offloading the Disaggregation of Memory

    Artifacts Available

    Xinyi Chen, Liangcheng Yu, Vincent Liu (University of Pennsylvania), Qizhen Zhang (University of Toronto and Microsoft)

    • Abstract: Memory disaggregation allows applications running on compute servers to expand their pool of available memory capacity by leveraging remote resources through low-latency networks. Unfortunately, in existing software-level disaggregation frameworks, the simple act of issuing requests to remote memory—paid on every access—can consume many CPU cycles. This overhead represents a direct cost to disaggregation, not only on the throughput of remote memory access but also on application logic, which must contend with the framework’s CPU overheads. In this paper, we present Cowbird, a memory disaggregation architecture that frees compute servers to fulfill their stated purpose by removing disaggregation-related logic from their CPUs. Our experimental evaluation shows that Cowbird eliminates disaggregation overhead on compute-server CPUs and can improve end-to-end performance by up to 3.5x compared to RDMA-only communication.


  • Understanding the Micro-Behaviors of Hardware Offloaded Network Stacks with Lumina

    Zhuolong Yu (Johns Hopkins University), Bowen Su (Peking University), Wei Bai, Shachar Raindel (Microsoft), Vladimir Braverman (Rice University), Xin Jin (Peking University)

    • Abstract: Hardware offloaded network stacks are widely adopted in modern datacenters to meet the demand for high throughput, ultra-low latency and low CPU overhead. To fully leverage their exceptional performance, users need to have a deep understanding of their behaviors. Despite many efforts on testing software network stacks, hardware network stacks pose unique challenges for testing tools due to their kernel-bypass nature and high performance.

      In this paper, we present Lumina, a tool to test the correctness and performance of hardware network stacks. Lumina leverages network programmability to emulate various network scenarios at line rate. With user-friendly interfaces, Lumina enables developers to inject deterministic events, thus facilitating the development of precise and reproducible tests. Given the limited resources and flexibility of programmable network devices, we mirror all the packets to dedicated servers and dump them for offline analysis. We leverage Lumina to test four RDMA NICs from NVIDIA and Intel, and identify bugs that can significantly degrade performance or mislead network operations. Lumina also enables us to capture unexpected micro-behaviors which are missing or not clearly described in public documents and specifications. Vendors have confirmed the critical bugs we discovered and will include bug fixes in future releases.


  • End of the day