5th Asia-Pacific Workshop on Networking (APNet 2021)

June 24-25, 2021, Shenzhen, China (mixed-mode, online)

APNet SIGCOMM/NSDI Talks

Junchen JIANG

Assistant Professor at The University of Chicago

Talk Title:

Perception-Driven Optimization: A New Frontier for Scaling Internet Applications

Abstract: A computational approach to human perception has sparked breakthroughs in AI, human-computer interaction, and social cognition. In this talk, I will introduce attention-driven resource allocation, a transformative paradigm that improves perceived quality for human users (e.g., web services, VR streaming) as well as for emerging AI applications (e.g., real-time video analytics). In contrast to today's network systems, which optimize handcrafted performance metrics, we advocate that network systems should align resource allocation with user attention. The insight is that video viewers and web users rarely perceive application quality by consciously measuring metrics such as video rebuffering time or web page load time; instead, they react to the visual process of video rendering or page loading. Similarly, computer vision applications, though highly data- and compute-intensive, do not require high-quality data everywhere or all the time. This talk will use a few case studies (SIGCOMM'19, SIGCOMM'20, NSDI'21) to present early attempts at, and the promise of, attention-driven systems.

Speaker Bio: Junchen Jiang is an Assistant Professor of Computer Science at the University of Chicago. He received his PhD degree from Carnegie Mellon University in 2017 and his bachelor's degree from Tsinghua University in 2011. His research interests are networked systems and their intersections with machine learning. He is a recipient of a Google Faculty Research Award in 2019, the Best Paper Award at the Symposium on Edge Computing in 2020, and the CMU Computer Science Doctoral Dissertation Award in 2017.

Xin JIN

Associate Professor at Peking University

Talk Title:

Software-Defined Self-Driving Cloud Networks

Abstract: The emergence of programmable networks and the breakthroughs in artificial intelligence make it possible to build software-defined self-driving cloud networks. In this talk, I will present two recent projects towards this vision. In the first part of the talk, I will present Admission-In First-Out (AIFO) queues, a new solution for programmable packet scheduling that uses only a single first-in first-out queue. AIFO is motivated by the confluence of two recent trends: shallow buffers in switches and fast-converging congestion control in end hosts. The core idea of AIFO is to maintain a sliding window that tracks the ranks of recent packets and to compute the relative rank of an arriving packet in the window for admission control. We demonstrate that AIFO closely approximates PIFO, runs at line rate on existing hardware, and uses minimal switch resources, as few as a single queue. In the second part of the talk, I will present NeuroPlan, a deep reinforcement learning (RL) approach to solving the network planning problem. We use a graph neural network (GNN) and a novel domain-specific node-link transformation for state encoding in order to handle the dynamic nature of the evolving network topology during planning decision making. We leverage a two-stage hybrid approach that first uses deep RL to prune the search space and then uses an ILP solver to find the optimal solution. We demonstrate that NeuroPlan scales to large topologies beyond the capability of ILP solvers and reduces cost compared to hand-tuned heuristics.
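
The admission idea described in the abstract can be illustrated with a small sketch. This is only an illustration under stated assumptions, not the authors' implementation: the class name AdmissionQueue, the burst_fraction parameter, and the exact threshold rule (admit when the queue is nearly empty, or when the packet's relative rank in the window is small compared with the remaining queue headroom) are hypothetical choices made for this example. Lower rank means higher priority.

```python
from collections import deque


class AdmissionQueue:
    """Illustrative single-FIFO queue with rank-based admission control.

    All names and the threshold rule are assumptions made for this sketch.
    """

    def __init__(self, capacity, window_size, burst_fraction=0.1):
        if not 0 <= burst_fraction < 1:
            raise ValueError("burst_fraction must be in [0, 1)")
        self.capacity = capacity                   # maximum packets in the FIFO
        self.k = burst_fraction                    # occupancy share admitted unconditionally
        self.window = deque(maxlen=window_size)    # ranks of recently seen packets
        self.fifo = deque()                        # the single first-in first-out queue

    def relative_rank(self, rank):
        """Fraction of recent packets whose rank is strictly smaller."""
        if not self.window:
            return 0.0
        smaller = sum(1 for r in self.window if r < rank)
        return smaller / len(self.window)

    def enqueue(self, rank, payload=None):
        """Return True if the packet is admitted, False if it is dropped."""
        quantile = self.relative_rank(rank)
        self.window.append(rank)                   # every arrival updates the window
        occupied = len(self.fifo)
        if occupied >= self.capacity:
            return False                           # no room left at all
        headroom = (self.capacity - occupied) / self.capacity
        admit = (occupied <= self.k * self.capacity
                 or quantile <= headroom / (1 - self.k))
        if admit:
            self.fifo.append((rank, payload))
        return admit

    def dequeue(self):
        """Packets leave in arrival order; there is no sorting."""
        return self.fifo.popleft() if self.fifo else None
```

As the queue fills, the headroom shrinks and only packets whose rank is low relative to recent traffic are admitted, while departures remain strictly first-in first-out.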

Speaker Bio: Xin Jin is an Associate Professor in the Department of Computer Science and Technology at Peking University. His research is in computer networks and computer systems, with a focus on software-defined datacenters, programmable networks, and cloud computing. He received his BS in computer science and BA in economics from Peking University in 2011, and his MA and PhD in computer science from Princeton University in 2013 and 2016. He has received many awards and honors, including USENIX FAST Best Paper Award (2019), USENIX NSDI Best Paper Award (2018), Amazon AWS Machine Learning Research Award (2019), Google Faculty Research Award (2019), and Facebook Communications & Networking Research Award (2018).

Bojie LI

Senior Engineer at Huawei

Talk Title:

When In-Network Processing Meets Distributed Systems

Abstract: Recent years have witnessed a trend of co-designing distributed systems with programmable network switches and NICs. This talk will first briefly revisit the literature. Next, it will introduce our SIGCOMM '21 work 1Pipe, a novel communication abstraction that enables different receivers to process messages from senders in a consistent total order. 1Pipe utilizes programmable network switches to provide an unreliable ordered service, as well as a reliable ordered service that guarantees delivery and provides restricted atomic delivery for each group of messages. 1Pipe can simplify and accelerate many distributed applications, achieving linearly scalable throughput and low latency in transactional key-value stores, distributed transactions, remote data structures, and replication, outperforming traditional designs by 2x to 20x. Finally, we will propose two open questions. First, in-network processing increases the heterogeneity of data access methods, including coherent LD/ST, one-sided RDMA, switch offloading, SmartNIC offloading, and RPC. Multi-tenancy further complicates the problem. Second, there is a tension between distributed applications and OS mechanisms such as virtual memory and event notification, where SmartNICs may help.

Speaker Bio: Bojie Li is a Senior Engineer with the Computer Network and Protocol Lab at Huawei 2012 Labs, currently working on next-generation data center interconnects. In 2019, Dr. Li received his Ph.D. in Computer Science from the University of Science and Technology of China (USTC) and Microsoft Research Asia (MSRA). He has published papers at SIGCOMM, SOSP, NSDI, ATC, and PLDI, and has received the 2019 ACM China Doctoral Dissertation Award.

Fengyuan REN

Professor at Tsinghua University

Talk Title:

Congestion Detection in Lossless Networks

Abstract: Congestion detection is the cornerstone of end-to-end congestion control. Through in-depth observation and analysis, we reveal that existing congestion detection mechanisms in mainstream lossless networks (i.e., Converged Enhanced Ethernet and InfiniBand) are inadequate because they fail to account for the interaction between hop-by-hop flow control and congestion detection behavior in switches. Specifically, the ON-OFF sending pattern can have unexpected effects on congestion detection, including causing queue buildup and affecting the real input rate of pausing ports. We define ternary states of switch ports (congestion, non-congestion, and undetermined) and present Ternary Congestion Detection (TCD) for mainstream lossless networks. Testbed experiments and extensive simulations demonstrate that TCD can detect congestion ports accurately and identify flows contributing to congestion as well as flows only affected by hop-by-hop flow control. Meanwhile, we shed light on how to combine TCD with rate control. Case studies show that existing congestion control algorithms can achieve 3.3X and 2.0X better median and 99th-percentile FCT slowdown, respectively, when combined with TCD.

Speaker Bio: This presentation was delivered by Fengyuan REN's student Yiran Zhang on his behalf. Yiran Zhang is a fourth-year PhD student at Tsinghua University, advised by Prof. Fengyuan Ren. Before that, she received her B.S. degree in computer science from BUPT in 2017. Her current research interests include datacenter networking and congestion control.

Xiaoliang WANG

Associate Professor at Nanjing University

Talk Title:

Automatic ECN Tuning for High-Speed Datacenter Networks

Abstract: We report the design and deployment of an automatic run-time optimization scheme for operating datacenter networks. For widely employed congestion control schemes, ECN is the key to delivering high bandwidth and low latency. However, due to traffic dynamics, it is difficult to determine the marking threshold in high-speed production networks and to achieve this goal with conventional static ECN. To meet this operational challenge, we introduce ACC, an automatic in-network ECN tuning approach. ACC leverages multi-agent reinforcement learning to dynamically adjust the marking threshold at each switch. By addressing practical challenges, such as achieving scalability through distributed DRL agents and basing deployment on the commonly supported features of commodity switching chips, ACC is able to maintain short in-network queues for diverse traffic. Both testbed experiments and large-scale simulations have shown that ACC improves the flow completion time (FCT) of both mice flows and elephant flows at line rate. In heterogeneous production environments with 300 machines, compared with a well-tuned static ECN setting, ACC achieves up to 20% improvement in IOPS for storage services and 30% lower FCT. Moreover, ACC simplifies network operations.

Speaker Bio: Xiaoliang Wang is currently an associate professor in the Department of Computer Science and Technology at Nanjing University. Dr. Wang received his Ph.D. from Tohoku University, Japan. From 2014 to 2015, he was a visiting researcher at Microsoft Research Asia (MSRA). From 2019 to 2020, he worked as a consultant in the Network Platform Department at Tencent. He has published papers at SIGCOMM, NSDI, FAST, and ATC, and has received the 2017 APNet Best Paper Award.

Wenfei WU

Assistant Professor at Tsinghua University

Talk Title:

ATP: In-Network Aggregation for Multi-Tenant Training

Abstract: In deep neural networks (DNNs), model and dataset sizes are increasing, and DNN training tends to be implemented in a distributed architecture. The parameter server (PS)-worker architecture for DNN systems suffers from the traffic incast problem, where many workers exchange traffic with the PS, making the PS the bottleneck. Inspired by recent progress in programmable switches, we propose the Aggregation Transmission Protocol (ATP), which supports multi-tenant and multi-rack in-network aggregation for DNN training. ATP consists of a networking stack on end hosts and an aggregation service on switches. The switch allocates its computation resources to jobs in a decentralized manner. The end-host networking stack has a fallback that covers the corner cases the switch cannot handle (e.g., overflow, packet loss), and congestion control for sharing network resources. Finally, we made a number of engineering optimizations so that ATP saturates a high-bandwidth network (100 Gbps). We wrap ATP as a primitive in the transport layer, integrate it with ML systems, and show that ATP can provide both performance gains and correctness for typical DNN training (e.g., AlexNet, VGG, ResNet).

Speaker Bio: Wenfei Wu is an assistant professor at the Institute for Interdisciplinary Information Sciences (IIIS) at Tsinghua University. Dr. Wu received his Ph.D. from the University of Wisconsin-Madison in 2015, and his research interest is in networked systems. Dr. Wu has published 33 papers in top-tier conferences and transactions. His Ph.D. work on virtual network diagnosis was awarded the best student paper at SoCC'13. His work on 5G transport layer design was awarded the best paper runner-up at IPCCC'19. Dr. Wu's recent work builds high-performance infrastructure for machine learning systems. This work was accepted at NSDI'21 and will be released to the academic community.

Tong YANG

Associate Professor at Peking University

Talk Title:

Lightweight and Elastic Network Measurement Algorithms and Systems

Abstract: Network traffic measurement is central to successful network operations, especially for today’s hyper-scale networks. Existing work has made great contributions, and the most promising approach is to use sketches. Recent sketch work aims to meet the following four criteria: 1) full visibility, which refers to the ability to acquire any desired per-hop flow-level information for all flows; 2) low overhead in terms of computation, memory, and bandwidth; 3) elastic measurement, meaning the measurement should adapt to traffic characteristics including available bandwidth, packet rate, and flow size distribution; 4) robustness, meaning the system can survive partial network failures. With regard to these criteria, we present typical sketch solutions, including the most classic sketches and recent sketch work from the SIGCOMM, NSDI, SIGMOD, and SIGKDD conferences.
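
For readers unfamiliar with sketches, the following minimal example shows a Count-Min sketch, one well-known classic sketch for estimating per-flow counts in small, fixed memory. The abstract does not name a specific sketch, so this choice is an assumption made for illustration; the class name, the default width and depth, and the use of salted BLAKE2b hashes are likewise hypothetical choices for this example, not a description of the speaker's systems.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch: depth rows of width counters each."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, flow_id):
        # One independent hash per row, obtained by salting with the row number.
        digest = hashlib.blake2b(
            flow_id.encode(),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, flow_id, count=1):
        """Record count packets (or bytes) for the given flow."""
        for row in range(self.depth):
            self.rows[row][self._index(row, flow_id)] += count

    def estimate(self, flow_id):
        """Return an estimate that never undercounts the true value."""
        return min(
            self.rows[row][self._index(row, flow_id)]
            for row in range(self.depth)
        )
```

Memory use is fixed at width times depth counters regardless of the number of flows; hash collisions can only inflate an estimate, which is why the minimum across rows is reported.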

Speaker Bio: Tong Yang received his PhD degree in Computer Science from Tsinghua University in 2013. He visited the Institute of Computing Technology, Chinese Academy of Sciences (CAS), China, from July 2013 to July 2014. He is now an associate professor in the Department of Computer Science and Technology, Peking University. His research interests focus on networking algorithms, such as sketches, IP lookups, and Bloom filters. He has published a dozen papers in SIGCOMM, NSDI, SIGKDD, and SIGMOD.