Industry Spotlight Talks
Researcher at Google NetInfra
Sundial: Fault-tolerant Clock Synchronization for Datacenters
Abstract: Clock synchronization is critical for many datacenter applications such as distributed transactional databases, consistent snapshots, and network telemetry. As applications demand higher performance and datacenter networks move toward ultra-low latency, we need a submicrosecond-level bound on time uncertainty to reduce transaction delay and enable new network management applications (e.g., measuring one-way delay for congestion control). State-of-the-art clock synchronization solutions focus on improving clock precision but may incur a significant time-uncertainty bound in the presence of failures. This significantly affects applications because in large-scale datacenters, temperature-related, link, device, and domain failures are common. We present Sundial, a fault-tolerant clock-synchronization system for datacenters that achieves a ~100ns time-uncertainty bound under various types of failures. Sundial provides fast failure detection based on frequent synchronization messages in hardware. Sundial enables fast failure recovery using a novel graph-based algorithm to precompute a backup plan that is generic to failures. Through experiments in a >500-machine testbed and large-scale simulations, we show that Sundial can achieve a ~100ns time-uncertainty bound under different types of failures, which is more than two orders of magnitude lower than state-of-the-art solutions. We also demonstrate the benefit of Sundial on applications such as distributed databases and congestion control.
Speaker Bio: Yuliang Li works at Google NetInfra, focusing on high-performance network transport and clock synchronization. His PhD dissertation won the SIGCOMM Dissertation Award and the Harvard CS Outstanding PhD Dissertation Award. Prior to Harvard, he received his Bachelor's degree from Tsinghua University.
Staff engineer and R&D manager at Alibaba
GSO-Simulcast: Global Stream Orchestration in Dingtalk’s Video Conferencing
Abstract: In recent years, video conferencing, along with the commercial success of many services like Zoom, Microsoft Teams, and Dingtalk, has emerged as the beating heart of remote collaboration, learning, and personal interaction for billions of people around the globe. In this talk, I will present GSO-Simulcast, the new simulcast architecture that is currently running in Dingtalk, serving more than 500 million users. GSO-Simulcast marks a fundamental shift from today’s Simulcast architecture, whose video adaptations are decided based on a fragmented network view using empirical stream policies. Instead, GSO-Simulcast globally orchestrates the publishing, subscribing, resolution, and bitrate of video streams for each participant using an automated centralized controller that is aware of all network constraints in a meeting. Since its launch in 2021, GSO has shown significant improvement in all core video conferencing metrics, with more than a 35% reduction in video stalls, a 50% reduction in voice stalls, a 6% improvement in video framerate, and most importantly, a 7% improvement in the user satisfaction score. I will describe the principle, deployment, and lessons learned.
Speaker Bio: Yunfei Ma is a staff engineer and R&D manager at Alibaba, where his current research focuses on video conferencing, video transport, and QUIC-based Cloud gateway. Before joining Alibaba, he was a postdoctoral researcher at MIT Media Lab. He received his Ph.D. in ECE from Cornell University and his B.S. from USTC. His research has been deployed in Alibaba’s core services, such as Taobao and Dingtalk, and has been turned into several products at AliCloud. He has published more than 10 papers at SIGCOMM, MOBICOM, and NSDI, and he holds more than 15 granted patents. His work has also been covered by a number of media outlets including BBC, The Verge, MIT Technology Review, and IEEE Spectrum. He served on the TPC of ACM CoNEXT 2018, IEEE INFOCOM 2020/2019/2018, and IEEE Globecom 2021.
Researcher at Tencent
Network Telemetry in Optical Backbones
Abstract: Degradation or failure events in optical backbone networks affect the service level agreements for cloud services. It is critical to detect and troubleshoot these events promptly to minimize their impact. Existing systems rely on arcane tools (e.g., SNMP) and vendor-specific controllers to collect optical data, which limits both the flexibility and scale of these systems. As a result, they fail to collect the required data in time to detect and troubleshoot degradation or failure events. In this talk, I will present OpTel, an optical telemetry system that uses a centralized vendor-agnostic controller to collect optical data in a streaming fashion. As a result, OpTel enables the collection of fine-grained optical telemetry data at one-second granularity. Running in Tencent for about two years, OpTel has helped detect 2× more optical events, including a large number of short-lived (i.e., ephemeral) events. OpTel also enables troubleshooting of these optical events in a few seconds, which is orders of magnitude faster than the state of the art. By sharing our years of operational experience, we hope to inspire more research in the optical network area.
Speaker Bio: Congcong Miao is currently a researcher at Tencent. Before joining Tencent, he received his Ph.D. in Computer Science and Technology from Tsinghua University. He works on both underlay and overlay networks, and his research interests include Optical Networks, Network Management, Network Verification, and SDN/NFV. He has published several papers at top CS conferences, including NSDI, MOBICOM, and CHI.
Director of Data Communication Research Dept, Huawei
Technology Trends in Data Communication: Huawei's View
Abstract: This talk reviews the history of data communication and looks ahead to its future, highlighting the top 10 technology trends.
Speaker Bio: Director of Data Communication Research Dept; Director of Data Communication Technology Planning Dept. He was the Chinese President of Tel Aviv Research Center, and has 16+ years of experience in data communication.
Senior Engineering Manager at Meta
Scalable, Flexible and Agile Network Planning
Abstract: Network planning facilitates a healthy and sustainable network. However, its practice is not well understood outside the network engineering community. In this talk, I will present Meta’s network planning software suite, ranging from risk-driven network assessment to hose-based network planning and ultimately to service entitlement. I will present our intent-driven, top-down network planning automation that facilitates network reliability, performance, and long-term evolution. By sharing our operational experiences of running a large-scale network for years, as well as our experiences in supporting a diverse set of networks from backbone, to PoP, to data center, we hope to inspire new research in the area of more consolidated service and network joint optimization.
Speaker Bio: Ying Zhang is a Senior Engineering Manager at Meta. She works on large-scale network management problems. Her team owns software for network modeling, design, configuration, and verification for Meta datacenter, backbone, and edge networks. Her research interests are in network management, Network for AI, Software-Defined Networks, Network Function Virtualization, Internet routing, and network security. She has 80+ granted US/International patents and 100+ peer-reviewed publications with 4000+ citations, and she was named by Swedish media as one of the 10 Brightest Researchers in Mobile Networks. She was also recognized as a Rising Star in the Networking and Communications area.
Researcher at Microsoft (Azure for Operators); IEEE Fellow
Cloudification of Telecom Network Infrastructure
Abstract: A mobile cellular network is fundamentally a real-time distributed computing problem. In one direction, the input is the wireless signals of our mobile devices picked up by one or more cell towers. The output is the IP packets destined to the mobile app backends on the cloud. Inside this black box is a massive amount of complex computation and data transactions that accomplish what a traditional 4G or 5G network would do in radio access and packet core. The scale is enormous: tens of millions of cell towers spread across the world, each potentially producing as much as hundreds of Gbps. This computation and processing generally takes place across distances and has real-time constraints (sometimes sub-millisecond). This is a tremendous real-time distributed system problem. With recent advances in computing and networking technologies, we have an opportunity to rethink and reconstruct the foundational infrastructure of our cellular networks. At Microsoft, we have formed a new product group called Azure for Operators to build such infrastructure for mobile operators. We are building distributed carrier-grade cloud services that extend from cloud regions to near edges, far edges, and cell sites. Working with traditional telecom solution providers, we are disaggregating and reimagining cellular networks as cloud-native functions. For example, the 5G Core network is now a set of cloud services on Azure regions or edges, and the radio access network (RAN) is being built on top of Azure distributed services. Further, we are adding cloud data analytics, AI/ML, and programmability into the RAN that will transform how operators operate their networks. Working with early adopters like AT&T, we are laying the groundwork for the world’s future mobile networks. We look forward to working with the networking research community to explore and reshape this new paradigm of cellular network building.
Speaker Bio: Yongguang Zhang is a researcher at the Office of CTO, Azure for Operators at Microsoft. Previously, he led the Microsoft Research Special Project “Cloudification of Telecom Network Infrastructure” that led to Microsoft entering this business. Earlier, he was a Research Manager for the Wireless and Networking group at Microsoft Research Asia, where he and his team built up one of the best networking research groups in the world. Before that, he was with HRL Labs leading several DARPA-funded networking and wireless research projects. He received his Ph.D. in computer science from Purdue University in 1994. He is a recipient of the USENIX Test of Time Award at NSDI’19, a General Co-Chair for ACM MobiCom’09, and an IEEE Fellow.