Abstract: This presentation will first introduce the Alibaba DC network, covering its architecture evolution over the last 10 years and the Alibaba HAIL DC network architecture. It will then discuss the new challenges facing the DC network in the next decade, including new scale, new workloads, and new application requirements. Finally, it will share a vision for building the DC network of the next 10 years: the predictable DC network.
Speaker Bio: Dennis Cai is the Chief Architect of Alibaba Cloud network infrastructure and the head of Alibaba's high-performance networking group. He is responsible for the architecture and future evolution of the Alibaba Cloud network, including the data center, metro, backbone, and edge networks. He is also driving the evolution of high-performance DC network architecture and technologies, such as RDMA and advanced network congestion control. Before joining Alibaba, Dennis was a Distinguished Engineer at Cisco.
Abstract: Recent trends, including the global 5G launch, microservices adoption, and the exponential growth of the remote workforce due to COVID-19, are moving computing power out of enterprise data centers and onto multiple clouds and edges. Consequently, the attack surface will grow dramatically in the years ahead. Traditional network access control implementations are usually fragmented, lack a single source of truth for ongoing verification, are hard to automate, and are very costly to maintain. Cisco has developed an end-to-end zero-trust network security framework, based on programmable infrastructure and machine learning techniques, to enforce security policies and to continually assure they are in place from the workforce, through the workplace, and ultimately to the workloads.
Speaker Bio: Philip Wong is the Principal Solution Architect of Cisco Greater China, leading Multi-Cloud & Zero Trust Security solution development in the region. Since joining Cisco in 2006, Philip has been an active practitioner and thought leader in SDN, IoT, and network security. Before joining Cisco, Philip had more than 15 years of experience delivering innovative banking solutions during his tenure with Microsoft and IBM HK. Philip is a vintage computer collector and has been interviewed by CNN, Phoenix, and other media outlets about his collection.
Abstract: Deep Neural Networks (DNNs) have made breakthroughs in many fields, including speech, computer vision, natural language understanding, and content recommendation. The DNN models behind those breakthroughs can have tens to hundreds of layers, with hundreds of millions of parameters or more. Training such deep models requires a huge amount of computational power, and the training process lasts days to weeks. In this talk, I will introduce BytePS for distributed DNN training acceleration. BytePS provides a unified and optimal communication framework that includes Parameter Server and all-reduce as two special cases. To achieve this proven optimality in practice, BytePS further uses several key technologies, including RoCEv2 acceleration and gradient partitioning and scheduling. BytePS supports major training frameworks including TensorFlow, PyTorch, and MXNet, and has been widely adopted at Bytedance and other companies for DNN training on up to several hundred GPUs.
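The unified-framework idea behind the talk, where both Parameter Server and all-reduce emerge from one gradient-partitioning primitive, can be illustrated with a toy sketch. This is a hypothetical simplification, not the actual BytePS API; the function name and partitioning scheme are invented for illustration:

```python
import numpy as np

def unified_sync(worker_grads, num_servers):
    """Toy sketch of a unified synchronization step: each worker's
    gradient is partitioned, partition s is summed by "summation
    service" s (push), and every worker then pulls the reduced
    partitions back. Co-locating one summation service with each
    worker degenerates to all-reduce; dedicating separate machines
    to the services gives the classic Parameter Server pattern."""
    parts = [np.array_split(g, num_servers) for g in worker_grads]
    # Each summation service s reduces its own partition across workers.
    reduced = [sum(p[s] for p in parts) for s in range(num_servers)]
    # Every worker pulls the reduced partitions and reassembles the gradient.
    return np.concatenate(reduced)

grads = [np.ones(8), 2 * np.ones(8), 3 * np.ones(8)]  # 3 workers
summed = unified_sync(grads, num_servers=2)           # elementwise 1+2+3
```

Partitioning also enables the scheduling mentioned in the abstract: partitions of later layers can be pushed while earlier layers are still in back-propagation.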
Speaker Bio: Chuanxiong Guo is a director of the Bytedance AI Lab. Before that, he was a Principal Researcher at Microsoft Research. He currently works on data center networking and machine learning systems. Several of the concepts he envisioned, including DCN virtualization, DCN monitoring, and ServerSwitch, have had both academic and industrial impact. Several of the systems he designed and implemented, including Pingmesh and RDMA/RoCEv2, have been widely adopted by the industry. He is interested in building, running, and understanding computer systems with high availability at scale.
Abstract: As more and more enterprise services are deployed on the cloud, we found that the dominant solution for cloud access scenarios, which leverages traditional commodity routers, is hard to scale and evolve at low cost to meet new challenges, including high-volume traffic, massive-scale routing tables/ACLs, and fine-grained traffic engineering. To address these challenges, we split the cloud router's functionality into several components: commodity switches for underlay network interconnection, server-based packet-processing clusters for the data plane, a server-based routing cluster for routing protocols, a hierarchical and efficient message distribution system for the control plane, and programmable switches for hardware acceleration. With this new design, each component can be scaled, maintained, upgraded, and patched independently as needed. We managed to provide Tencent Cloud with a highly scalable, elastic, and cost-efficient cloud routing service.
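The disaggregation described above can be sketched in a few lines. This is a hypothetical model of the design's key property, independent scaling per component, and does not reflect Tencent's actual implementation or naming:

```python
class Component:
    """One independently operated function of the disaggregated cloud router."""
    def __init__(self, name, replicas):
        self.name = name
        self.replicas = replicas

    def scale(self, replicas):
        # Scale this component without touching any of the others.
        self.replicas = replicas

# The five components named in the abstract (replica counts are made up).
router = {
    "underlay":  Component("commodity-switch interconnect", 4),
    "dataplane": Component("packet-processing server cluster", 16),
    "routing":   Component("routing-protocol server cluster", 3),
    "control":   Component("message-distribution system", 2),
    "accel":     Component("programmable-switch acceleration", 2),
}

# A traffic surge only requires growing the data plane; the routing
# and control components are left untouched.
router["dataplane"].scale(32)
```

The contrast with a monolithic commodity router is that scaling any one dimension there (table size, throughput, control-plane load) forces a whole-box upgrade.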
Speaker Bio: Allen Lv is the Chief Architect of Tencent Cloud. He is working on cloud networking and data center networking.
Abstract: Azure public cloud is one of the largest in the world, and networking is its backbone. This talk will focus on two areas of critical importance for us: reliability and performance. I will first describe cutting-edge network verification technology we use to ensure that Azure Network continues to perform reliably in the face of constant churn. I will then discuss deployment of Remote Direct Memory Access (RDMA) technology in Azure. Azure now has one of the largest deployments of RDMA in the world. I will discuss how it improves performance for our customers, and lowers costs for us. I will also touch upon various lessons learnt as we deployed RDMA on a scale never before attempted.
Speaker Bio: Jitendra Padhye is a Partner Development Lead at Microsoft Azure networking. He is interested in all aspects of computer networking and networked systems. His recent work has focused on data center networks and mobile computing. He has published numerous research papers in top conferences, and holds over 25 US patents. He is the recipient of the ACM SIGCOMM’s Test of Time award. He received his PhD in Computer Science from University of Massachusetts Amherst in 2000.
Abstract: High-performance computing and artificial intelligence are the essential tools fueling the advancement of science. NVIDIA networking technologies are the engine of the modern HPC data center. Mellanox HDR InfiniBand enables extremely low latencies and high data throughput, and includes high-value features such as smart In-Network Computing acceleration engines via Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) technology, high network resiliency through SHIELD's self-healing network capabilities, MPI offloads, enhanced congestion control, and adaptive routing. These capabilities deliver leading performance and scalability for compute- and data-intensive applications, along with a dramatic boost in throughput and cost savings, paving the way to scientific discovery.
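The core idea of SHARP-style in-network aggregation is that switches along the fabric reduce partial results as data flows upward, instead of shipping every host's full vector to one root. A toy model of that tree reduction, assuming a fixed-fanout switch tree (not the actual protocol or its wire format):

```python
import numpy as np

def switch_aggregate(children):
    """Toy model of one switch's aggregation engine: it forwards the
    elementwise sum of its children's vectors rather than the vectors
    themselves, cutting upstream traffic by the fanout factor."""
    return sum(children)

def tree_reduce(host_vectors, fanout=2):
    """Reduce host vectors up a switch tree, fanout children per switch.
    In practice the root's result is then broadcast back down the tree."""
    level = list(host_vectors)
    while len(level) > 1:
        level = [switch_aggregate(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

# Four hosts contributing vectors [1,1,1,1] .. [4,4,4,4].
result = tree_reduce([np.full(4, float(i)) for i in range(1, 5)])
```

This is why in-network reduction helps MPI collectives: the reduction latency grows with tree depth (logarithmic in host count) rather than with the number of hosts.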
Speaker Bio: Gilad Shainer serves as senior vice president of marketing for Mellanox networking at NVIDIA, focusing on high-performance computing, artificial intelligence, and InfiniBand technology. Mr. Shainer joined Mellanox in 2001 as a design engineer and has served in senior marketing management roles since 2005. He serves as the chairman of the HPC-AI Advisory Council, the president of the UCF and CCIX consortia, a member of the IBTA, and a contributor to the PCI-SIG PCI-X and PCIe specifications. Mr. Shainer holds multiple patents in the field of high-speed networking. He is a recipient of the 2015 R&D 100 award for his contribution to the CORE-Direct In-Network Computing technology and the 2019 R&D 100 award for his contribution to the Unified Communication X (UCX) technology. Gilad Shainer holds an MSc degree and a BSc degree in Electrical Engineering from the Technion Institute of Technology in Israel.