Session 2: Datacenter Network Design (Chair: Stefan Saroiu, Microsoft Research)
Scribe: alexandretian@gmail.com

==== PortLand: A Scalable Fault-Tolerant Layer 2 Data Center Network Fabric
Radhika Niranjan Mysore, Andreas Pamboris, Nathan Farrington, Nelson Huang, Pardis Miri, Sivasankar Radhakrishnan, Vikram Subramanya, Amin Vahdat (University of California San Diego)
-----------------------
Future data centers will contain millions of virtual end points, and data center network architects will expect plug-and-play deployment for switches.

Problem: Current network protocols impose significant management overhead at the scale of millions of end points; a unified, plug-and-play, large-scale network fabric is not yet achievable.

Key insight: Data center networks are often physically interconnected as a multi-rooted, multi-level tree.

Concern: How does the story change for server-centric topologies such as DCell and BCube?

Solution: Assign internal Pseudo MAC (PMAC) addresses to all end hosts to encode their position in the topology, and build a set of Ethernet-compatible routing, forwarding, and address resolution protocols on top of them.

Q: Did you just mention unmodified switches and end hosts? A: In fact, we did modify the switch software.
Q: If DHCP is deployed, can the two dynamic allocation mechanisms co-exist? A: PortLand is effectively DHCP at layer 2; they can co-exist.
Q: You already have the Fabric Manager; why do you still need Location Discovery Messages? A: The Fabric Manager maintains only soft state for the switches. We do not want a heavyweight, centralized Fabric Manager mechanism.
Q: Is multi-path routing supported in PortLand? A: I think so. If it is based on hashing MAC addresses, we can do something similar in PortLand.
Q: Have you implemented the LDM? How is the performance? A: Discovery currently takes about 1 second. We think this is an implementation issue, and we are trying to improve it.
Q: Do you need an ordered switch boot-up? A: No, switches can be booted in any order.
Q: What you have presented is closely coupled to the fat tree; do you think that is a limitation? A: I have emphasized from the beginning that PortLand applies to any multi-rooted, multi-level tree.

==== VL2: A Scalable and Flexible Data Center Network
Albert Greenberg (Microsoft Research), Navendu Jain (Microsoft Research), Srikanth Kandula (Microsoft Research), Changhoon Kim (Princeton), Parantap Lahiri (Microsoft Research), David A. Maltz (Microsoft Research), Parveen Patel (Microsoft Research), Sudipta Sengupta (Microsoft Research)
-----------------------
Data centers that hold tens to hundreds of thousands of servers must achieve agility: the ability to assign any server to any service.

Problem: Existing architectures do not provide enough bisection bandwidth between servers; services are not isolated from one another's impact; and fragmentation of the address space limits the migration of virtual machines.

Solution: Give each service the illusion that all the servers assigned to it are connected by a single, non-interfering Virtual Layer 2 Ethernet switch. VL2 is built from low-cost switches arranged in a Clos topology, and Valiant Load Balancing spreads traffic across all available paths; VL2 separates a host's identity from its topological location by using Location Addresses (LAs) and Application-specific Addresses (AAs).
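To make the LA/AA split and Valiant Load Balancing described above a bit more concrete, here is a minimal Python sketch; the addresses, directory contents, and flow-hashing scheme are invented for illustration and this is not the authors' agent or directory implementation.

```python
import hashlib

# Hypothetical directory contents: maps an application-specific address (AA)
# to the location address (LA) of the ToR switch currently hosting it.
DIRECTORY = {
    "10.0.0.5": "20.1.1.1",   # example values only
    "10.0.0.6": "20.1.2.1",
}

# Example pool of intermediate (core) switch LAs that VLB can bounce through.
INTERMEDIATE_LAS = ["20.9.0.1", "20.9.0.2", "20.9.0.3", "20.9.0.4"]

def resolve_aa(aa: str) -> str:
    """Look up the LA that currently hosts this AA (a real agent would query
    the directory system; here it is just a dict)."""
    return DIRECTORY[aa]

def pick_intermediate(flow_five_tuple: tuple) -> str:
    """Valiant Load Balancing: hash the flow so every packet of the same flow
    bounces off the same randomly chosen intermediate switch."""
    digest = hashlib.md5(repr(flow_five_tuple).encode()).hexdigest()
    return INTERMEDIATE_LAS[int(digest, 16) % len(INTERMEDIATE_LAS)]

def encapsulate(dst_aa: str, flow: tuple) -> list:
    """Return the outer LA header stack: an intermediate switch, then the
    destination's ToR, which decapsulates and delivers to the AA."""
    return [pick_intermediate(flow), resolve_aa(dst_aa), dst_aa]

if __name__ == "__main__":
    flow = ("10.0.0.5", "10.0.0.6", 12345, 80, "tcp")
    print(encapsulate("10.0.0.6", flow))
```

Hashing the five-tuple keeps all packets of a flow on the same intermediate switch, avoiding reordering while still spreading different flows across all available paths.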
Q: What is the cost of the VL2 implementation? A: We are using an SOC-based solution for VL2.
Q: What about middleboxes, QoS constraints, and load balancing in the data center? Have you considered their impact? A: We are working in those directions.
Q: Is the topology flexible? Bandwidth over-subscription may be reasonable in a data center. A: You can also do over-subscription in VL2.
Q: Why are there only two uplinks from the ToR to the aggregation switches? A: We use low-end ToR switches that have only two 10G uplinks.
Q: Suppose MapReduce and Dryad are used; do you consider their data-locality-aware task scheduling? A: Why do they need locality? Because the network is the bottleneck. VL2 is meant to remove that bottleneck and provide enough bisection bandwidth.

==== BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers
Chuanxiong Guo (Microsoft Research Asia), Guohan Lu (Microsoft Research Asia), Dan Li (Microsoft Research Asia), Haitao Wu (Microsoft Research Asia), Xuan Zhang (Tsinghua University), Yunfeng Shi (Peking University), Chen Tian (Huazhong University of Science and Technology), Yongguang Zhang (Microsoft Research Asia), Songwu Lu (UCLA)
-----------------------
The shipping-container-based modular data center (MDC) presents new research opportunities for data center networking.

Problem: The new fabric should better support bandwidth-intensive applications, be built from low-end commodity switches, and degrade gracefully in performance under failures.

Solution: BCube, closely related to the generalized Hypercube, is a high-performance and robust network architecture for an MDC. Both oblivious routing and source routing are provided in the protocol suite (a small addressing and routing sketch follows the Q&A below).

Q: You use servers to forward packets; is any latency added? A: We have two implementations. The forwarding card is based on NetFPGA, and hardware forwarding introduces no extra latency. The software implementation does introduce latency, on the order of tens of microseconds.
Q: BCube is selling the idea of involving servers in packet forwarding. What about energy saving? Doesn't that mean you cannot shut down idle servers to save power? A: We are actually trying to sell the idea of putting the routing intelligence in the servers. For power saving, we can put the servers into sleep mode and keep the NICs on.
Q: The BCube topology is very similar to a hypercube? A: We state in the paper that BCube is a special case of the generalized Hypercube, but there are no switches in the generalized Hypercube; we use switches to reduce the number of cables.
Q: You use BCube to provide high bandwidth inside a container; do you plan to extend the work to provide high bandwidth between containers? A: It is hard to provide the same high bisection bandwidth between containers; the design will have to be a trade-off.
Q: This question is not specific to BCube but to all the papers in this session: I think the problems you mention have been addressed in the HPC field for a long time; what is the difference? A: There are two differences between the new data center networking and HPC. First, all three papers try to build the network from commodity switches. Second is the scale: we want to interconnect millions of individual servers and provide high bisection bandwidth among them, whereas HPC connects only thousands of CPUs.
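As promised after the BCube summary above, here is a minimal Python sketch of BCube-style server addressing and single-path routing that corrects one address digit per hop; the function names, the digit-correction order, and the BCube(4,1) example are illustrative choices, not the paper's BSR protocol.

```python
from typing import List, Tuple

# In BCube(n, k), a server address is a (k+1)-digit array with each digit in
# [0, n). Two servers are one hop apart, relayed by a level-i switch, exactly
# when their addresses differ only in digit i.

def bcube_neighbors(addr: Tuple[int, ...], n: int) -> List[Tuple[int, ...]]:
    """All one-hop neighbors of a server: change exactly one digit."""
    nbrs = []
    for level, digit in enumerate(addr):
        for d in range(n):
            if d != digit:
                nbr = list(addr)
                nbr[level] = d
                nbrs.append(tuple(nbr))
    return nbrs

def single_path_route(src: Tuple[int, ...], dst: Tuple[int, ...]) -> List[Tuple[int, ...]]:
    """Hypercube-style routing: fix the differing digits one per hop.
    The path length equals the Hamming distance between src and dst."""
    path = [src]
    cur = list(src)
    for level in range(len(src)):          # correction order is arbitrary here
        if cur[level] != dst[level]:
            cur[level] = dst[level]        # relayed by a switch at this level
            path.append(tuple(cur))
    return path

if __name__ == "__main__":
    # BCube(4, 1): 2-digit addresses, digits in 0..3, 16 servers.
    print(bcube_neighbors((0, 1), 4))          # 6 one-hop neighbors
    print(single_path_route((0, 1), (3, 2)))   # [(0, 1), (3, 1), (3, 2)]
```

Correcting the digits in different orders yields multiple parallel paths between the same pair of servers, which is what BCube's source routing can exploit for load balancing and fault tolerance; the sketch fixes one arbitrary order.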