The Duality Between Message Routing and Peer-to-Peer Data Replication
Rama Ramasubramanian

Outline:
- Message routing: delay-tolerant networks, mobile ad hoc networks, P2P overlay networks
- Data replication: Bayou, Cimbiosys, PRACTI
- Intersection?

Overview of message routing
- Routing is about sending messages from a source to a destination.
- The network consists of mobile and stationary nodes.
- Nodes may be unavailable or partitioned; a contemporaneous end-to-end path from source to destination may or may not exist.
- Message routing systems: delay/disruption-tolerant networks (DTNs), etc.

Goals of message routing
- Maximize delivery success
- Minimize latency
- Minimize overhead

Common features in routing protocols
Goal: maximize delivery
- Store and forward: buffer messages while waiting for future delivery opportunities
- Retransmission: transmit messages multiple times to improve delivery success
- Multi-path forwarding: replicate messages at intermediate nodes to exploit alternative paths
Goal: minimize delay
- Path selection: send messages along "optimal" paths selected by the routing policy
Goal: minimize overhead
- Duplicate suppression: remember the IDs of previously received messages
- Time-to-live: impose a hard limit on the number of transmissions

The duality: message routing <-> P2P data replication (Cimbiosys!)
- Partition/fault tolerance <-> Disconnected operation
- Delivery success <-> Eventual consistency
- Path selection <-> Partial replication
- Duplicate suppression <-> At-most-once delivery
Bayou and PRACTI provide a subset of these properties but could be extended to provide all of them.

Cimbiosys definitions
- Item: a data object to be replicated
- Filter: a predicate over the contents of items, specifying which items a node is interested in
- Knowledge: metadata describing the items known to or stored by a node

Figure: Cimbiosys sync between a target node and a source node. The target sends its knowledge and its filter (i.e., which items it is interested in) to the source; the source sends back the items that match the filter and are not already covered by the target's knowledge. (A sketch of this exchange appears after the key properties below.)

Key properties of Cimbiosys
- Connectivity independence: a node can exchange items with any other node it encounters
- Compact knowledge representation: the overhead of exchanging knowledge is small, and the size of the knowledge state is independent of the number of items
- Eventual filter consistency: every node eventually stores exactly the items that match its filter
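As a concrete illustration of the sync exchange and the compact knowledge representation, here is a minimal sketch in Python. It is not Cimbiosys's actual code or wire format: the class names and message schema are hypothetical, and knowledge is modeled as a simple version vector (origin node -> highest sequence number seen), which assumes items from each origin arrive in order; real Cimbiosys knowledge is richer.

    # Sketch only: hypothetical names, not the Cimbiosys implementation.

    class Item:
        def __init__(self, origin, seq, contents):
            self.origin = origin      # node that created the item
            self.seq = seq            # per-origin sequence number
            self.contents = contents  # payload, e.g. a dict of fields

    class Node:
        def __init__(self, node_id, filter_pred):
            self.node_id = node_id
            self.filter = filter_pred  # predicate over item contents
            self.store = {}            # (origin, seq) -> Item
            self.knowledge = {}        # version vector: origin -> max seq seen

        def sync_from(self, source):
            """Target side: send knowledge + filter, receive unknown matches."""
            for item in source.items_for(self.knowledge, self.filter):
                self.store[(item.origin, item.seq)] = item
                self.knowledge[item.origin] = max(
                    self.knowledge.get(item.origin, 0), item.seq)

        def items_for(self, target_knowledge, target_filter):
            """Source side: stored items matching the filter that the
            target's knowledge does not already cover."""
            for item in list(self.store.values()):
                if target_knowledge.get(item.origin, 0) >= item.seq:
                    continue  # target already knows it: nothing transferred
                if target_filter(item.contents):
                    yield item

    # Example: n2 pulls messages addressed to it from n1; a second sync
    # transfers nothing, since n2's knowledge now covers the item.
    n1 = Node("n1", lambda m: True)
    n2 = Node("n2", lambda m: m.get("to") == "n2")
    n1.store[("n1", 1)] = Item("n1", 1, {"to": "n2", "body": "hello"})
    n1.knowledge["n1"] = 1
    n2.sync_from(n1)
    n2.sync_from(n1)

Because the knowledge is a per-origin vector rather than a list of item IDs, its size stays independent of the number of items, and already-known items are never re-sent, two of the points raised in the Q/A below.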
Message routing as a Cimbiosys app
- Messages are replicated data items.
- A filter specifies the messages a node is interested in receiving, i.e., msg.to = the node's own address. This captures both unicast and multicast semantics (see the example predicates below).
- Routing heuristics are implemented as extensions to filters.

Figure: sync with routing enabled (three changes add routing). Routing state (policy, state) is added inside Cimbiosys. The sync protocol is modified slightly: as usual, the target sends its knowledge and filter, but in addition it sends optional routing data. Using the routing data that the target sends, together with its own local state, the source sends back the messages matching the filter, plus others chosen by the routing policy. (A sketch of the three changes follows the list below.)

Implementing DTN policies
- Epidemic routing
- Spray and Wait (used as the running example)
- PROPHET
- MaxProp
- Others in the paper: MANET protocols (AODV and OLSR) and P2P (Gnutella)
The paper also outlines how each of these is implemented in the same framework.
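For instance, the unicast and multicast cases might look like the following filter predicates (illustrative; the message schema matches the hypothetical sketch above):

    def unicast_filter(msg):
        # msg.to = this node's address (here hardcoded as "n2")
        return msg.get("to") == "n2"

    def multicast_filter(msg):
        # interested in any message published to a group this node joined
        return "sensor-updates" in msg.get("groups", ())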
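Continuing the earlier sketch, the three changes might be layered on as follows. The policy interface (routing_data, should_forward) is hypothetical, chosen to mirror the figure's description, not the paper's actual API:

    class RoutingNode(Node):
        def __init__(self, node_id, filter_pred, policy):
            super().__init__(node_id, filter_pred)
            self.policy = policy  # change 1: routing policy + state in Cimbiosys

        def sync_from(self, source):
            # Change 2: the target sends optional routing data along with
            # its knowledge and filter.
            routing_data = self.policy.routing_data(self)
            items = source.items_for_routing(
                self.knowledge, self.filter, routing_data)
            for item in items:
                self.store[(item.origin, item.seq)] = item
                self.knowledge[item.origin] = max(
                    self.knowledge.get(item.origin, 0), item.seq)

        def items_for_routing(self, target_knowledge, target_filter, routing_data):
            # Change 3: return filter matches plus extra messages chosen by
            # the routing policy from the source's local state.
            for item in list(self.store.values()):
                if target_knowledge.get(item.origin, 0) >= item.seq:
                    continue
                if (target_filter(item.contents)
                        or self.policy.should_forward(item, routing_data)):
                    yield item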
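Spray and Wait, the running example, gives each message a fixed budget of copies; in the binary variant, a node holding more than one copy hands over half at each encounter, and a node down to its last copy "waits", delivering only on direct contact with the destination. (The "message1 had 6 copies, divided by 2" exchange in the Q/A below refers to this halving.) A minimal sketch against the hypothetical interface above; a real implementation would carry the handed-over copy count with the message, which is omitted here:

    class SprayAndWait:
        def __init__(self, initial_copies=8):
            self.initial_copies = initial_copies  # budget set by the policy
            self.copies = {}                      # (origin, seq) -> copies held

        def routing_data(self, node):
            # The target identifies itself so the source can deliver
            # directly during the wait phase.
            return {"node_id": node.node_id}

        def should_forward(self, item, routing_data):
            key = (item.origin, item.seq)
            held = self.copies.get(key, self.initial_copies)
            if held > 1:
                # Spray phase: hand over half the copies, keep the rest.
                self.copies[key] = held - held // 2
                return True
            # Wait phase: forward only to the message's destination.
            return item.contents.get("to") == routing_data["node_id"]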
Figure: Example.

Conclusions:
- Peer-to-peer data replication provides several key properties desirable in message routing.
- Message routing systems can be implemented on top of partial P2P data replication systems.
- DTN routing was implemented on Cimbiosys.

Q/A:
Q: Do you see Cimbiosys evolving to be interoperable with the other DTN systems?
A: I think the protocol implementation can actually use some of the standards from the DTN work; I think we can make them interoperable.
Q: Do you think you will?
A: We will hopefully do that; it depends. The interest at Microsoft in DTNs is a little less, but if it is a good collaboration...
Q: In the example, suppose node1 already had a copy of message2. Would it then get any more copies of message2, or would there simply be no data transferred?
A: There would be no data transferred. The knowledge is a set of messages that the node already knows.
Q: Didn't you say that the knowledge is independent of the replicas in the system?
A: The size of the knowledge is independent of the number of items in the system. Imagine each message has a sequence number; synchronization then means the node needs to keep only the latest sequence numbers, keeping the latest state, with a vector describing the state.
Q: Message1 had 6 copies, divided by 2, so clearly the original number in that case was not 2^n. Who decides the original number?
A: The routing policy decides this.
Q: In the Spray and Wait example, are you implying that there are two P2P systems running, one for the routing state and one for the messages, or are the message and routing distribution integrated?
A: There is an item store which is storing all of the messages, buffering them, and the routing policy is a filter dictating which messages are routed. The process has access to this data, so it can cache it.
Q: So the messaging app may have a very different filter than the routing?
A: The way the system is layered, the routing protocol appears as an application to Cimbiosys. So the "application" of the routing protocol is going to keep the state and the knowledge store.
Q: Is there ever any problem reconciling these?
A: Could you give an example of a problem?
Q: Where there are messages changing/propagating new routing state, and messages propagating the data associated with the messaging app?
A: We can treat all of them as messages that are going to be replicated by Cimbiosys. It's actually kind of integrated. Most of the routing protocol itself works in very much the same way, except it doesn't have to use some of the mechanisms for data replication, etc.
Q: In almost any DTN, you're going to get this weird anti-locality: you would like a relatively short time distance between two nodes, but message locality is going to be very small. The animals don't run very fast unless they're cheetahs (joke). Have you had any experience with that, where data mules wander into a situation where there is a whole bunch of messages that nobody has seen?
A: That can actually happen.
Q: I'd be really interested in seeing the traffic properties of this.
A (Kevin Fall): We did that once, on a ferry with oceanographic instruments; when it came into port it would be in radio range, and all that the DTN part did was bridge UDP. Without any kind of congestion control, the first time around it kicked over the server.
Q: Why didn't you use TCP and split the connection?
A: We could have done that, but the goal in that instance was to see what would happen if we didn't modify anything in the existing system.
Q: How big is Cimbiosys?
A: We have it working on mobile phones, but I don't know about motes.

Open vSwitch: Extending Networking into the Virtualization Layer
Ben Pfaff, …, Scott Shenker

Outline:
- Virtualization and networking
- The Open vSwitch approach
- Applications
- Implementation

Virtualization will be pervasive
- Gartner: 12% of workloads are virtual today; 61% by 2013.
- Intel: all end hosts should be virtualized.

Networking in virtual environments is important
- One cloud is planning to run 128 VMs per host. That's 2+ full racks in one machine.

Networking in virtual environments is different
Challenges:
- Scalability (10^5 VMs)
- Isolation: physically together, logically disconnected
- Mobility: virtual machines move across the network, as opposed to physically
- ...
Conveniences:
- Hypervisor information, e.g. which multicast groups each VM is listening on
- Introspection
- Leaf nodes: e.g., no need to run the spanning tree protocol at the virtual server itself
- ...

Open vSwitch approach
- Distribute the switch
- Centralized control
- Take advantage of the conveniences

Basic design (Xen)
Figure: diagram showing the design. The basic design would also apply to VMware servers and KVM running on Linux. Multiple VMs (user domains in Xen) run on a physical host, each with its own virtual NIC. The control layer (Dom0) runs ovs-vswitchd, to which each virtual NIC and the physical NICs are connected. The controller connects to ovs-vswitchd on this and other Xen hosts, and to the administrative CLI/GUI.

What does Open vSwitch do?
Connects to a controller:
- Configuration
- An OpenFlow connection to set up flows and manage access control lists
Features:
- VLANs
- Port mirroring
- ACLs
- NetFlow
- Bonding
- QoS
- Anything*
Figure: Web UI diagram.

Open vSwitch application: multiple distributed switches
Physical view: VM hosts 1-n connect through GRE tunnels to a physical vSwitch, and the controller manages all of this. Logical view: what you would expect.

Open vSwitch application: extending the data center into the cloud
Figure: the same configuration as before, except the customer data center is completely independent. Occasionally part of the customer's data center needs to expand into the cloud: tunnels (GRE/IPsec/SSL) to the cloud access server, facilitated by the controller.

Implementation
There is a fast-path kernel module holding direct-mapping flows that go straight between the vNICs and the NICs. If a packet doesn't match a flow, ovs-vswitchd is called to set up a new direct mapping into the fast path for future use. (A sketch of this split follows below.)
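A minimal sketch of this fast-path/slow-path split (illustrative Python, not the actual Open vSwitch code; the real fast path is a C kernel module, and real flow keys match many more header fields):

    class SlowPath:
        """Stand-in for ovs-vswitchd: applies policy (in the worst case by
        asking the OpenFlow controller) and returns an action."""
        def __init__(self, acl):
            self.acl = acl  # e.g. {(eth_src, eth_dst): allowed?}

        def handle_miss(self, key, packet):
            allowed = self.acl.get((packet["eth_src"], packet["eth_dst"]), False)
            return "output:eth0" if allowed else "drop"

    class FastPath:
        """Stand-in for the kernel module: an exact-match flow table
        consulted for every packet."""
        def __init__(self, slow_path):
            self.flows = {}  # flow key -> action
            self.slow_path = slow_path

        def receive(self, packet):
            key = (packet["in_port"], packet["eth_src"], packet["eth_dst"])
            action = self.flows.get(key)
            if action is None:
                # Miss: punt to user space, which installs a direct mapping
                # so future packets stay on the fast path.
                action = self.slow_path.handle_miss(key, packet)
                self.flows[key] = action
            return action

    # First packet misses and goes to "user space"; later packets hit.
    dp = FastPath(SlowPath({("aa:aa", "bb:bb"): True}))
    pkt = {"in_port": 1, "eth_src": "aa:aa", "eth_dst": "bb:bb"}
    dp.receive(pkt)  # miss -> slow path -> flow installed
    dp.receive(pkt)  # handled entirely by the flow table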
Open vSwitch is Fast
As fast as the Linux bridge, with the same CPU usage.
Bandwidth:
- Fast path: > 1 Gbps
- ovs-vswitchd (worst case: every packet goes to user space): 100 Mbps
- Controller (worst worst case): 10 Mbps
Latency:
- Fast path: < 1 us
- ovs-vswitchd: < 1 ms
- Controller: ms
Figure: graph showing bandwidth for various transfer sizes.

Hardware acceleration: inevitable
- Netronome: right approach
- VN-Tag: wrong approach
- VEPA: powerless

Future directions
- Physical switches
- Upstream kernel integration
- Anything*

Q/A:
Q: Can you speak a little bit to the benefit you get in terms of isolation, with Open vSwitch vs. the Xen bridge?
A: We have a controller, and you can tell the controller which groups are allowed to talk to each other. The Linux bridge can't be configured that way; the feature just isn't there.
Q: Given that you're running 128 VMs… why run so many machines on each host?
A: It's the cheapest way to run a cloud; usually the design goal is to get as many customer dollars per your dollar as possible.
Q: That makes each VM very, very weak, right?
A: I think these are 8-CPU systems with some number of cores per CPU and a ridiculous amount of memory, 16-32 GB? I'm sure that if it were cheaper to do 2-CPU systems with 32 VMs, they would do it that way.
Q: Would there be a performance difference?
A: I doubt there would be a performance difference; our internal development is on 2- and 4-CPU systems. The implementation is almost all read-only access, not a lot of locks or anything, so I wouldn't expect a big performance difference.
Q: In a lot of these applications, you make the intelligent decisions right at the edge, and as soon as the decision is made by the first switch, as soon as you pick the core switch, everything else follows from the spanning tree. It seems that if your edge switch is right on the end host itself, you can have dumb iron in the rest of the network.
A: I agree. My guess is that within a few years, OpenFlow is just going to be a checklist feature on switches. If that's true, there won't be any more dumb switches. But you can do it without smart switches in the core.
Q: The point is that right now the hit of going through the controller is something on the order of 10 ms; some of the special-purpose implementations do a little bit better. So you don't want to go through it if you can avoid it. But the next question is: assume these switches have OpenFlow, so you don't have to go through the slow path as often.
A: If you have a physical switch with OpenFlow on it, I don't think there is going to be an advantage. The reason we are doing it this way is that we don't have that; we're essentially building our own switches. It might be better to just use the dumb switches and set up the headers, and not use OpenFlow at all.
Q: You mentioned that Broadcom was looking at this?
A: There was someone on the OpenFlow forums implementing OpenFlow on Broadcom hardware.
Q: There are implementation details about how companies like Amazon and Arista are working together, but is there more difference in the network design?
A: You mean how much diversity there is between these designs? The one thing I know is that some of them are doing switching and some of them are doing routing, but I don't know enough about the details of how they're doing it to say anything more.
Q: But most of them are using existing protocols; you're not necessarily inventing new wrappers or shim layers or anything?
A: I prefer not to invent new network protocols if I don't have to, because then you have a chance of working with existing hardware.
Q: How does this compare to FlowVisor?
A: FlowVisor is a system for making one OpenFlow switch look like multiple independent OpenFlow switches; as far as I know it isn't distributed and doesn't have the manageability features that Open vSwitch does. I presume you designate some field on which you're slicing your network, i.e., a VLAN or IP field or whatnot, for keeping experiments separate. It's not only one field you can slice on… a config file that looks like a set of firewall rules.