Session 9: Performance Optimization
Scribe: rtorresg@purdue.edu
--------------------------------

ROAR: Increasing the Flexibility and Performance of Distributed Search
Costin Raiciu (UCL), Felipe Huici (UCL, NEC Labs), Mark Handley (UCL), David S. Rosenblum (UCL)

- Problem:
  - Many applications use distributed search because their datasets are large (e.g. web search, eBay, Wikipedia).
  - Search latency over such a dataset is high; it is important to reduce it.
- Traditional approach:
  - Split the data P ways and store it on P servers.
  - When a query arrives, send it to all P servers.
  - The data on each of the P servers is replicated across a cluster of servers for reliability.
  - total_number_of_servers = number_of_partitions (P) * number_of_replicas.
  - P dictates how much data each server stores.
- Contribution of this paper:
  - Change P dynamically at runtime.
  - But how do you choose P without knowing the workload beforehand?
  - One answer is to use the maximum possible P, but then the average CPU load increases at high P under high load.
  - Each cluster has some fixed per-query overhead (network latency, etc.); the larger P is, the higher the total overhead.
  - Current schemes (such as the one used by Google) use a fixed, large P that is difficult to change.
  - How does Google change P?
    - Overprovisioned clusters.
    - Copying data to a new set of clusters is very expensive.
- Their approach:
  - Observation: there is no need to divide servers into disjoint clusters; what matters is that every data item is replicated and that every query visits at least one replica of each item.
  - Leverage consistent hashing: each server gets an ID on a ring, and each document gets an ID.
  - Store an object by hashing it and placing it on consecutive servers in the ring (its replica group; by the formula above, the replication level is total_number_of_servers / P).
  - A query arrives at a Front End (FE) server:
    - Compute a starting point (possibly random).
    - Partition the query P ways so that it covers every object's replica group and hits each object at most once. (A toy sketch of this placement and query partitioning appears after the Q & A below.)
  - P can be raised at will: servers adjust and drop data, since a higher P means fewer replicas per object.
  - If P decreases, data must be copied to the new replicas.
- Fault tolerance:
  - If a host fails, the FE times out the query and re-queries another replica nearby on the ring.
- Evaluation:
  - Can ROAR change P dynamically?
  - Start the network with P = 40.
  - When P decreases (the most expensive case, since it requires copying data to new replicas), query delays are not affected on average.
- Q & A:
  - Q: You could have a bottleneck if you use only one FE. Any problems with having many FEs?
  - A: The system has no problem scaling.
  - A: There is no problem having many FEs; the tasks of controlling P and receiving queries could be separated.
  - Q: Could you use any cost function to optimize P?
  - A: Here we used query delay, but the policy could use anything.
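A minimal sketch, in Python, of the ring placement and query partitioning described above (my reading of the talk, not the authors' implementation; the server count, ring size, and exactly even server spacing are illustrative assumptions):

import hashlib
import random

N, P = 40, 8                               # illustrative server count and partition count
RING = N * P * 1000                        # toy ring size chosen so the arithmetic is exact
SERVERS = [i * RING // N for i in range(N)]  # assumed evenly spaced server positions

def ring_hash(key: str) -> int:
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % RING

def successors(point: int, count: int) -> list:
    """The first `count` servers clockwise from `point` (wrapping around the ring)."""
    idx = next((i for i, s in enumerate(SERVERS) if s >= point), 0)
    return [SERVERS[(idx + k) % N] for k in range(count)]

def store(obj: str) -> list:
    """Replica group: N/P consecutive servers starting at the object's hash point."""
    return successors(ring_hash(obj), N // P)

def query_targets() -> list:
    """Split a query P ways: one server near each of P evenly spaced points."""
    start = random.randrange(RING)
    return [successors((start + k * RING // P) % RING, 1)[0] for k in range(P)]

# With evenly spaced servers, each replica group spans 1/P of the ring, so at
# least one of the P query points lands on a server holding any given object.
replicas = store("some-document")
assert any(t in replicas for t in query_targets())

Changing P in this sketch only changes how a query is split and how large each replica group is; no disjoint clusters have to be rebuilt, which is the point of the design.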
----------------------------------------------------------
----------------------------------------------------------

Safe and Effective Fine-grained TCP Retransmissions for Datacenter Communication
Vijay Vasudevan (Carnegie Mellon University), Amar Phanishayee (Carnegie Mellon University), Hiral Shah (Carnegie Mellon University), Elie Krevat (Carnegie Mellon University), David Andersen (Carnegie Mellon University), Greg Ganger (Carnegie Mellon University), Garth Gibson (Carnegie Mellon University and Panasas, Inc.), Brian Mueller (Panasas, Inc.)

- Problem:
  - Datacenter (DC) clients and servers use TCP to communicate.
  - A client sends a request and some of the returned packets get dropped.
  - Retransmission then happens after 200 ms, which may be too long for DC applications.
  - An application may need to wait for results from 1000s of servers, and a 200 ms delay per server can be a killer.
- Summary of solution:
  - Allow microsecond retransmission timeouts in TCP.
  - This can improve DC application throughput and latency.
  - The change to TCP has no adverse network-wide effects.
- Datacenter environment:
  - High-bandwidth links, low delay, commodity Ethernet switches.
  - Packet losses are frequent under heavy load.
- TCP recovery from losses:
  - Data-driven recovery (microseconds; very fast).
  - Timeout-driven recovery (milliseconds; slow): a retransmission timeout takes 1000s of RTTs.
  - Retransmission Timeout (RTO) estimation has a minimum bound: the minimum RTO is 200 ms in most operating systems.
- The incast workload:
  - The client sends a request to all servers; the servers send back responses that are assembled by the client and delivered to the application.
  - These simultaneous responses can overfill buffers at intermediate switches, which causes packets to be dropped.
  - The client then waits out the full RTO bound to get all the packets.
- Experiment with a latency-sensitive application:
  - How long does it take to receive 4 MB distributed across 16 servers?
  - Responses delayed by the 200 ms TCP timeout increased the total delay by a factor of 7.
- Their approach:
  - Allow TCP retransmissions to fire in microseconds.
  - Removing the RTO bound is a single-line change in Linux, but it does not change the RTT measurement granularity: the Linux TCP stack still uses a granularity of 1 millisecond, so eliminating the bound alone only gives millisecond timeouts (still too large).
  - To provide microsecond timeouts (more difficult than a single-line change):
    - Measure RTT in microseconds: modify the internal data structures and the timestamp option.
    - Use efficient high-resolution kernel timers to resend packets quickly enough (HPET for efficient interrupt signaling).
  - With these two changes and no RTO bound, throughput is good. (A sketch of the RTO estimator with and without the 200 ms floor appears after the Q&A below.)
  - All responses were returned within 40 microseconds in their experiments.
- Is it safe?
  - Delayed ACKs:
    - If RTO > 40 ms, the receiver holds the ACK for up to 40 ms before sending it back.
    - If RTO < 40 ms, the sender times out before the delayed ACK arrives and retransmits the packet (i.e. a premature timeout).
    - Because of this, throughput dropped by 15%. It is a reasonable trade-off; this is work in progress.
  - Network-wide effects:
    - Can we cause collapse by sending more packets more quickly? Do we time out unnecessarily too often?
    - Timeouts retain exponential backoff.
    - A spurious timeout slows the transfer rate (which might hurt performance), but removing the 200 ms minimum timeout is OK.
    - Showed an experiment with BitTorrent using modified and normal TCP clients; throughput is not affected.
- Q&A:
  - Q: Flip back to slide 18. There is a drop in the red curve (full implementation) -- why?
  - A: Don't know.
  - Q: Giant burst losses generate the timeouts, and losses at the servers are synchronized because the requests are synchronized. Should they be desynchronized?
  - A: Application-level solutions can work, but this is a more general solution.
  - Q: What about recovery from congestion conditions?
  - A: Retransmitting faster would be bad, but the exponential backoff handles this.
  - Q: To implement the microsecond RTO you need to change the kernel. Is there any increase in CPU overhead?
  - A: Looking at the output of top, CPU overhead was not significantly higher.
  - Q: Timeouts, no matter how short, will still be bad. Solve it at the switching level?
  - A: Several different switches were considered and they don't seem to alleviate the problem.
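A minimal sketch of the standard TCP RTO estimator (Jacobson's algorithm, RFC 6298) evaluated in microseconds, to show what the 200 ms floor discussed above does to datacenter-scale RTTs. The smoothing gains are the standard ones; the RTT samples are illustrative, not values from the paper, and the clock-granularity term is ignored for simplicity:

ALPHA, BETA = 1 / 8, 1 / 4           # standard RFC 6298 smoothing gains

def rto_estimates(rtt_samples_us, min_rto_us):
    """Yield the RTO after each RTT sample, all in microseconds."""
    srtt = rttvar = None
    for rtt in rtt_samples_us:
        if srtt is None:
            srtt, rttvar = rtt, rtt / 2                       # first sample
        else:
            rttvar = (1 - BETA) * rttvar + BETA * abs(srtt - rtt)
            srtt = (1 - ALPHA) * srtt + ALPHA * rtt
        yield max(srtt + 4 * rttvar, min_rto_us)

samples = [120, 90, 150, 110, 100]   # ~100 us datacenter RTTs (illustrative)

# With the conventional 200 ms floor, the timeout is pinned at 200,000 us ...
print(list(rto_estimates(samples, min_rto_us=200_000))[-1])   # 200000
# ... while with no floor (and microsecond timers) it tracks the actual RTT.
print(list(rto_estimates(samples, min_rto_us=0))[-1])         # roughly 250 us

Dropping the floor only helps if the stack also measures RTT and fires timers at microsecond granularity, which is the harder part of the change described above.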
----------------------------------------------------------
----------------------------------------------------------

Matchmaking for Online Games and Other Latency-Sensitive P2P Systems
Sharad Agarwal (Microsoft Research), Jacob R. Lorch (Microsoft Research)

- Problem:
  - Game matchmaking: clients select groups of players with low latency to each other.
  - Clients send several probes, but there are also key exchanges, NAT traversal, and connectivity issues.
  - Hence clients only have time to probe a few peers, so they won't find the best match.
  - As a result, clients can be matched with annoyingly sluggish peers.
- Summary of solution:
  - Propose a latency predictor: an imperfect predictor that can pick the best match in nanoseconds.
  - Probing is still needed, but only against a small set of clients.
  - Two general approaches exist: geolocation and network coordinates. The authors combine the two in Htrae, a latency predictor that can be used in many different applications.
- Background on geolocation and network coordinates:
  - Geolocation: look up the IP address's location in a database and compute the distance in km, which is linearly related to RTT.
  - Network coordinate systems (e.g. Vivaldi): nodes are represented by points in a space, and the distance between points is a prediction of RTT. Pyxida is the state of the art in network coordinate systems and was used for comparison.
  - Weaknesses of the current approaches:
    - Geolocation: inflexibility; errors in the database lead to consistently bad RTT predictions.
    - Network coordinates: sensitivity to initial conditions.
- Their approach (a toy sketch of the resulting prediction appears after the evaluation notes below):
  - Geographic bootstrapping:
    - Use a spherical coordinate space instead of a Euclidean one, with a height component as part of the distance.
    - Each node geolocates itself initially; nodes then shift around under the network coordinate updates, but they start from a better initial point.
    - This avoids poor global coordinates while staying flexible, since nodes can still move.
  - AS correction:
    - There are problems estimating the distance to nodes in the same AS.
    - With AS correction, reduce the height by 20% to increase the chance of clients selecting peers in the same AS.
    - Some ASes are very large; in that case, do not apply the AS correction.
  - Symmetric updates:
    - Send an additional packet so that both A and B improve their positions in the coordinate system.
- Evaluation:
  - Dataset: 30 days, 3.5 million machines, 50 million probes.
  - Trace replay: each session, ordered by timestamp, has a client IP trying to learn about server IPs; the predictor's latency estimate for each server is compared to the actual latency.
  - How much error? Htrae is better than geolocation and Pyxida.
  - How well does it find the best server?
    - Metric: the cost of choosing a bad server.
    - Htrae finds the best server 70% of the time.
    - Compared to actually deployed systems (OASIS and iPlane), Htrae does a lot better; the deployed systems have to guess a lot.
    - To be fairer, discard the points where those systems have to guess; Htrae still does better than both.
  - How effective are predictors for client matchmaking?
    - Metric: probability of finding a client with RTT < 75 ms vs. the number of candidate clients.
    - Htrae does much better than random selection (what is used today).
  - Small live deployment:
    - Showed an example of a geolocation error being corrected by the network coordinate component.
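A toy sketch of an Htrae-style prediction (see "Their approach" above): each node's coordinate is a position on a sphere, bootstrapped from geolocation, plus a Vivaldi-style height, and the predicted RTT is great-circle distance converted to time plus both heights. The km-to-ms factor, the sample coordinates and heights, and the way the 20% AS correction is applied are my own illustrative assumptions; the real system also refines positions with Vivaldi-style updates:

import math

EARTH_RADIUS_KM = 6371
MS_PER_KM = 0.02             # assumed rough RTT cost per km of great-circle distance
SAME_AS_HEIGHT_SCALE = 0.8   # the talk's AS correction: reduce height by 20%

def great_circle_km(a, b):
    """Haversine distance between (lat, lon) pairs given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a[:2], *b[:2]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))

def predict_rtt_ms(a, b, same_as=False):
    """a, b = (lat_deg, lon_deg, height_ms); heights model access-link delay."""
    height = a[2] + b[2]
    if same_as:
        height *= SAME_AS_HEIGHT_SCALE
    return MS_PER_KM * great_circle_km(a, b) + height

seattle = (47.6, -122.3, 10.0)   # illustrative coordinates and heights
london = (51.5, -0.1, 12.0)
print(round(predict_rtt_ms(seattle, london), 1))   # roughly 175 ms with these toy numbers

The geographic bootstrap is what gives a good starting point; the coordinate updates can then correct database errors, as in the live-deployment example above.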
- Q&A:
  - Q: You have a set of candidate IPs and want to find a good subset of them. Could a P4P or ALTO approach help, or has it already solved this?
  - A: Interesting alternative approach -- not evaluated. That approach may take too much time.
  - Q: But the ISP can do it very fast, since they know the topology.
  - Q: Is there something wrong with probing so many nodes?
  - A: You would have to do key exchange and NAT probing with 100 nodes before communicating.
  - Q: In practice, to build the system, where do you get the initial parameters (e.g. training)?
  - A: There are not that many parameters.
  - A: You hardwire the parameters into the system; the Xbox could be updated at any time.
  - Q: There are 13 probes per client in the data set. Is that representative?
  - A: The 50M probes come directly from the data.
  - Q: Even with this solution, the bottom 5% is really bad. Would it be better to use relative delay (using something like Ono)?
  - A: Maybe. (The answer was not clear.)
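A minimal sketch of the matchmaking flow the paper targets: score the whole candidate pool with a cheap predictor, probe only a short list, and keep peers under the 75 ms bar used in the evaluation. The pool, the stand-in predictor and probe, and the shortlist size are all illustrative assumptions; in the real system the predictor would be Htrae and the probe an actual network measurement:

import random

def pick_matches(candidates, predict, probe, shortlist_size=8, threshold_ms=75.0):
    """Rank by predicted RTT (nanosecond-cheap), probe only the top few (expensive),
    and return the candidates whose measured RTT clears the threshold."""
    ranked = sorted(candidates, key=predict)
    measured = [(c, probe(c)) for c in ranked[:shortlist_size]]
    return [c for c, rtt in measured if rtt < threshold_ms]

# Illustrative use with synthetic candidates: true RTTs drawn at random, a noisy
# predictor standing in for Htrae, and the probe returning the true RTT.
true_rtt = {i: random.uniform(10, 300) for i in range(500)}
matches = pick_matches(
    candidates=list(true_rtt),
    predict=lambda c: true_rtt[c] * random.uniform(0.7, 1.4),  # imperfect predictor
    probe=lambda c: true_rtt[c],                               # "measurement"
)
print(len(matches), "matches under 75 ms after probing only 8 candidates")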