Ajay Mahimkar (University of Texas at Austin, US); Jennifer Yates (AT&T Labs - Research, US); Yin Zhang (University of Texas at Austin, US); Aman Shaikh (AT&T Labs - Research, US); Jia Wang (AT&T Labs - Research, US); Zihui Ge (AT&T Labs - Research, US); Cheng Ee (AT&T, US)
Chronic network conditions are caused by performance-impairing events that occur intermittently over an extended period of time. Such conditions can repeatedly degrade performance for customers and can sometimes escalate into serious hard failures. It is therefore critical to troubleshoot and repair chronic network conditions in a timely fashion to ensure high reliability and performance in large IP networks. Today, troubleshooting chronic conditions is often performed manually, which makes it tedious, time-consuming, and error-prone. In this paper, we present NICE (Network-wide Information Correlation and Exploration), a novel infrastructure that enables troubleshooting of chronic network conditions by detecting and analyzing statistical correlations across multiple data sources. NICE uses a novel circular permutation test to determine the statistical significance of a correlation. It also allows flexible analysis at various spatial granularities (e.g., link, router, and network level). We validate NICE using real measurement data collected from a tier-1 ISP network, with positive results. We then apply NICE to troubleshoot real network issues in the same tier-1 ISP network. In all three case studies conducted so far, NICE successfully uncovers previously unknown chronic network conditions, resulting in improved network operations.
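To make the idea of a circular permutation test concrete, the following is a minimal sketch, not the NICE implementation. It assumes two equal-length, time-binned event series (for example, per-interval event counts from two data sources), uses the Pearson coefficient as the correlation measure, and reports an empirical p-value over all non-zero circular shifts; the function names and the choice of p-value are illustrative assumptions. Circularly shifting one series keeps the order of its samples, and hence its autocorrelation, intact while breaking its alignment with the other series, which is what distinguishes this test from a random shuffle.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def circular_permutation_test(x, y):
    """Return (observed correlation, empirical p-value).

    Illustrative assumption: the null distribution is built from the
    correlation of x with every non-zero circular shift of y, and the
    p-value is the share of shifts (plus the observed alignment itself)
    whose correlation is at least as large as the observed one.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have equal length of at least 2")
    observed = pearson(x, y)
    # Rotate y by k positions; sample order within y is preserved.
    shifted = [pearson(x, y[k:] + y[:k]) for k in range(1, n)]
    at_least_as_large = sum(1 for r in shifted if r >= observed)
    p_value = (1 + at_least_as_large) / (1 + len(shifted))
    return observed, p_value


if __name__ == "__main__":
    # Hypothetical bursty event indicator over 100 time bins.
    events = {3, 17, 18, 40, 41, 42, 77}
    series_a = [1 if i in events else 0 for i in range(100)]
    series_b = list(series_a)  # perfectly co-occurring events
    print(circular_permutation_test(series_a, series_b))
```

A small p-value indicates that the observed alignment between the two series is unlikely to arise from an arbitrary time offset. A threshold on a normalized score could be used in place of the p-value; that choice is left open here.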