7.5 Latency Variance


Figure 15: When round trip times vary by a median of 183ms, what does it mean to summarize a latency prediction with a single value?
\includegraphics{graphs/az/rtt/rtt-sd}

The prior ``barriers to accuracy'' paint a rosy picture: most have a fairly simple solution that practitioners can use to build more accurate, live coordinate systems. Wide variation in latency measurements between the same pair of nodes over a short period of time is a harder problem with broad ramifications. If variances are very large, what does it actually mean to ``predict'' the latency from one node to another? Using the data from our longest snapshot (5D), we show the standard deviation of latency between each pair of nodes in Figure 15. This problem affects other latency prediction systems as well. A reactive measurement service, such as Meridian, will be more error-prone or incur higher overhead if a small number of pings does not adequately capture the latency to a high-variance target. In fact, coordinate systems may be in a better position to address this problem because they can retain histories of inter-node behavior.
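
As a rough illustration of the per-pair statistic plotted in Figure 15, the sketch below groups round trip time samples by node pair and computes each pair's standard deviation. The `(src, dst, rtt_ms)` input format, the function name, and the choice of the population standard deviation are assumptions made for this example; the sample values are invented and are not taken from the measured data.

```python
from collections import defaultdict
from statistics import pstdev


def rtt_stddev_per_pair(samples):
    """Return {(src, dst): standard deviation of RTT in ms}.

    `samples` is an iterable of (src, dst, rtt_ms) tuples (assumed format).
    Pairs with fewer than two samples are skipped, since a single
    measurement says nothing about spread.
    """
    by_pair = defaultdict(list)
    for src, dst, rtt_ms in samples:
        by_pair[(src, dst)].append(rtt_ms)
    return {
        pair: pstdev(rtts)
        for pair, rtts in by_pair.items()
        if len(rtts) >= 2
    }


# Invented samples: one stable pair, one high-variance pair, one singleton.
samples = [
    ("a", "b", 100.0), ("a", "b", 102.0), ("a", "b", 98.0),
    ("a", "c", 50.0), ("a", "c", 450.0),
    ("a", "d", 75.0),
]
spread = rtt_stddev_per_pair(samples)
for pair, sd in sorted(spread.items()):
    print(pair, round(sd, 1))
```

A pair such as `("a", "c")` in this toy input has a spread several times larger than its smallest measurement, which is the situation where a single predicted value becomes hard to interpret.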

As reviewed in Section 4.1, we developed latency filters in previous work. They act as low-pass filters: anomalies are ignored while a baseline signal passes through. They also adapt to shifts in the baseline, such as those caused by BGP route changes. These filters assign each link a single value that conveys its expected latency. While we found that these simple filters worked well on PlanetLab, describing a link with a single value is not appropriate given the enormous variance we observe on some of Azureus' links.
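
To make the low-pass behavior concrete, here is a minimal sketch of one way such a filter could be built: a moving percentile over a short window of recent samples per link. Whether this matches the filters of Section 4.1 is an assumption, and the class name, the window size of 4, and the 25th percentile are illustrative choices rather than values taken from this section.

```python
from collections import deque


class MovingPercentileFilter:
    """Per-link filter that reports a low percentile of recent RTT samples.

    Hypothetical sketch: window size and percentile are assumed values.
    """

    def __init__(self, window=4, percentile=25):
        self.samples = deque(maxlen=window)  # oldest samples fall out
        self.percentile = percentile

    def update(self, rtt_ms):
        """Add one raw measurement and return the filtered latency."""
        self.samples.append(rtt_ms)
        ordered = sorted(self.samples)
        # Index of the requested percentile, clamped to the last element.
        idx = min(int(len(ordered) * self.percentile / 100), len(ordered) - 1)
        return ordered[idx]


link = MovingPercentileFilter()

# A stable baseline near 100 ms with one anomalous 900 ms sample.
baseline = [link.update(x) for x in [100, 101, 99, 900, 100, 102]]

# The baseline then shifts to 200 ms, as a route change might cause.
shifted = [link.update(x) for x in [200, 200, 200]]

print(baseline)
print(shifted)
```

In this toy run the isolated spike never appears in the filter output, while a sustained shift is picked up once enough of the window reflects the new baseline. The output is still a single number per link, which is exactly the limitation discussed above for high-variance links.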

Figure 16: A comparison of round trip times between two sets of node pairs using ICMP, raw application-level measurements, and filtered measurements. Pair (a) exhibits some variance, but shows a consistent baseline.
\includegraphics{data/rtt/res/fplanetlab2-iis-sinica-edu-tw-t12-226-19-3-cdf} \includegraphics{data/rtt/freq/icmp-vs-time-fplanetlab2-iis-sinica-edu-tw-t12-226-19-3}

Figure 17: With pair (b), the variance is so large that assigning this node a coordinate -- or putting it into a consistent Meridian ring -- is bound to be an error-prone process. The number in parentheses in the legend is the number of round trip time measurements in the cumulative distribution function.
\includegraphics{data/rtt/res/fplanetlab14-millennium-berkeley-edu-t84-18-25-152-cdf} \includegraphics{data/rtt/freq/icmp-vs-time-fplanetlab14-millennium-berkeley-edu-t84-18-25-152}

We ran an experiment in which we compared ICMP, filtered, and raw latency measurements taken at the same time. To determine which destination nodes to use, we started Azureus on three PlanetLab nodes and chose five ping-able neighbors after a twenty-minute start-up period. We then let Azureus continue to run normally for six hours while simultaneously measuring the latency to these nodes with ping. We plot the data in Figures 16 and 17. Figure 16 illustrates a pair similar to our PlanetLab observations: there was variance in both the raw application-level and ICMP measurements, but a consistent baseline that could be described with a single value. In contrast, Figure 17 portrays a high-variance pair: while the filter does approximate the median round trip time, it is difficult to say, at any point in time, what the latency is between this pair.

The dual problems of high latency variance and of adapting algorithms to cope with it are not limited to network coordinate systems. Latency and anycast services deployed ``in the wild'' must address them as well. While there may exist methods to incorporate this variance into coordinate systems -- either through ``uncertainty'' in the latency filters or in the coordinates themselves -- resolving this problem is beyond the scope of this paper.

Jonathan Ledlie 2007-02-23