7.1 Churn


7.1 Churn

Figure 10: Azureus nodes follow a typical peer-to-peer lifetime distribution curve. With 78 percent of its nodes in the system for less than one hour, it is difficult to incorporate the steady stream of newcomers with coordinates starting at the origin.
\includegraphics{graphs/nc-comp/live/lifetimes}

Distributed network coordinate algorithms traditionally consider churn as part of their network model. Researchers ask the question: given an existing, stable system, how quickly can a new node find a stable, accurate coordinate? Unfortunately, implicit in this question is the assumption that the existing system has converged, and this assumption breaks down in many large-scale distributed systems, including Azureus. As Figure 10, illustrates, Azureus follows the long-tailed lifetime distribution typical of peer-to-peer systems [4]. (Azureus clients track uptime using an internal, query-able statistic.)

Figure 11: Azureus nodes that have been in the system for longer periods have more accurate coordinates. This suggests that churn may hurt convergence of Internet-scale coordinate systems.
\includegraphics{graphs/nc-comp/live/uptime-re50}

Because coordinate updates were on the order of tens of seconds or sometimes minutes apart, nodes often did not have much time to settle into a stable position before they exited the system. Using the data from our crawl of the live network, we separated nodes into ones that had been in the system for an hour or more and those that had not. We plot the relative error experienced by these two groups in Figure 11. The data confirm that these short-lived nodes, which make up the majority of the system, are substantially less accurate than long-lived ones.

Figure 12: Coordinate systems that experience high churn rates and do not allow nodes to ``remember'' their previous coordinates have trouble converging.
\includegraphics{graphs/churn/churn}

We considered three potential solutions to the problem of sustaining a coordinate system under high churn rates. First, nodes could perform a rapid initial triangulation process before shifting to a lower update rate. However, adjusting the gossip rate over time has two problems: (a) ``passive'' (i.e. maintenance-free) coordinate systems have no control over gossip and (b) in an ``active'' system, it would be a new, complex knob. Second, we considered ``greedy optimization,'' where instead of just stepping once through the update process, nodes would repeat until a (local) minimum had been reached with respect to the currently known neighbors. Unfortunately, we found that this form of optimization does not work well until many neighbors are known, which is not the case early in a node's lifetime. Finally, we found a solution that is both extremely simple and had positive results in simulation: instead of starting from scratch when restarting a client, have it begin where it left off. We performed an experiment where we varied the amount of churn in simulation and toggled whether or not nodes ``remembered'' their coordinate on re-entry. In Figure 12, we show the results of this experiment. We found that when nodes started at the origin on re-entry, they had a deleterious effect not only on themselves, but on overall system convergence. In contrast, with this simple technique, accuracy remained about the same as when there was no churn. While this technique assumes limited drift (see next section), it appears to be a promising start to resolving the noxious effect of churn on live coordinate systems.

Jonathan Ledlie 2007-02-23