Disttibued algorithms and networks differ from their centralised counterparts in that failure is part of their DNA! i.e. they can continue despite failures and need to be designed to handle such failures. Failures can come in many forms – from crash failures (a component is knocked out and refuses to get up), transient failures (a component gets punched hard, falls down, recovers on wobbly feet!), Byzantine (a component (our fictitious boxer, as you have no doubt figured out by now) has taken money from bookies to fix the match and is trying to lose ruining the whole fight).
An adversary – the other boxer is a good way to model attacks – i.e. I seek to make a statement such that my boxer does well against the best boxer known; my boxer can stand forever against a boxer that does not use an undercut; my boxer does well against a boxer that throws random punches but may do poorly against one who knows that I didn’t have breakfast today;my referee does well against a bookie who’s paid my boxer to get the match fixed!
The paradigm which looks at keeping the distributed system running despite failures in some acceptable form (maybe degraded from the original) is called fault-tolerance. Self-healing is a fault-tolerance paradigm for reconfigurable networks, in a broad sense. Reconfigurable networks are networks which permit changes to be made quickly , say, in contrast with hard-wired networks where to change the network topology, you would have to physically disconnect and change wires. One example is overlay networks e.g the Internet which can be thought of as virtual connrections made over an underlay – a network of wirings which permit any two computers (say, with IP addresses) to connect to each other.
Self-healing can now be thought of as a game where an adverary removes/deletes/adds a node from the network/graph and the neighbours of the deleted node respond locally by creating new connections in the reconfigurable network with the objective maintaining global properties such as connectivity, distances, expansion etc while allowing only minimal increase in degree of individual nodes. Formalising and designing self-healing algorithms was the subject of my PhD thesis and this lecture should give some flavour of it.
The other topic this lecture deals with is compact routing – a very interesting topic that generated a lot of interest in the distributed graph algorithms community. Those with basic familiarity with routing will remember that to route messages in a network, two data structures are used: the routing table which is available at every node, and routing labels which are the address and related information that each packet contains. These can be thought of as corresponding to post-offices and addresses on letters to be delivered to individuals.
Now, the simplest way to biuld a routing table would be to run something like a Djikstra algorithm all pairs shortest path algorithm and at every node store which neighbour to forward a packet to for every destination. For example, if a node has three neighbours a,b, and c, and it gets a packet destined for node x, the node can see in its routing table that node b is on the shortest path from itself to x, and it can thus forward the packet to node b. This is called shortest path routing and is most desirable since it follows the quickest route to get packets across (ignoring other real world constraints such as bad weather, lazy porters etc etc..). However, the cost is that each node now stores an O(n) sized table where n is the number of nodes in the network. This is considered too much!
Thus, the research area of compact routing is obsessed with keeping the tables (and also the labels) to smaller (than n) size at the expense of approximating shortest paths – some thought on the matter leads to conclusion that this is an unavoidable tradeoff.
At the same time, what about the six degrees of seperation? – that idea states that not only do you know everybody in the world through a chain of six people, if these people were willing enough – they could even route a postcard from you to them! – thus, this is a really compact routing protocol. This effect was discovered by the Milgram experiment in the 1930s USA.
The difference is that this efffect relies a lot on geographical information i.e. Chicago is closer to New York than Albuquerque, thus, if I have a choice between a friend in New York and another in Albuquerque, I will send the packet to my friend in New York (i.e. I will do greedy routing) if my objective is to send the packet to a Barack Obama, resident of Chicago! In general, compact routing on networks has very little global information, thus, the challenge is higher. This information is gathered and a protocol built. This, however, creates a new challenge – what happens to the routing if the network changes. Compact self-healing routing is a research area that seeks to provide continuing compact routing when the network is changing, at least in the self-healing paradigm.