## BGP Path Hunting

(originally published Aug 25, 2008 on blogs.sun.com – republishing here as Oracle nuked it all, but it’s been referenced in places).

Path-hunting is the tendency of BGP, when a path is withdrawn by a remote AS, to “hunt” from one path to another, before finally converging. The problem has been described elsewhere, particularly by Geoff Huston (e.g. in this ISOC column on BGP dampening). Manav Bhatia also has a nice, graphic explanation of BGP path hunting. Tony Li and Geoff Huston additionally have authored a very interesting draft on the problem, draft-li-bgp-stability – which analogises BGP convergence with wave-front propogation, where the “front” of convergence bends with differences in path-propogation times, and expands out where ASes connect with multiple others.

This problem is very similar to[5], if not a special-case of, the well-known “count to infinity” problem inherent to the Bellman-Ford distance-vector protocols. With infinity being constrained, by BGPs built-in split-horizon behaviour, to the number of simple paths in the topology.

Path Hunting overview

In short: given some converged subset of the BGP topology, with A and B connected by a set of one or more distinct paths {1, …, n}, with the time taken for a message to propogate down path x of t[x] so that the set of paths has a directly corresponding set of propogation times T = {t[0], …, t[n]}. Then for any announcement which passes through A to B, BGP will reach a “useful convergence”[1] by or within a time of:

• For an UPDATE announcing a new/updated path: minimum(T)
• For a withdrawal: maximum(T)

The latter convergence time being due to the ‘hunting’ behaviour of BGP, and indeed consistent with the well-known “Good news travels quickly, bad news travels slowly”[3] behaviour of distance-vector protocols.

E.g. imagine A announces some prefix to C, and imagine (for simplicity) that B does not re-announce paths to E or H, with simple paths between A and B of {ACB, ACDEB, ACDFGHB, ACDFGEB, ACDEGHB}. I.e.:

```-A--C----------B
\        / \
D------E   H
\      \ /
F------G
```

So after a time of minimum(T), B will have received at least one UPDATE corresponding to a valid path and would be able to forward packets toward A for the prefix. Though the system is not converged, any further UPDATEs B receives can only cause it to select a better path, so the system is at least “usefully-converged” – it can route packets.

After a time of maximum(T), the state at each speakers will be fully converged, and B may then have the following paths in its RIB (arbitrarily making assumptions about preferences speakers may have), for whatever prefix it is, from among which B may pick its preferred path:

• From C, Path: CA
• From E, Path: EDCA
• From H, Path: HGFDCA

Note that B does not know about the ACDFGEB or ACDEGHB paths, however if the DE edge were withdrawn, E potentially could switch to G (and issue an update to B when it does so) and similarly with the FG edge and latter path. B can pick anyone of the above as its preferred path. I.e. though these paths are hidden from B, they can still come into play.

Imagine A now withdraws the prefix from C – so the connectivity to the prefix is now lost as far as the rest of the graph goes. After a time of minimum(T) B will receive at least one message (that might be a withdrawal from C, but it could just as well be an UPDATE from E as E hunts its preferred path over to G). Imagine it happens to come from the peer which it had selected as best, and results in B picking a new best path via another peer (e.g. because its a withdrawal, or an UPDATE with less favourable path-attributes). Then the next update arrives from the peer newly preferred, and so B picks again, and so on. So B’s best-path may well change between the following best-paths:

1. From E, Path EDCA
2. From H, Path HGFDCA
3. From H, Path HGEDCA
4. From E, Path EGFDCA

I.e. before B can converge on “there is no path”, it potentially will first “hunt” through various paths that are already dead[2].

Yet, B did receive a message which was triggered by As withdrawal – several in fact! If only B could use the first message to see that, as it was A which withdrew the prefix, that therefore all paths via A must be dead, then it could short-circuit the whole hunting process!

Possible Solutions

Huston and Li propose various possible changes to BGP to mitigate hunting, such as removing the MRAI timer, band-stop filtering of updates, path-exploration damping, etc, all of which are implementable without change to the BGP protocol itself. Other proposals include Brian Dickson’s second-best-backup proposal which seeks to speed up convergence by providing visibility of additional paths besides the best-path (such as the ACDFGEB path which is initially hidden to B in the example above), and also to provide a fast-path for withdrawals (this proposal requires changes to BGP). None of these proposals actually solve path-hunting, they instead try to either filter it out and/or speed it up.

Hunting can eliminated in BGP if it could be extended with mechanisms to allow BGP to recognise nodal changes in state, across paths. Other DV protocols (DSDV, AODV) solve similar problems by numbering messages/routes with a sequence-number. This sequenced-DV approach is problematic with BGP as it tends to require either that routers in an AS somehow maintain a shared counter, or that BGP must describe a more detailed router-router topology (rather than todays abstract AS-AS topology).

Some possible mechanisms that eliminate hunting:

• BGP-RCNA possible means to allow for convergence in O(minimum(T)) for all kinds of state-changes is described by Pei, Azuma, Massey and Zhang in “BGP-RCN: improving BGP convergence through root cause notification”. Speakers include an additional attribute of information with updates/withdrawals, called the “Root Cause Notification” consisting of a list of (ASN,sequence number) tuples. This allows a receiving speaker to compare announcements received via different paths and determine relative recency, so allowing it to always act on the most recent information.

The sequence number dependency is a significant barrier to deployment.

• AS Poisoning[4]

This is essentially a simplified version of BGP-RCN, that does away with the problematic sequence number. When withdrawing a prefix, a speaker can include information that “poisons” nodes (or even edges). E.g. if A sends some kind of “Poison: A” attribute with its withdrawal, and if that attribute will propogate along with the withdrawal, then B can conclude from the first withdrawal it receives that all paths that go via A are defunct. BGP can now converge on a prefix-withdrawn state in minimum(T). The downside being that convergence to the prefix-announced state instead occurs in maximum(T) – though it is possible that BGP can be “usefully converged” prior to that, which would be a more useful state than path-hunting.

So we can “fix” path-hunting, and maximise withdrawal convergence without introducing any co-dependent state with only mild changes to BGP, though at some expense to announcement convergence.

A further observation to be made is that IGP routing has switched from distance-vector to link-state (or “map-distribution”) because of the same convergence limitations. Further, there are proposals for BGP security extensions which overlay a link-stateish topology onto BGP, and the BGP-RCN approach is not far from a link-state protocol either. So it is perhaps not completely unreasonable to imagine whether some kind of topological-state mechanism could be bolted onto BGP, that could be used to augment the path-selection process.

So…

Solutions to the hunting problem either are imperfect local-hueristics (which may even make convergence worse in cases where they fail), or protocol extensions with fairly significant trade-offs (either in complexity or adverse side-effects). It seems unlikely there’s any magic bullet, given the length of experience we have with Bellman-Ford/DV protocols.

Corrections and comments would be appreciated.

1. “Useful convergence” being where speakers all have converged on some particular state, OR are in a certain semi-converged state such that any further messages still propogating can only cause speakers to transition their routes from an existing, valid path to another valid path (this can not be the case if there are any withdrawal messages still to propogate). I.e. forwarding remains loop and blackhole free.

2. This is a problem as it means that if, outside of the subset we’re consider, B has other, less-preferred connections, then B will have to wait max(T) before it can decide there are no valid paths in this subset and start to consider paths outside of this subset.

3. Does anyone know the original source of this? (Douglas Comer, “Internetworking with TCP/IP”, Prentice-Hall, 1991, seems to quote it).

4. Described only in private correspondence, to date.

5. In the sense that the cause of the count-to-infinity problem was due to the source event of a routing update being indeterminable. Split-horizon cut-out the count-infinity problem for direct loops between neighbours, by preventing updates from propogating endlessly. The addition of a record-of-path (e.g. AS_PATH in BGP) solved count-infinity for multi-hop loops, by preventing updates from propogating along any but simple paths. The remaining problem is of updates travelling along multiple, parallel paths with different propogation times, preventing us from recognising which message received is the most recent – fixed with sequence numbers in some DV protocols. Each modification has pared down DV protocols slightly, restricting where valid information can flow and how it is recognised.

## BGP Multi-Exit-Discriminator

The following is an extract from a recently submitted update of mine to the Quagga documentation. Republished here under the licence of the Quagga documentation1:

The BGP MED (Multi_Exit_Discriminator) attribute has properties which can cause subtle convergence problems in BGP. These properties and problems have proven to be hard to understand, at least historically, and may still not be widely understood. The following attempts to collect together and present what is known about MED, to help operators and Quagga users in designing and configuring their networks.

## SRAMs’ wireless bicycle gear shifting: Protocol analysis

It’s pretty rare that I get to blog about both cycling and networking. Hard to see how those two topics share any common ground. That’s about to change as SRAM, the American bicycle component maker, appear to have a wireless gear shifting system in advanced testing.

The system was seen at the Tour of California earlier this year, though disguised with decoy wires to make it seem like a wired, electronic system. It was also spotted recently at the USA Pro Cycling Challenge, now without the decoy wires.

You may ask, why on earth would bicycles need radio-controlled gear shifting? Which is a good question. One benefit seems to be that the wireless shifting weighs slightly less than either a conventional mechanically actuated system, or even a wired electrical one. However, that benefit surely will be very marginal. The lack of cabling may make installation and maintenance easier too. Perhaps. Another possibility is that SRAM were behind Shimano and Campagnolo in delivering an electronically controlled shifting system, and SRAM decided that going wireless would help differentiate their product from the other two. Who knows!

As someone who specialises in network protocols, I’m curious to know how it might work in terms of the messages sent, logical layers of the radio protocol, and how the different components communicate to co-ordinate shifting.

There are obvious questions: Is it secure? Are there any limitations? Will it be reliable? The brief answers to which are, yes it should be secure; yes, there seem to be some limitations over traditional shifting systems; and probably it should be quite reliable, however when it isn’t there could be quite strange behaviour.

The rest of this blog will go into the details of how this system works at a network protocol level, as gleaned from SRAMs’ patent on electronic shifting. It will look at the ramifications of the design decisions made, and also how this could affect the operation of the system in extreme circumstances where messages are lost due to, e.g., radio noise.

#### The quick summary:

The SRAM system, at least as described in their patent, has a number of limitations that are a consequence of the network protocol:

• It can not support more than two front chain rings
• It can not support intentional simultaneous shifting of front and rear dérailleurs

While radio noise is unlikely to be a significant problem, outside of deliberate interference or some unusual locations, should the system be affected by continued noise it may manifest itself as:

• Missed rear shifts followed by an overshift of the rear
• Overshifting rear shift, after an apparently normal front shift
• Combined unintentional shift of both dérailleurs, on an intended front shift

The system should though be relatively robust against interference, due to its use of a low-bitrate network layer. It should be secure, due to use of strong encryption, so that only those with physical access to the bicycle (e.g. the rider) can control the gear system.

The system that goes into production may differ from the system described in SRAMs’ patent, and some or all of of these limitations may not apply to it. We won’t know till it is released.

Update 20150213: I’ve just noticed the very bottom of the SRAM patent mentions that the dérailleurs can transmit their current gear position. The patent mentions this might be used to allow the front mech to adjust its trim according to which gear the rear dérailleur is in. The patent does not mention this being used in the control protocol though, e.g. to have the dérailleurs ACK the shifter messages which could make the protocol more robust to missed messages. However, the hardware described in the patent is at least capable of this, and production systems could be enhanced this way. There may still be issues left in how the system appears to allow for shifting to be distributed over two shifters, I’d need to go back and re-check.

## Graph distance using matrix multiplication in Octave

If A is a matrix of positive edge weights between nodes in a graph (from 0 to infinity), then power-iteration through multiplication of that matrix in the {R,min,plus} semi-ring will converge on the distances of the shortest-paths in that graph. For reference (my own mostly), this blog entry gives the GNU Octave code for it and an example.

## Transport Scotland and its attitude to active transport spending

Transport Scotland had planned to spend £27m on low-carbon, sustainable and active transport, of which cycling is but a part, over  3 years (I presume roughly from 2013 to 2015?). Which makes the annual budget be £9m.

Now, if you were about to hand out £424k of that £9m – just about 5% of the entire annual budget for low-carbon, sustainable, and active transport – you’d do your homework on it, wouldn’t you? You might want a detailed proposal, with goals and metrics, perhaps? You’d want to see some detailed proposals for what the campaign might cover, no? You’d possibly need some back and forth to give feedback and work out the details, which’d generate minutes and emails, right? Surely?

Not if you’re Transport Scotland. No. Transport Scotland, it seems, will hand out £424k – again that’s pretty much 5% of the entire annual, active transport budget – based on nothing more than barely 2-pages of a proposal. A proposal with the scantest of details, and which couldn’t have taken more than 30 minutes to write up. That’s all it takes to get £424k from Transport Scotland, at least in terms of anything that leaves a record, apparently.

This is Transport Scotland’s official stance, made in response to my Freedom of Information request about the commissioning of the Nice Way Code, which they re-iterated to the Scottish Information Commissioner, after I appealed on grounds of incredulity.

I think this stinks of a somewhat cavalier, uncaring, and dismissive attitude both to public money, and to active transport policy in Scotland. I would expect Transport Scotland to be a lot more careful with 5% of their budget for active transport.

I think that’s fairly scandalous.

## The surprising centres of the Internet

A previous post, on “Barabási – Albert preferential attachment and the Internet“, gave a plot of  the Internet, as a sparsity map of its regular adjacency matrix, with the axes ordered by each ASes  eigencentrality:

Sparsity plot of the Internet adjacency matrix, for the UCLA IRL 2013-06 data-set, with the nodes ordered by their Eigencentrality ranking.

Each connection in the BGP AS graph is represented as a dot, connecting the AS on the one axis to the AS on the other. As the BGP AS graph is undirected, the plot ends up symmetric. The top-right corner of this plot shows that the most highly-ranked ASes are very densely interconnected. The distinct outline probably is indicative (characteristic?) of a tree-like hierarchy in the data.

Who are these top-ranked ASes though? Are they large, well-known telecommunications companies? The answer might be surprising.

## Code and error handling strategies and FSMs

We’ve had a discussion in the office on and off over a few days about error handling in programmes. Neither my office mate nor I are completely happy with what’s available out there. The state of the art in imperative/OO languages appears to be stuck at exceptions. The new Go programming language eschews them, essentially going back to having a named parameter for returning errors – which needs to be checked for.

The essential problem in programming is, of course, that code may follow a number of paths. Some of those paths are, at some level, the most common or desired paths (often, just 1 path). Other paths may be taken, e.g. because a resource was unavailable or because the requested action was inappropriate or unreasonable in some way. Of those other paths, some may still be reasonably commonly taken (e.g. a file not existing) and/or recoverable in some way. Other paths may be exceptional in some way such that it is unlikely  to recover from having taken them. These paths converge at function/procedure call nodes, where they have to be dispatched to direct programme flow onward to the appropriate further path.

My own sense is that path selection syntax/support in programming languages tends to divide all error/return handling into 1 of those 3 cases.

1. The primary path(s)
2. Recoverable error paths
3. Unrecoverable error paths

What then are the ways to handle these cases?  In languages such as C, errors generally must be handled after each function call, acting on them according to return value or a modified parameter. There are of course other ways, such as signals and exceptions.

There is also another way, which I see much less often, but which has certain benefits, which is to use Finite State Machines (FSMs) to give objects an idempotent error state. This allows control-flow in calling code to bunch together operations without having to worry about errors, as the error-handling can be done later, separately.

This blog tries to go over these methods, and their benefits and pitfalls. I’d also be really interested to hear about other error-handling strategies, especially better ones! 🙂

## Barabási – Albert preferential attachment and the Internet

The Barabási – Albert paper “Emergence of Scaling in Random Networks” helped popularise the preferential-attachment model of graphs, and its relevance to a number of real-world graphs. There’s a small, somewhat trivial tweak to that model that can be made which never the less changes its characteristics slightly, with the result possibly being more relevant to the BGP AS graph. GNU Octave and Java implementations are given for the original BA-model and the tweak. The tweak can also potentially improve performance of vectorised implementations, such as Octave/Matlab.

## Are IPv6 addresses too small?

IPv4 addresses are running out. IPv6 seems to offer the promise of a vast, effectively limitless address space. However, is that true? Could it be that even IPv6 addresses are too limited in size to allow potential future routing problems to be tackled? Maybe, if we want scalable, efficient routing. How could that be though?

IPv4 addresses are all but exhausted, IANA having run out in 2011, APNIC shortly thereafter and RIPE in 2012.The idea is that we switch to IPv6, overcoming the obstacles to transition that have now been engineered into it (e.g. usually a completely unrelated address space from IPv4, so significantly hampering IPv6 from being able to make transparent use of existing IPv4 forwarding capabilities).

IPv6 addresses are 128 bits in length, compared to the 32 bits of IPv4. So in theory IPv6 provides a massive address space compared to IPv4, of 2128 addresses. Exactly how massive? Well, some have calculated that that is enough to assign an IPv6 address to every atom on the surface of the earth, and still have addresses to spare. If every IPv6 address was assigned to a node with just 512 bytes of memory, then some have calculated the system in total would require more energy than it would take to boil all the oceans on earth – presuming that system exists only as pure energy! So by those estimates, the IPv6 address space seems like it is so massive that we couldn’t possibly ever come close to using all of it, least not while we’re confined to earth.

Those estimates however assume the addresses are used as pure numbers with perfect efficiency. In reality this will not be the case. To make computer networking efficient we need to encode information into the structure of the addresses, particularly so when those networks become large. The lower, least-significant 64 bits of an IPv6 address are effectively reserved for use on local networks, to allow for auto-configuration of IPv6 addresses on ethernets. Changing this would be quite difficult. So IPv6 actuall;y offers only 64 bits to distinguish between different networks, and then a further 64 bits to distinguish between hosts on a given network. Further, of the upper, most-significant 64 bits (i.e. the “network prefix” portion of an IPv6 address), already at least the first, most-significant 3 bits are taken up to distinguish different blocks of IPv6 space. This means the global networking portion of the IPv6 address space only has about 61 bits available to it.

IPv4 and IPv6 BGP FIB sizes: # of prefixes on a logarithmic scale. Courtesy of Geoff Huston.

Why could this be a problem? Isn’t 61 bits more than enough to distinguish between networks globally (i.e., as on the Internet)? Well, that’s an interesting question. The global Internet routing tables have continued to grow at super-linear pace, as can be seen in the plot above (courtesy of Geoff Huston), with at least quadratic growth and possibly exponential modes of growth. IPv4 routing tables have continued to grow despite the exhaustion of the address-space, by means of de-aggregation. IPv6 routing tables are also growing fast, though from a much smaller base. There have been concerns over the years that this growth might one day become a problem, that router memory might not be able to cope with it, and so that could bottleneck Internet growth.

At present with BGP, every distinct network publishes its prefix globally and effectively every other network must keep a note of that. Hence, the Internet routing tables grow in direct proportion to the number of distinct networks in number of entries. As each entry uses an amount of memory in logarithmic proportion to the size of the network, the routing table growth is slightly faster than the size of the network in terms of memory. So as the Internet grows at a rapid pace, the amount of memory needed at each router grows even more rapidly.

If we wanted to tackle this routing table growth (and it’s not clear we need to), we would have to re-organise addressing and routing slightly. Schemes where routing table memory growth happens more slowly than growth of the network are possible. Hierarchical routing schemes existed before, and BGP even had support for hierarchical routing through CIDR and aggregation. However, as hierarchical routing could lead to very inefficient routing (packets taking very roundabout paths, compared to the shortest path), other than in carefully coördinated networks, it never gained traction for Internet routing, and support for aggregation has now been deprecated from BGP. Other schemes involving tunnelling and encapsulation are possible, but they suffer from similar, intrinsic inefficiencies.

A more promising scheme is Cowen Landmark Routing. This scheme guarantees both sub-linear growth in routing table sizes, relative to growth of the network and efficient routing. In this scheme, networks are associated with landmarks, and packet addresses contain the address of the landmark and the destination node. One problem with turning this into a practical routing system for the Internet is the addressing. How do you fit 2 network node identifiers into an IPv6 address? Currently organisations running networks are identified by AS numbers, which are now 32-bit. However, 2×32 bit = 64 bits, and there are no more than 61 bits available at present in IPv6 for inter-networking.

Generally, any routing scheme that seeks to address routing table growth is near certain to want to impose some kind of structure on the addressing in similar ways. The 64 bits of IPv6 doesn’t give much room to manoeuvre. Longer addresses or, perhaps better, flexible-length addresses, might have been better.

Are IPv6 addresses too short? Quite possibly, if we ever wish to try tackle routing table growth while retaining efficient routing.

## UK Cycling Culture

There’s things I find fascinatingly perverse about cycling culture in the UK, having experienced cycling in the Netherlands as a child. By cycling culture I mean both the culture and norms amongst those who regularly cycle in the UK (and the various sub-cultures), as well as in the general culture.

It struck me most the other week, when I cycled out of Glasgow early on the morning of the Pedal For Scotland event, out along the road leading back to Glasgow Green, where that event starts. There was a little more traffic than normal for a Sunday morning, because of the cars heading to PFS. It hit me that the UK is a country where, every weekend, people drive to cycle. They drive out to parks, country lanes, and sportives. They drive along roads they dislike cycling on, out to those roads to cycle on that are least quiet, if not free of cars. They drive out to sportives on open roads,  looking for a feeling of safety in numbers.

Clearly there is a great demand for cycling. Those ferrying their bicycles on their cars are just the tip of the iceberg – there are surely many more potential, casual cyclists who don’t care enough to invest in cycle racks, etc. Yet, the cycling culture, nay, the entire transport culture is so perverse here.

• The letter of the law is that children must cycle in with 50 km/h (30 mph), even 65 km/h (40 mph), motor traffic on many roads, because there is no cycle-path provision and it is illegal for them to cycle on the footpath. Only children under the age of criminal responsibility can safely get away with this, as they will not be prosecuted. However, legally, little children are supposed to cycle in amongst fast motor traffic on such roads!
• Even adults can feel uncomfortable cycling on these roads, and will cycle on the footpaths. The Scottish governments’ answer to this problem of a clear lack of safe cycling infrastructure? Run an ad campaign to tell such adults to grow up.
• There is never any room for dedicated cycle paths, even though most towns and cities are criss-crossed by multi-lane roads.
• At best, the cyclist gets either a useless little lane painted on the road that does nothing to protect them from motor traffic, but does bring them through the dangerous door-zone of parked cars; or the footpath gets designated “shared use”, bringing the cyclists into conflict with pedestrians instead, and having to cede priority at every little junction and property exit. The more experienced cyclists often ignore such useless infrastructure, with good reason.
• A sub-culture of cyclists are derisory of any attempt to argue for quality dedicated cycling infrastructure.
• A sub-culture of cyclists are afflicted by Stockholm Syndrome and believe in mandatory helmets for cyclists, even mandatory HiViz.
• Organisers of low-risk cycling events often impose a requirement to wear helmets on participants, typically by justifying it as a requirement of their insurer (even when false, e.g. British Cycling non-race event insurance does not require helmet use), thus condoning the ongoing “dangerising” and de-normalisation of cycling in the UK.
• Motorists who injure or kill cyclists are given derisory sentences, if they are even prosecuted and found guilty. Even motorists who deliberately knock down cyclists do not go to jail. If you want to murder someone in this country, use a car and claim “the sun was in my eyes” or “a bee came into my car”.

I could go on and on. It’s almost enough to make me despair.