BGP Path Hunting

(originally published Aug 25, 2008 on blogs.sun.com – republishing here as Oracle nuked it all, but it’s been referenced in places).

Path-hunting is the tendency of BGP, when a path is withdrawn by a remote AS, to “hunt” from one path to another, before finally converging. The problem has been described elsewhere, particularly by Geoff Huston (e.g. in this ISOC column on BGP dampening). Manav Bhatia also has a nice, graphic explanation of BGP path hunting. Tony Li and Geoff Huston additionally have authored a very interesting draft on the problem, draft-li-bgp-stability – which analogises BGP convergence with wave-front propogation, where the “front” of convergence bends with differences in path-propogation times, and expands out where ASes connect with multiple others.

This problem is very similar to[5], if not a special-case of, the well-known “count to infinity” problem inherent to the Bellman-Ford distance-vector protocols. With infinity being constrained, by BGPs built-in split-horizon behaviour, to the number of simple paths in the topology.

Path Hunting overview

In short: given some converged subset of the BGP topology, with A and B connected by a set of one or more distinct paths {1, …, n}, with the time taken for a message to propogate down path x of t[x] so that the set of paths has a directly corresponding set of propogation times T = {t[0], …, t[n]}. Then for any announcement which passes through A to B, BGP will reach a “useful convergence”[1] by or within a time of:

• For an UPDATE announcing a new/updated path: minimum(T)
• For a withdrawal: maximum(T)

The latter convergence time being due to the ‘hunting’ behaviour of BGP, and indeed consistent with the well-known “Good news travels quickly, bad news travels slowly”[3] behaviour of distance-vector protocols.

E.g. imagine A announces some prefix to C, and imagine (for simplicity) that B does not re-announce paths to E or H, with simple paths between A and B of {ACB, ACDEB, ACDFGHB, ACDFGEB, ACDEGHB}. I.e.:

```-A--C----------B
\        / \
D------E   H
\      \ /
F------G
```

So after a time of minimum(T), B will have received at least one UPDATE corresponding to a valid path and would be able to forward packets toward A for the prefix. Though the system is not converged, any further UPDATEs B receives can only cause it to select a better path, so the system is at least “usefully-converged” – it can route packets.

After a time of maximum(T), the state at each speakers will be fully converged, and B may then have the following paths in its RIB (arbitrarily making assumptions about preferences speakers may have), for whatever prefix it is, from among which B may pick its preferred path:

• From C, Path: CA
• From E, Path: EDCA
• From H, Path: HGFDCA

Note that B does not know about the ACDFGEB or ACDEGHB paths, however if the DE edge were withdrawn, E potentially could switch to G (and issue an update to B when it does so) and similarly with the FG edge and latter path. B can pick anyone of the above as its preferred path. I.e. though these paths are hidden from B, they can still come into play.

Imagine A now withdraws the prefix from C – so the connectivity to the prefix is now lost as far as the rest of the graph goes. After a time of minimum(T) B will receive at least one message (that might be a withdrawal from C, but it could just as well be an UPDATE from E as E hunts its preferred path over to G). Imagine it happens to come from the peer which it had selected as best, and results in B picking a new best path via another peer (e.g. because its a withdrawal, or an UPDATE with less favourable path-attributes). Then the next update arrives from the peer newly preferred, and so B picks again, and so on. So B’s best-path may well change between the following best-paths:

1. From E, Path EDCA
2. From H, Path HGFDCA
3. From H, Path HGEDCA
4. From E, Path EGFDCA

I.e. before B can converge on “there is no path”, it potentially will first “hunt” through various paths that are already dead[2].

Yet, B did receive a message which was triggered by As withdrawal – several in fact! If only B could use the first message to see that, as it was A which withdrew the prefix, that therefore all paths via A must be dead, then it could short-circuit the whole hunting process!

Possible Solutions

Huston and Li propose various possible changes to BGP to mitigate hunting, such as removing the MRAI timer, band-stop filtering of updates, path-exploration damping, etc, all of which are implementable without change to the BGP protocol itself. Other proposals include Brian Dickson’s second-best-backup proposal which seeks to speed up convergence by providing visibility of additional paths besides the best-path (such as the ACDFGEB path which is initially hidden to B in the example above), and also to provide a fast-path for withdrawals (this proposal requires changes to BGP). None of these proposals actually solve path-hunting, they instead try to either filter it out and/or speed it up.

Hunting can eliminated in BGP if it could be extended with mechanisms to allow BGP to recognise nodal changes in state, across paths. Other DV protocols (DSDV, AODV) solve similar problems by numbering messages/routes with a sequence-number. This sequenced-DV approach is problematic with BGP as it tends to require either that routers in an AS somehow maintain a shared counter, or that BGP must describe a more detailed router-router topology (rather than todays abstract AS-AS topology).

Some possible mechanisms that eliminate hunting:

• BGP-RCNA possible means to allow for convergence in O(minimum(T)) for all kinds of state-changes is described by Pei, Azuma, Massey and Zhang in “BGP-RCN: improving BGP convergence through root cause notification”. Speakers include an additional attribute of information with updates/withdrawals, called the “Root Cause Notification” consisting of a list of (ASN,sequence number) tuples. This allows a receiving speaker to compare announcements received via different paths and determine relative recency, so allowing it to always act on the most recent information.

The sequence number dependency is a significant barrier to deployment.

• AS Poisoning[4]

This is essentially a simplified version of BGP-RCN, that does away with the problematic sequence number. When withdrawing a prefix, a speaker can include information that “poisons” nodes (or even edges). E.g. if A sends some kind of “Poison: A” attribute with its withdrawal, and if that attribute will propogate along with the withdrawal, then B can conclude from the first withdrawal it receives that all paths that go via A are defunct. BGP can now converge on a prefix-withdrawn state in minimum(T). The downside being that convergence to the prefix-announced state instead occurs in maximum(T) – though it is possible that BGP can be “usefully converged” prior to that, which would be a more useful state than path-hunting.

So we can “fix” path-hunting, and maximise withdrawal convergence without introducing any co-dependent state with only mild changes to BGP, though at some expense to announcement convergence.

A further observation to be made is that IGP routing has switched from distance-vector to link-state (or “map-distribution”) because of the same convergence limitations. Further, there are proposals for BGP security extensions which overlay a link-stateish topology onto BGP, and the BGP-RCN approach is not far from a link-state protocol either. So it is perhaps not completely unreasonable to imagine whether some kind of topological-state mechanism could be bolted onto BGP, that could be used to augment the path-selection process.

So…

Solutions to the hunting problem either are imperfect local-hueristics (which may even make convergence worse in cases where they fail), or protocol extensions with fairly significant trade-offs (either in complexity or adverse side-effects). It seems unlikely there’s any magic bullet, given the length of experience we have with Bellman-Ford/DV protocols.

Corrections and comments would be appreciated.

1. “Useful convergence” being where speakers all have converged on some particular state, OR are in a certain semi-converged state such that any further messages still propogating can only cause speakers to transition their routes from an existing, valid path to another valid path (this can not be the case if there are any withdrawal messages still to propogate). I.e. forwarding remains loop and blackhole free.

2. This is a problem as it means that if, outside of the subset we’re consider, B has other, less-preferred connections, then B will have to wait max(T) before it can decide there are no valid paths in this subset and start to consider paths outside of this subset.

3. Does anyone know the original source of this? (Douglas Comer, “Internetworking with TCP/IP”, Prentice-Hall, 1991, seems to quote it).

4. Described only in private correspondence, to date.

5. In the sense that the cause of the count-to-infinity problem was due to the source event of a routing update being indeterminable. Split-horizon cut-out the count-infinity problem for direct loops between neighbours, by preventing updates from propogating endlessly. The addition of a record-of-path (e.g. AS_PATH in BGP) solved count-infinity for multi-hop loops, by preventing updates from propogating along any but simple paths. The remaining problem is of updates travelling along multiple, parallel paths with different propogation times, preventing us from recognising which message received is the most recent – fixed with sequence numbers in some DV protocols. Each modification has pared down DV protocols slightly, restricting where valid information can flow and how it is recognised.

BGP Multi-Exit-Discriminator

The following is an extract from a recently submitted update of mine to the Quagga documentation. Republished here under the licence of the Quagga documentation1:

The BGP MED (Multi_Exit_Discriminator) attribute has properties which can cause subtle convergence problems in BGP. These properties and problems have proven to be hard to understand, at least historically, and may still not be widely understood. The following attempts to collect together and present what is known about MED, to help operators and Quagga users in designing and configuring their networks.

SRAMs’ wireless bicycle gear shifting: Protocol analysis

It’s pretty rare that I get to blog about both cycling and networking. Hard to see how those two topics share any common ground. That’s about to change as SRAM, the American bicycle component maker, appear to have a wireless gear shifting system in advanced testing.

The system was seen at the Tour of California earlier this year, though disguised with decoy wires to make it seem like a wired, electronic system. It was also spotted recently at the USA Pro Cycling Challenge, now without the decoy wires.

You may ask, why on earth would bicycles need radio-controlled gear shifting? Which is a good question. One benefit seems to be that the wireless shifting weighs slightly less than either a conventional mechanically actuated system, or even a wired electrical one. However, that benefit surely will be very marginal. The lack of cabling may make installation and maintenance easier too. Perhaps. Another possibility is that SRAM were behind Shimano and Campagnolo in delivering an electronically controlled shifting system, and SRAM decided that going wireless would help differentiate their product from the other two. Who knows!

As someone who specialises in network protocols, I’m curious to know how it might work in terms of the messages sent, logical layers of the radio protocol, and how the different components communicate to co-ordinate shifting.

There are obvious questions: Is it secure? Are there any limitations? Will it be reliable? The brief answers to which are, yes it should be secure; yes, there seem to be some limitations over traditional shifting systems; and probably it should be quite reliable, however when it isn’t there could be quite strange behaviour.

The rest of this blog will go into the details of how this system works at a network protocol level, as gleaned from SRAMs’ patent on electronic shifting. It will look at the ramifications of the design decisions made, and also how this could affect the operation of the system in extreme circumstances where messages are lost due to, e.g., radio noise.

The quick summary:

The SRAM system, at least as described in their patent, has a number of limitations that are a consequence of the network protocol:

• It can not support more than two front chain rings
• It can not support intentional simultaneous shifting of front and rear dérailleurs

While radio noise is unlikely to be a significant problem, outside of deliberate interference or some unusual locations, should the system be affected by continued noise it may manifest itself as:

• Missed rear shifts followed by an overshift of the rear
• Overshifting rear shift, after an apparently normal front shift
• Combined unintentional shift of both dérailleurs, on an intended front shift

The system should though be relatively robust against interference, due to its use of a low-bitrate network layer. It should be secure, due to use of strong encryption, so that only those with physical access to the bicycle (e.g. the rider) can control the gear system.

The system that goes into production may differ from the system described in SRAMs’ patent, and some or all of of these limitations may not apply to it. We won’t know till it is released.

Update 20150213: I’ve just noticed the very bottom of the SRAM patent mentions that the dérailleurs can transmit their current gear position. The patent mentions this might be used to allow the front mech to adjust its trim according to which gear the rear dérailleur is in. The patent does not mention this being used in the control protocol though, e.g. to have the dérailleurs ACK the shifter messages which could make the protocol more robust to missed messages. However, the hardware described in the patent is at least capable of this, and production systems could be enhanced this way. There may still be issues left in how the system appears to allow for shifting to be distributed over two shifters, I’d need to go back and re-check.

The surprising centres of the Internet

A previous post, on “Barabási – Albert preferential attachment and the Internet“, gave a plot of  the Internet, as a sparsity map of its regular adjacency matrix, with the axes ordered by each ASes  eigencentrality:

Sparsity plot of the Internet adjacency matrix, for the UCLA IRL 2013-06 data-set, with the nodes ordered by their Eigencentrality ranking.

Each connection in the BGP AS graph is represented as a dot, connecting the AS on the one axis to the AS on the other. As the BGP AS graph is undirected, the plot ends up symmetric. The top-right corner of this plot shows that the most highly-ranked ASes are very densely interconnected. The distinct outline probably is indicative (characteristic?) of a tree-like hierarchy in the data.

Who are these top-ranked ASes though? Are they large, well-known telecommunications companies? The answer might be surprising.

Barabási – Albert preferential attachment and the Internet

The Barabási – Albert paper “Emergence of Scaling in Random Networks” helped popularise the preferential-attachment model of graphs, and its relevance to a number of real-world graphs. There’s a small, somewhat trivial tweak to that model that can be made which never the less changes its characteristics slightly, with the result possibly being more relevant to the BGP AS graph. GNU Octave and Java implementations are given for the original BA-model and the tweak. The tweak can also potentially improve performance of vectorised implementations, such as Octave/Matlab.

IPv4 addresses are running out. IPv6 seems to offer the promise of a vast, effectively limitless address space. However, is that true? Could it be that even IPv6 addresses are too limited in size to allow potential future routing problems to be tackled? Maybe, if we want scalable, efficient routing. How could that be though?

IPv4 addresses are all but exhausted, IANA having run out in 2011, APNIC shortly thereafter and RIPE in 2012.The idea is that we switch to IPv6, overcoming the obstacles to transition that have now been engineered into it (e.g. usually a completely unrelated address space from IPv4, so significantly hampering IPv6 from being able to make transparent use of existing IPv4 forwarding capabilities).

IPv6 addresses are 128 bits in length, compared to the 32 bits of IPv4. So in theory IPv6 provides a massive address space compared to IPv4, of 2128 addresses. Exactly how massive? Well, some have calculated that that is enough to assign an IPv6 address to every atom on the surface of the earth, and still have addresses to spare. If every IPv6 address was assigned to a node with just 512 bytes of memory, then some have calculated the system in total would require more energy than it would take to boil all the oceans on earth – presuming that system exists only as pure energy! So by those estimates, the IPv6 address space seems like it is so massive that we couldn’t possibly ever come close to using all of it, least not while we’re confined to earth.

Those estimates however assume the addresses are used as pure numbers with perfect efficiency. In reality this will not be the case. To make computer networking efficient we need to encode information into the structure of the addresses, particularly so when those networks become large. The lower, least-significant 64 bits of an IPv6 address are effectively reserved for use on local networks, to allow for auto-configuration of IPv6 addresses on ethernets. Changing this would be quite difficult. So IPv6 actuall;y offers only 64 bits to distinguish between different networks, and then a further 64 bits to distinguish between hosts on a given network. Further, of the upper, most-significant 64 bits (i.e. the “network prefix” portion of an IPv6 address), already at least the first, most-significant 3 bits are taken up to distinguish different blocks of IPv6 space. This means the global networking portion of the IPv6 address space only has about 61 bits available to it.

IPv4 and IPv6 BGP FIB sizes: # of prefixes on a logarithmic scale. Courtesy of Geoff Huston.

Why could this be a problem? Isn’t 61 bits more than enough to distinguish between networks globally (i.e., as on the Internet)? Well, that’s an interesting question. The global Internet routing tables have continued to grow at super-linear pace, as can be seen in the plot above (courtesy of Geoff Huston), with at least quadratic growth and possibly exponential modes of growth. IPv4 routing tables have continued to grow despite the exhaustion of the address-space, by means of de-aggregation. IPv6 routing tables are also growing fast, though from a much smaller base. There have been concerns over the years that this growth might one day become a problem, that router memory might not be able to cope with it, and so that could bottleneck Internet growth.

At present with BGP, every distinct network publishes its prefix globally and effectively every other network must keep a note of that. Hence, the Internet routing tables grow in direct proportion to the number of distinct networks in number of entries. As each entry uses an amount of memory in logarithmic proportion to the size of the network, the routing table growth is slightly faster than the size of the network in terms of memory. So as the Internet grows at a rapid pace, the amount of memory needed at each router grows even more rapidly.

If we wanted to tackle this routing table growth (and it’s not clear we need to), we would have to re-organise addressing and routing slightly. Schemes where routing table memory growth happens more slowly than growth of the network are possible. Hierarchical routing schemes existed before, and BGP even had support for hierarchical routing through CIDR and aggregation. However, as hierarchical routing could lead to very inefficient routing (packets taking very roundabout paths, compared to the shortest path), other than in carefully coördinated networks, it never gained traction for Internet routing, and support for aggregation has now been deprecated from BGP. Other schemes involving tunnelling and encapsulation are possible, but they suffer from similar, intrinsic inefficiencies.

A more promising scheme is Cowen Landmark Routing. This scheme guarantees both sub-linear growth in routing table sizes, relative to growth of the network and efficient routing. In this scheme, networks are associated with landmarks, and packet addresses contain the address of the landmark and the destination node. One problem with turning this into a practical routing system for the Internet is the addressing. How do you fit 2 network node identifiers into an IPv6 address? Currently organisations running networks are identified by AS numbers, which are now 32-bit. However, 2×32 bit = 64 bits, and there are no more than 61 bits available at present in IPv6 for inter-networking.

Generally, any routing scheme that seeks to address routing table growth is near certain to want to impose some kind of structure on the addressing in similar ways. The 64 bits of IPv6 doesn’t give much room to manoeuvre. Longer addresses or, perhaps better, flexible-length addresses, might have been better.

Are IPv6 addresses too short? Quite possibly, if we ever wish to try tackle routing table growth while retaining efficient routing.

Cerf and Kahn on why you want to keep IP fragmentation

In “A Protocol for Packet Network Intercommunication“, Vint Cerf and Bob Kahn
explain the basic, core design decisions in TCP/IP, which they created. They describe the end-to-end principle. What fascinates me most though is their explanation of why they incorporated fragmentation into IP:

We believe the long range growth and development of internetwork communication would be seriously inhibited by specifying how much larger than the minimum a packet size can be, for the following reasons.

1. If a maximum permitted packet size is specified then it becomes impossible to completely isolate the internal packet size parameters of one network from the internal packet size parameters of all other networks.
2. It would be very difficult to increase the maximum permitted packet size in response to new technology (e.g. large memory systems, higher data rate communication facilities, etc.) since this would require the agreement and then implementation by all participating networks.
3. Associative addressing and packet encryption may require the size of a particular packet to expand during transit for incorporation of new information.

Fragmentation generally is undesirable if it can be avoided, as it has a performance cost. The fragmenting router may do so on a slow-path, for example; and re-assembly at the end-host may introduce delay. As a consequence, end hosts have for a long while generally performed path-MTU-discovery (PMTUD) to discover the right overall MTU to a destination, thus allowing them to generate IP packets of just the right size (if the upper-level protocol doesn’t support some kind of segmentation, like TCP, this may still require it to generate IP fragments) and so set the “Don’t Fragment” bit on all packets and generally avoid intermediary fragmentation.  Unfortunately however PMTUD relies on ICMP messages which are sent out-of-band, and unfortunately as the internet became bigger, more and more less-than-clueful people became involved in the design and administration of the equipment needed to route IP packets. Routers started to either ignore over-size packets and (even more commonly) firewalls started to stupidly filter out nearly all ICMP – including the important “Destination Unreachable: Fragmentation Needed” ICMP message needed for PMTUD. As a consequence, end-host path-MTU discovery can be fragile. When it fails to work, the end-result is a “Path MTU blackhole”: packets get dropped for being too big at a router while the ICMP messages sent back to the host get dropped (usually elsewhere), meaning it never learns to drop its packet sizes. Where with IP fragmentation communication may be slow, but with PMTU blackholing it becomes impossible.

As a consequence of this, some upper-level applications protocols actually implement their own blackhole detection, on top of any lower-layer PMTU/segmentation support. An example being EDNS0, which specifies that EDNS0 implementations must take path-MTU into account (above the transport layer!).

So now the internet is crippled by an effective 1500 MTU. Though our equipment generally is capable of sending much larger datagrams, we have collectively failed to heed Cerf & Kahn’s wise words. The internet can not use the handy tool of encapsulation to encrypt packets, or to reroute them to mobile users. Possibly the worst aspect is that IPv6 completely removed fragmentation support. While there’s a good argument that end-end level packet resizing may be more ideal than intermediary fragmentation, as IPv6 still relies on out-of-band signalling of over-size packets, without addressing that mechanism’s fragility problem, it likely means IPv6 has cast the MTU-mess into stone for the next generation of inter-networking.

Updated: Some clarifications. Added consequence of how PMTU breaks due to ICMP filtering. Added how ULPs now have to work around these transport layer failings. Added why fragmentation was removed from IPv6, and word-smithed the conclusion a bit.

Why don’t we just reclaim unused IPv4 addresses?

With IANA recently allocating its last 2 /8s from the IPv4 free pool to APNIC, and about to announce automatic allocation of each of the last 5 /8s to the RIRs, the end of IPv4 is truly nigh. The RIRs’ pools will run out over the course of the next year, and that’ll be it – no more v4 addresses left.

However, why can’t we just reclaim unused and reserved addresses? Surely there’s a few legacy /8s assigned to organisations that could be clawed back? Couldn’t we check whether assignments are being used and reclaim ones that aren’t? What about the large, former-class-E space? Couldn’t one or more of those buy a good bit of time for IPv4?

This post examines these questions and largely concludes “no”.  The issues with IPv4 depletion are simply fundamental to its (fixed) size. The IPv4 address space simply is too small to accommodate the growth of the current internet, and reclamation is unlikely to buy any useful time.

NAT could buy some amount of time, but even the NAT space seems like it may be too small for developed-world levels of technology to be deployed globally.

If you were to “ping” or otherwise probe all the assigned address space, you might find significant chunks of it are not reachable from the public, global internet. E.g. the US military has large assignments which are not advertised, as do other organisations. So why don’t we reclaim those assignments, let the organisations switch to NAT, and make them available?

Well, just because address-space is not globally reachable does not mean it is not being used for inter-networking. The criteria for IPv4 assignments has always been a need for globally unique address-space, not a need for global reachability. Many organisations have need for private inter-networking with other organisations (financial organisations notably), which is hard to do in a scalable way with NAT. So such use is justified, and we can’t really reclaim it.

Former Class-E Space

What about the 16 /8s that comprise the former Class-E space? Even ignoring 255/8, which likely will never be useable, that’s still a fairly big chunk of address space – more than 5% of the total 32bit address space. Why not re-use that, surely that would make a big difference?

Unfortunately there are major challenges to using this address space. It has long been classed as “Reserved – for future use”, and remains so. A lot of networking code that deals with forwarding or routing packets checks that addresses are not reserved. This means if anyone were assigned addresses from this range they would find they would not be able to communicate with much of the rest of the internet. Even if most of the internet upgraded their software to fix this, the poor user would still find some sites were unreachable. The sites with the old software might not ever notice there was a problem, and might not even have an economic incentive to upgrade (“you want me risk causing problems for my network with an upgrade, to fix problems only you and a few others have?”)!

If we are forced to assign from the former-Class-E space, it will be a sign that the IPv6 rollout is perhaps in serious trouble.

The core of the problem: The size of the IPv4 address space

The nub of the problem is that IPv4 is simply too small.

IPv4 has 32 bit addresses, giving 4.29G addresses, roughly divided into 256 pieces, called /8s, for top-level assignments. Of this space, 18 /8s are reserved in their entirety for special purposes  and will never be useful for general assignment; 1 /8 is reserved  for private-networking; 16 /8s are tied up in the former Class-E space and likely not useful, as above. There are other reservations across the address space, but in smaller quantities that we can ignore in impact here. That still means 221 /8s = 3.71G address – 86% of the total address space – is available for global assignment (and the private 10/8 takes some pressure off that global space). This is equivalent to a 31.788 bit address space.

Now, average daily assignment rates have been running at above 10 /8s per year, for 2010, and approached 15 /8s towards the end. This means any reclamation effort has to recover at least 15 /8s  per year just to break even on 2010’s growth. That’s 5.9% of the total IPv4 address space, or 6.8% of the assignable address space. Is it feasible to be able to reclaim that much address space? Even if there were low-hanging fruit to cover the first year of new demand, what about there-after? Worse, demand for address space has been growing supra-linearly, particularly in Asia and Latin America. So it seems highly unlikely that any reclamation project can buy anything more than a years worth of time (and reclamation itself takes time).

Seen another way, there are approaching 7G people in the world – 6.9G in 2010. Giving 1 address for every 1.86 people (in 2010). Even if we reclaimed old-Class-E, IPv4 still only provides 3.98G = 231.89 addresses, or 1 address for every 1.73 people.

Worse we can not use the address space with perfect efficiency. Because of the need for hierarchical assignment, some space will be wasted – effectively some bits of addresses are lost to overheads such as routing. Empirical data suggests a HD-ratio of 0.86 is the best achievable assignment density. This means that with the 3.98G assignable addresses with class-E reclaim, only 3.98G0.86 = 2(31.89*0.86) = 2(27.43) = 181M will actually be useable as end-host addresses, giving 1 IPv4 address for every 38 people (in 2010)!

Yet, people in the developed world today surely use multiple IP addresses per person. They have personal computers at work and home, eBook readers, mobile phones, etc. All of which depend on numerous servers which require further IP addresses. The people in the developing world surely aspire to similar standards of technology. If we assume the density of IP/person is heavily skewed towards the reported 20% of the world population who manage to earn more than \$10/day, then that means that today each IP address is being used by around 7 people. If the skew is heavily biased towards just 10% of the world population, the figure would be around 4 people per address. It’d be interesting to get precise figures for this.

Can NAT save us?

Many organisations are likely to try buy time with NAT. But how much? NAT gives us only 16 extra bits. Assuming they were free bits, that would give us a 2(27.43+16) bit address space = 11,850G addresses. On the face of it seems like this would do for quite a while. It’d allow 1 connection at a time between every host, which is still sufficient to allow all processes to communicate with each other if a higher-level multiplexer protocol is agreed on (it’d be HTTP based, given current trends).

Unfortunately though, this won’t work with TCP, as it is. When TCP closes a connection it will go into a TIME_WAIT state, where it will not allow connections from the same (src,dst) 4-tuple. TCP remains in this state for 1 or 2 minutes on most implementations. Which means you need at least 60 ports if you want to be able to open connections to same host on average 1/s (you probably don’t want to generally, but think of bursts). For every 0.5s, you need 120 ports.

In practical terms, this means probably at least 8 bits of the port-space need to be reserved for TCP on each host. Leaving 8 bits to extend the address space with. This gives 2(27.43+8) = 46G addresses = 6.7 addresses/2010-person (NB: addresses/person instead of the person/address used above) = 0.15 people/address.

This though assumes the HD-ratio assignment-density model applies only over the scale of the IP addresses, and that the borrowed port-space will be allocated with near perfect efficiency. If that were not to be the case, if instead the extra port space also were subject to the HD-ratio model, then the numbers become instead (2(31.89+8))0.86 = 2(31.89+8)*0.86 = 21.3G addresses & 3 addresses/2010-person = 0.32 people/address.

Is that enough? Certainly for a time. It doesn’t seem a comfortable margin though, particularly as it may require some further changes to end-hosts to be truly workable.

Errata

This blog post almost certainly has mistakes, possibly significant. Those noted so far, and any other significant edits:

• Missing “not”: those assigned old-class-E addresses would not be able to communicate with much of rest of internet
• Added people/address numbers in last section, for ease of comparison with previous figures.

Network Activity Visualisations

I’m tinkering on a network protocol simulator at the moment. For debug purposes, it can provide some basic visualisation of what’s going on, e.g. highlighting links which are transmitting messages. These visualisations can sometimes be mesmerising, I find. To avoid turning the front page of this blog into a blinking mess, I’ve put them on their own page.

Making the Internet Scale Through NAT

(just want to dump this here for future reference)

It’s widely acknowledged that the internet has a scaling problem ahead of it, and an even more imminent addressing problem. The core of internet routing is completely exposed to peripheral growth in mobile/multihomed leaf networks. Various groups are looking at solutions. The IRTF’s RRG has been discussing various solutions. The IETF have a LISP WG developing one particular solution, which is quite advanced.

The essential, common idea is to split an internet address into 2 distinct part, the “locator” and the “ID” of the actual host. The core routing fabric of the internet then needs only to be concerned with routing to “locator” addresses. Figuring out how to map onward to the (presumably far more numerous and/or less aggregatable) “ID” of the specific end-host to deliver to is then a question that need only concern a boundary layer between the core of the internet and the end-hosts. There are a whole bunch of details here (including the thorny question of what exactly the “ID” is in terms of scope and semantics) which we’ll try skip over as much as possible for fear of getting bogged down in them. We will note that some proposals, in order to be as transparent and invisible to end-host transport protocols as possible, use a “map and encap” approach – effectively tunneling packets over the core of the internet. Some other proposals use a NAT-like approach, like Six-One or GSE, with routers translating from inside to outside. Most proposals come with reasonable amounts of state, some proposals appear to have quite complex architectures and/or control planes. E.g. LISP in particular seems quite complex, relative to the “dumb” internet architecture we’re used to, as seems to try to solve every possible IP multi-homing and mobile-IP problem known to man. Some proposals also rely on IPv6 deployment.

Somewhat related proposals are shim6, which adds multi-homing capable “shim” layer in between IP and transport protocols like TCP (or “Upper Layer Protocols” / ULPs), and MultiPath-TCP (MPTCP), which aims to add multi-pathing extensions to TCP. Shim6 adds a small, additional state machine to whatever state machines the ULPs require, and is not backwards compatible with non-shimmed hosts (which isn’t a great problem of itself). Shim6 requires IPv6. MPTCP is still in a very early stage, and does not appear to have described any possible proposals yet.

Not too infrequently well-engineered solutions to some problem may not quite have a high enough unilateral benefit/cost ratio  for that solution to enjoy widespread adoption. Then cheap, quick hacks that address the immediate problem without causing too many problem can win out. The clear example in the IP world being NAT. It is therefore interesting to consider what will happen if none of the solutions currently being engineer, mentioned above, gain traction.

E.g. there are quite reasonable solutions which are IPv6 specific, and IPv6 is not exactly setting the world alight. There are other proposals which are v4/v6 agnostic, but require a significant amount of bilateral deployment to be useful, e.g. between ISP and customers, and do not have any obvious immediate advantages to most parties. Etc. So if, for whatever reasons, cost/benefit ratio means none of these solutions are rolled out, and the internet stays effectively v4-only for long enough, then the following will happen:

1. Use of NAT will become ever more common, as pressure on IPv4 addresses increases
2. As the use of NAT increases, the pressure on the transport layer port space (now press-ganged into service as an additional cross-host flow identifier) will increase, causing noticeable problems to end-user applications.
3. As NAT flow-ID resources become ever more precious, so applications will become more careful to multiplex application protocols over available connections.
4. However, with increased internet growth, even NAT will become more and more a luxury, and so ever more applications will be forced to rely on application layer proxies, to allow greater concentration of applications to public IP connections – at which stage you no longer have an internet.

The primary problems in this scenario are:

• The transport protocol’s port number space has been repurposed as a cross-host flow-ID
• That port number space is far too constraining for this purpose, as it’s just 2 bytes

The quickest hack fix is to extend the transport IDs. With TCP this is relatively easy to do, by defining a new TCP option. E.g. say we add a 2 * 4-byte host ID option, one ID for src, one for the dst. When replying to a packet that carried the option, you would set the dst to the received src, obviously. This would give NAT concentrators an additional number space to help associate packets with flows and the right NAT mappings.

This only fixes things for TCP, plus the﻿ ﻿﻿﻿space in the TCP header for options is becoming quite crowded and it’s not always possible to fit another 2*4+2 bytes in (properly aligned). IP also has an option mechanism, so we could define the option for IP and it would work for all IP protocols. However, low level router designers yelp in disgust at such suggestions, as they dislike having to parse out variably placed fields in HDL; indeed it is common for high-speed hardware routers to punt packets with IP options out to a slow, software data-path. The remaining possibility is a Shim style layer, i.e. in between IP and the ULPs, but that has 0 chance of graceful fallback compatibility with existing transports.

Let’s go with the IP option header as the least worst choice. It might lead to slower connectivity, but then again if your choice is between “no connectivity, because a NAT box in the middle is maxed out” and “slow connectivity, but connectivity still” then its better than nothing. Further, if such use of an IP option became wide-spread you can bet the hardware designers would eventually hold their noses and figure out a way to make common case forwarding fast even in its presence – only NAT concentrators would need to care about it.

Ok, so where are we now? We’ve got an IP option that adds 2 secondary host IDs to the IP header, thus allowing NAT concentrators to work better. Further, NAT concentrators could now even be stateless when it comes to processing packets that have this option. All it has to do is copy the dst address from the option into the IP header dst. This would allow the NATed hosts to be reachable from the outside world! The IDs don’t even have to be 4 byte each, per se.

Essentially you now have split addressing into 2. You could think of it as a split in terms of “internet” or “global network ID” and “end-host” ID, a bit like 6-to-1 or other older proposals I can’t remember the name of now, however it’s better to think of it as being 2 different addressing “scopes”:

• The current forwarding scope, i.e. the IP header src/dst
• The next forwarding scope, i.e. the option src/dst, when the the dst differs from the current scope dst

NAT boxes now become forwarding-scope context change points. End-host addressing in a sense can be thought of as end-host@concentrator. Note that the common, core scope of the internet can be blissfully unbothered by any details of the end-host number space. Indeed, if there’s no need for cross-scope communication then they can even use different addressing technologies, e.g. one IPv4 the other IPv6 (with some modification to the “copy address” scheme above).

Note that the end-host IDs no longer need be unique, e.g. 10.0.0.1@concentrator1 can be a different host from 10.0.0.1@concentrator2. In an IPv4 world, obviously the end-host IDs (or outside/non-core scope IDs) would not be globally unique. However, this scheme has benefits even with globally unique addresses, e.g. public-IP@concentrator still is beneficial because it would allow the core internet to not have to carry the prefixes for stub/leaf public IP prefixes.

You could take it a bit further and allow for prepending of scopes in this option, so you have a stack of scopes – like the label stack in MPLS – if desired. This would allow a multi-layered onion of forwarding: end-host@inner@outer. Probably not needed though.

What do we have now:

• An IP option that adds a secondary addressing scope to packet
• A 2-layer system of forwarding, extendible to n-layer
• Decoupling of internet ID space from end-host ID space
• Compatible with existing ULPs
• Backwards compatible with IPv4 for at least some use cases
• Allows NAT to scale
• Relatively minor extension to existing, widely accepted and deployed NAT architecture
• Allows scalable use of PI addressing
• No per-connection signalling required
• Minimal state

On the last point, the system clearly is not dynamic, as Shim6 tries to be. However, if the problem being solved is provider independence in the sense of being able to change providers without having to renumber internally, then this hack is adequate. Further, even if the reachability of network@concentrator is not advertised to the internet, the reachability of concentrator must be advertised at least. So there is some potential for further layering of reachability mechanisms onto this scheme.

No doubt most readers are thinking “You’re clearly on crack“. Which is what I’d have thought if I’d read the above a year or 5 ago. However, there seems to be a high-level of inertia on today’s internet for solving the addressing crunch through NAT. There seems to be little incentive to deploy IPv6. IPv6 also doesn’t solve multi-homing or routing scalability, the solutions for which all have complexity and/or compatibility issues to varying degrees. Therefore I think it is at least worth considering what happens if the internet sticks with IPv4 and NAT even as the IPv4 addressing crunch bites.

As I show above it is at least plausible to think there may be schemes that are compatible with IPv4 and allow the internet to scale up, which can likely be extended later to allow general connectivity across the NAT/scope boundaries, still with a level of backwards compatibility. Further, this scheme does not require any dynamic state or signalling protocols, beyond initial administrative setup. However, the scheme does allow for future augmentation to add dynamic behaviours.

In short, schemes can be devised which, while not solving every problem immediately, can deliver incremental benefits, by solving just a few pressing problems first and being extendible to the remaining problems over time. It’s at least worth thinking about them.

[Corrections for typos, nits and outright crack-headedness would be appreciated, as well as any comments]