
Cerf and Kahn on why you want to keep IP fragmentation

In “A Protocol for Packet Network Intercommunication”, Vint Cerf and Bob Kahn explain the core design decisions behind TCP/IP, which they created. They describe the end-to-end principle. What fascinates me most, though, is their explanation of why they incorporated fragmentation into IP:

We believe the long range growth and development of internetwork communication would be seriously inhibited by specifying how much larger than the minimum a packet size can be, for the following reasons.

  1. If a maximum permitted packet size is specified then it becomes impossible to completely isolate the internal packet size parameters of one network from the internal packet size parameters of all other networks.
  2. It would be very difficult to increase the maximum permitted packet size in response to new technology (e.g. large memory systems, higher data rate communication facilities, etc.) since this would require the agreement and then implementation by all participating networks.
  3. Associative addressing and packet encryption may require the size of a particular packet to expand during transit for incorporation of new information.
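For readers who have not looked at the mechanism itself, here is a minimal sketch of how a router splits an over-size datagram. It follows the IPv4 rules as later standardised (fragment offsets counted in 8-byte units, a 20-byte header without options), not the exact scheme in the 1974 paper. The function name and the tuple layout are my own, purely for illustration.

```python
from typing import List, Tuple

def fragment(payload_len: int, mtu: int, header_len: int = 20) -> List[Tuple[int, int, bool]]:
    """Split a payload into IPv4-style fragments for a link with the given MTU.

    Returns (offset_in_8_byte_units, data_length, more_fragments) per fragment.
    Assumes a fixed header of header_len bytes on every fragment.
    """
    # Every fragment except the last must carry a multiple of 8 data bytes,
    # because the offset field counts in units of 8 bytes.
    max_data = (mtu - header_len) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")

    fragments = []
    offset = 0
    while offset < payload_len:
        length = min(max_data, payload_len - offset)
        more_fragments = offset + length < payload_len
        fragments.append((offset // 8, length, more_fragments))
        offset += length
    return fragments
```

Calling `fragment(1480, 576)` models a full 1500-byte datagram (1480 bytes of payload) crossing a link with a 576-byte MTU: the payload ends up in three fragments, and only the last one has the "more fragments" flag cleared.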

Fragmentation is generally undesirable if it can be avoided, as it has a performance cost. The fragmenting router may handle it on a slow path, for example, and re-assembly at the end host may introduce delay. As a consequence, end hosts have for a long while generally performed path MTU discovery (PMTUD) to find the largest MTU usable along the whole path to a destination. This lets them generate IP packets of just the right size, set the “Don’t Fragment” bit on all packets, and so generally avoid intermediary fragmentation. (If the upper-level protocol does not support some kind of segmentation of its own, as TCP does, the host may still have to generate IP fragments itself.)

Unfortunately, PMTUD relies on ICMP messages, which are sent out-of-band, and as the internet became bigger, more and more less-than-clueful people became involved in the design and administration of the equipment needed to route IP packets. Routers started to simply ignore over-size packets, and (even more commonly) firewalls started to stupidly filter out nearly all ICMP, including the important “Destination Unreachable: Fragmentation Needed” message that PMTUD depends on. As a consequence, end-host path MTU discovery can be fragile. When it fails, the end result is a “Path MTU blackhole”: packets get dropped at a router for being too big, while the ICMP messages sent back to the host get dropped too (usually elsewhere), so the host never learns to reduce its packet sizes. With IP fragmentation, communication may be slow; with PMTU blackholing, it becomes impossible.
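To make the failure mode concrete, here is a toy simulation of PMTUD over a path described as a list of link MTUs. It is a sketch under simplifying assumptions, not a model of any real stack: every packet has the “Don’t Fragment” bit set, the router that drops a packet reports its next-hop MTU in the ICMP message, and a single `icmp_filtered` flag stands in for a firewall somewhere on the return path. All names are hypothetical.

```python
from typing import List, Optional, Tuple

def send_df_packet(size: int, link_mtus: List[int], icmp_filtered: bool) -> Tuple[bool, Optional[int]]:
    """Send one Don't-Fragment packet along the path.

    Returns (delivered, reported_mtu). reported_mtu is the MTU carried in the
    ICMP "Fragmentation Needed" message, or None if no such message arrives.
    """
    for mtu in link_mtus:
        if size > mtu:
            # The router must drop the packet. Whether the sender hears
            # about it depends on the ICMP message surviving the trip back.
            return False, (None if icmp_filtered else mtu)
    return True, None

def discover_path_mtu(first_hop_mtu: int, link_mtus: List[int],
                      icmp_filtered: bool, max_tries: int = 10) -> Optional[int]:
    """Return the packet size that gets through, or None for a black hole."""
    size = first_hop_mtu
    for _ in range(max_tries):
        delivered, reported_mtu = send_df_packet(size, link_mtus, icmp_filtered)
        if delivered:
            return size
        if reported_mtu is not None:
            size = reported_mtu  # shrink to what the router told us
        # Otherwise: no feedback at all, so the sender just retransmits
        # the same over-size packet until it gives up.
    return None
```

With the path `[1500, 1400, 1280]`, discovery converges on 1280 when ICMP gets through. With `icmp_filtered=True` the sender never gets past its first guess and the function returns `None`, which is the black hole described above.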

As a consequence of this, some upper-level application protocols actually implement their own blackhole detection, on top of any lower-layer PMTU/segmentation support. One example is EDNS0, which specifies that implementations must take path MTU into account (above the transport layer!).
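What such application-level detection tends to look like is a fallback loop: try a large message, and if nothing comes back, assume the path ate it and retry smaller. The sketch below is generic and is not taken from any DNS implementation. `try_exchange` is a hypothetical callback that returns `True` if a request of the given size was answered before a timeout, and the candidate sizes are illustrative values only.

```python
from typing import Callable, Iterable, Optional

def largest_working_size(try_exchange: Callable[[int], bool],
                         candidates: Iterable[int]) -> Optional[int]:
    """Return the largest candidate size for which an exchange succeeds.

    A failed exchange is treated as "probably dropped for being too big",
    which is all an application can infer when ICMP never arrives.
    """
    for size in sorted(candidates, reverse=True):
        if try_exchange(size):
            return size
    return None  # nothing worked: the problem is not (only) packet size

def simulated_exchange(size: int) -> bool:
    """Stand-in for a real network round trip: a path that silently drops
    anything above 1372 bytes of payload (an arbitrary, made-up limit)."""
    return size <= 1372
```

Here `largest_working_size(simulated_exchange, [4096, 1232, 512])` settles on 1232 after the 4096-byte attempt times out. The cost is that every failed probe is paid for in timeouts, which is exactly the kind of work the lower layers were supposed to spare the application.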

So now the internet is crippled by an effective MTU of 1500 bytes. Though our equipment is generally capable of sending much larger datagrams, we have collectively failed to heed Cerf & Kahn’s wise words. The internet cannot freely use the handy tool of encapsulation to encrypt packets, or to reroute them to mobile users, because every layer of encapsulation enlarges the packet and risks pushing it past that limit. Possibly the worst aspect is that IPv6 completely removed fragmentation by routers; only the sending host may fragment. While there’s a good argument that end-to-end packet resizing is preferable to intermediary fragmentation, IPv6 still relies on out-of-band signalling of over-size packets and does nothing to address that mechanism’s fragility. It has therefore likely cast the MTU mess in stone for the next generation of inter-networking.
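The arithmetic behind the encapsulation problem is simple enough to write down. The sketch below uses the minimum header sizes (20 bytes for IPv4, 40 for IPv6, 4 for a GRE header with no optional fields). The tunnel names and helper functions are my own labels for illustration, and real tunnels often add more overhead than this.

```python
LINK_MTU = 1500  # the effective limit discussed above, in bytes

# Bytes added in front of the inner packet by each kind of tunnel,
# assuming minimum-size headers and no options.
OVERHEAD = {
    "ipv4_in_ipv4": 20,        # one extra IPv4 header
    "gre_over_ipv4": 20 + 4,   # outer IPv4 header plus base GRE header
    "ipv6_in_ipv6": 40,        # one extra IPv6 header
}

def inner_mtu(link_mtu: int, overhead: int) -> int:
    """Largest inner packet that still fits in one outer packet."""
    return link_mtu - overhead

def fits(inner_size: int, link_mtu: int, overhead: int) -> bool:
    """True if the encapsulated packet fits on the link without fragmenting."""
    return inner_size + overhead <= link_mtu
```

A host that has filled a 1500-byte packet gets `fits(1500, LINK_MTU, OVERHEAD["ipv4_in_ipv4"]) == False`: the tunnelled packet is 1520 bytes. Either the tunnel endpoint fragments it, or the host has to be told to shrink to `inner_mtu(LINK_MTU, 20) == 1480`, which brings us straight back to the fragile signalling.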

Updated: Some clarifications. Added consequence of how PMTU breaks due to ICMP filtering. Added how ULPs now have to work around these transport layer failings. Added why fragmentation was removed from IPv6, and word-smithed the conclusion a bit.
