The Mystery of the Broken DNS Load Balancing

Niek Sanders, April 2018

This is my second post covering weird bugs and mysteries. The story takes place many years ago at Tune.

One of our backend systems was being bottlenecked by a distributed datastore spread across several cr1.8xlarge instances (the modern equivalent is the r4.8xlarge). The CPU metrics showed core 0 pegged with interrupt handling, which made us suspect network I/O as the culprit. The bandwidth usage on each datastore host was modest, but the number of packets per second being processed was high.
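For anyone wanting to reproduce that kind of diagnosis, the imbalance is visible straight from /proc on Linux. Here's a minimal sketch (not the tooling we used back then) that prints the NET_RX softirq counter per CPU; on hosts like ours, core 0's counter dwarfs everything else.

```python
#!/usr/bin/env python3
"""Rough sketch: show how unevenly network receive work is spread across CPUs.
Reads /proc/softirqs (Linux only) and prints the NET_RX counter per CPU."""

def net_rx_per_cpu(path="/proc/softirqs"):
    with open(path) as f:
        cpus = f.readline().split()            # e.g. ['CPU0', 'CPU1', ...]
        for line in f:
            fields = line.split()
            if fields[0] == "NET_RX:":
                counts = [int(c) for c in fields[1:len(cpus) + 1]]
                return dict(zip(cpus, counts))
    return {}

if __name__ == "__main__":
    for cpu, count in net_rx_per_cpu().items():
        print(f"{cpu}: {count}")
```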

It's been a while, and I don't remember why we didn't use Receive Side Scaling or Receive Packet Steering to spread the network processing over more cores. Our approach back then was to add an extra Elastic Network Interface to each instance. Then we'd set our internal DNS to list two A records for each host, with one IPv4 entry per NIC. DNS round-robin would take care of the rest.

Adding the ENIs to each datastore host was painless. Next, we updated just one of the consumer hosts to use the new DNS configuration. After bouncing the consumer backend processes to point at the new DNS entries, things seemed to be working OK. But when I dug in more, netstat showed that all the processes were hitting just one of the IPs for each datastore host.
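The netstat check is easy to reproduce. A rough stand-in (Linux only, IPv4 only) that counts established TCP connections per remote address by parsing /proc/net/tcp looks something like this; in our case, each datastore host showed up under exactly one of its two IPs.

```python
#!/usr/bin/env python3
"""Rough equivalent of the netstat check: count established TCP connections
per remote IPv4 address by parsing /proc/net/tcp (Linux only)."""

from collections import Counter

ESTABLISHED = "01"

def hex_to_ip(hex_ip):
    # IPv4 addresses in /proc/net/tcp are stored as little-endian hex
    return ".".join(str(b) for b in bytes.fromhex(hex_ip)[::-1])

def remote_ip_counts(path="/proc/net/tcp"):
    counts = Counter()
    with open(path) as f:
        next(f)                                   # skip the header line
        for line in f:
            fields = line.split()
            rem_addr, state = fields[2], fields[3]
            if state == ESTABLISHED:
                counts[hex_to_ip(rem_addr.split(":")[0])] += 1
    return counts

if __name__ == "__main__":
    for ip, n in remote_ip_counts().most_common():
        print(f"{ip}  {n} connections")
```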

We queried the DNS server directly and both IPs were listed. But... they always came back in a deterministic order! Our name server (BIND) should have been shuffling them for us.
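You can see the same symptom from the client side by asking the resolver directly. Here's a small sketch (the hostname and port are hypothetical stand-ins for our internal names) that prints what getaddrinfo() returns on repeated calls; with round-robin working you'd expect the order to rotate, but we got the same order every single time.

```python
#!/usr/bin/env python3
"""Sketch: print the address order the resolver hands back on repeated calls.
The hostname and port are hypothetical stand-ins."""

import socket

HOST = "datastore-01.internal.example"   # hypothetical internal DNS name
PORT = 9000                              # hypothetical service port

for attempt in range(5):
    infos = socket.getaddrinfo(HOST, PORT, socket.AF_INET, socket.SOCK_STREAM)
    ips = [sockaddr[0] for *_rest, sockaddr in infos]
    # Expected: order rotates between calls. Observed: identical every time.
    print(attempt, ips)
```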

What was going on?

(xkcd.com "Computer Problems" comic)

Searching for help was painful. All we knew was that DNS round-robin was broken for a seemingly trivial setup. After a lot of googling about, we finally came across a crusty glibc bug report matching our symptoms.

The glibc developers claimed the behavior was not a bug by invoking RFC 3484: Default Address Selection for Internet Protocol version 6 (IPv6). The resolver's getaddrinfo() sorts the addresses it gets back from DNS according to the RFC's rules, so the ordering an application sees is decided on the client, not by the name server. It turns out that this RFC, originally created for IPv6, also impacts people who only care about IPv4 on Linux and Windows.

The RFC's introduction explains what it's about:

The end result is that IPv6 implementations will very often be faced with multiple possible source and destination addresses when initiating communication. It is desirable to have default algorithms, common across all implementations, for selecting source and destination addresses so that developers and administrators can reason about and predict the behavior of their systems.
Furthermore, dual or hybrid stack implementations, which support both IPv6 and IPv4, will very often need to choose between IPv6 and IPv4 when initiating communication. For example, when DNS name resolution yields both IPv6 and IPv4 addresses and the network protocol stack has available both IPv6 and IPv4 source addresses. In such cases, a simple policy to always prefer IPv6 or always prefer IPv4 can produce poor behavior. As one example, suppose a DNS name resolves to a global IPv6 address and a global IPv4 address. If the node has assigned a global IPv6 address and a 169.254/16 auto-configured IPv4 address, then IPv6 is the best choice for communication. But if the node has assigned only a link-local IPv6 address and a global IPv4 address, then IPv4 is the best choice for communication. The destination address selection algorithm solves this with a unified procedure for choosing among both IPv6 and IPv4 addresses.

The nefarious rule #9 from RFC 3484 ("use longest matching prefix") kicks in when both the source and destination are on the same subnet. This was the case at Tune, with both the consumer and the datastore instances living happily inside our Virtual Private Cloud. Rule #9 demands a deterministic sort order... completely breaking DNS round-robin!
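To make rule #9 concrete, here's a toy illustration (all addresses made up, not our real ones) of the longest-matching-prefix comparison: the candidate destination sharing the most leading bits with the source address always wins, so the sort comes out the same way on every lookup.

```python
#!/usr/bin/env python3
"""Toy illustration of RFC 3484 rule #9 ("use longest matching prefix").
Addresses are made up; this only shows why the preference is deterministic."""

import ipaddress

def common_prefix_len(a, b):
    """Number of leading bits two IPv4 addresses share."""
    xor = int(ipaddress.ip_address(a)) ^ int(ipaddress.ip_address(b))
    return 32 - xor.bit_length()

source = "10.0.1.50"                       # consumer's address (made up)
candidates = ["10.0.1.20", "10.0.2.20"]    # a datastore host's two ENI IPs (made up)

# Rule #9 prefers the destination with the longest prefix in common with the
# source, so the same candidate wins every single time.
ranked = sorted(candidates, key=lambda ip: common_prefix_len(source, ip), reverse=True)
print(ranked)   # ['10.0.1.20', '10.0.2.20'] -- always
```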

Since then, the IETF has acknowledged this undesired behavior of RFC 3484. There are also some nice write-ups on the issue, including one from Daniel Stenberg.

Once we knew the root cause, we hacked our consumer code to shuffle the IPs coming back from DNS. It was gross, but it let us move forward. With round-robin working, our load was neatly splayed over two cores on each datastore server, the network processing bottleneck was eliminated, and the global throughput bottleneck on the backend was gone.
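Conceptually, the hack looked something like the sketch below (a reconstruction, not our original code): resolve the host, shuffle the addresses ourselves to undo the resolver's sort, then connect.

```python
#!/usr/bin/env python3
"""Sketch of the workaround (a reconstruction, not the original code):
resolve the host, shuffle the A records ourselves, and connect to the
first address that answers."""

import random
import socket

def connect_round_robin(host, port, timeout=2.0):
    infos = socket.getaddrinfo(host, port, socket.AF_INET, socket.SOCK_STREAM)
    addrs = [sockaddr for *_rest, sockaddr in infos]
    random.shuffle(addrs)               # undo the resolver's RFC 3484 ordering
    last_err = None
    for addr in addrs:
        try:
            return socket.create_connection(addr, timeout=timeout)
        except OSError as err:
            last_err = err
    raise last_err if last_err else OSError(f"no addresses for {host}")

# Usage (hostname and port are hypothetical):
# conn = connect_round_robin("datastore-01.internal.example", 9000)
```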
