This is my second post covering weird bugs and mysteries. The story takes place many years ago at Tune.
One of our backend systems was bottlenecked by a distributed datastore spread across several cr1.8xlarge instances (the modern equivalent is the r4.8xlarge). The CPU metrics showed core 0 pegged with interrupt handling, which made us suspect network I/O as the culprit. Bandwidth usage on each datastore host was modest, but the number of packets processed per second was high.
It's been a while, and I don't remember why we didn't use Receive Side Scaling or Receive Packet Steering to spread the network processing over more cores. Our approach back then was to add an extra Elastic Network Interface to each instance. Then we'd set our internal DNS to list two A records for each host, with one IPv4 entry per NIC. DNS round-robin would take care of the rest.
Adding the ENIs to each datastore host was painless. Next, we updated just one of the consumer hosts to use the new DNS configuration. After we bounced the consumer backend processes to point them at the new DNS entries, things seemed to be working OK. But when I dug in further, netstat showed that all the processes were hitting just one of the IPs for each datastore host.
We queried the DNS server directly and both IPs were listed. But... they always returned in a deterministic order! Our name server (BIND) should have been shuffling them for us.
What was going on?
Searching for help was painful. All we knew was that DNS round-robin was broken for a seemingly trivial setup. After a lot of googling about, we finally came across a crusty glibc bug report matching our symptoms.
The glibc developers claimed the behavior was not a bug by invoking RFC 3484: Default Address Selection for Internet Protocol version 6 (IPv6). It turns out that this RFC, originally created for IPv6, also impacts people who only care about IPv4 on Linux and Windows.
The RFC's introduction explains what it's about:
The nefarious rule #9 from RFC 3484 ("Use longest matching prefix") kicks in when both the source and destination are on the same subnet. This was the case at Tune, with both the consumer and the datastore instances living happily inside our Virtual Private Cloud. Rule #9 demands a deterministic sort order... completely breaking DNS round-robin!
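To make the effect concrete, here is a rough Python sketch of what a rule #9 style sort does to two A records. It is a simplification, not glibc's actual code: it assumes a single IPv4 source address and skips every earlier rule in the RFC. The addresses and the names `common_prefix_len` and `rule9_sort` are made up for illustration.

```python
import ipaddress


def common_prefix_len(a: str, b: str) -> int:
    """Count the leading bits shared by two IPv4 addresses."""
    diff = int(ipaddress.IPv4Address(a)) ^ int(ipaddress.IPv4Address(b))
    return 32 - diff.bit_length()


def rule9_sort(source: str, destinations: list[str]) -> list[str]:
    """Order destinations by longest matching prefix with the source.

    Simplified: a real resolver picks a source address per destination
    and applies several other rules first. sorted() is stable, so ties
    keep the order DNS returned them in.
    """
    return sorted(destinations, key=lambda d: -common_prefix_len(source, d))


# Hypothetical consumer host and two NICs on one datastore host.
consumer = "10.0.1.20"
print(rule9_sort(consumer, ["10.0.1.142", "10.0.1.37"]))
print(rule9_sort(consumer, ["10.0.1.37", "10.0.1.142"]))
# Both calls print ['10.0.1.37', '10.0.1.142']
```

However the name server shuffles the records, 10.0.1.37 shares 26 leading bits with the consumer and 10.0.1.142 only 24, so the same address always wins.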
Since then, the IETF has acknowledged this undesired behavior of RFC 3484. There are also some nice write-ups available on this issue, including this one from Daniel Stenberg.
Once we knew the root cause, we hacked our consumer code to shuffle the IPs coming from DNS. It was gross, but it let us move forward. With round-robin working, our load was neatly splayed over two cores on each datastore server, the network processing bottleneck was eliminated, and the global throughput bottleneck on the backend was removed.
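I no longer have the original hack, so the following is only a minimal Python sketch of the idea, not what we actually shipped. The function name `resolve_shuffled`, the `resolver` parameter, and the hostname and port in the usage comment are all hypothetical.

```python
import random
import socket


def resolve_shuffled(host, port, resolver=socket.getaddrinfo):
    """Resolve host to its IPv4 addresses and return them in random order.

    The resolver argument exists only so the lookup can be stubbed out;
    by default it is the standard library's socket.getaddrinfo.
    """
    infos = resolver(host, port, socket.AF_INET, socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr is (ip, port) for IPv4. Drop duplicates, keep first-seen order.
    addrs = list(dict.fromkeys(info[4][0] for info in infos))
    # Undo the deterministic ordering imposed by the resolver library.
    random.shuffle(addrs)
    return addrs


# Usage (hypothetical hostname and port):
#   for ip in resolve_shuffled("datastore-1.internal", 9000):
#       try connecting to (ip, 9000), falling through to the next on failure
```

Shuffling on the client makes each new connection pick a NIC at random, which approximates what DNS round-robin would have given us for free.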