This is my first post covering weird bugs and mysteries. The story takes place back in 2014, while I was working at Tune. We were surprised by an outgoing bandwidth bill in Japan, and everything pointed to AWS billing us incorrectly.
It started with line items for 47 TB (~$7800) of outgoing public bandwidth in AWS's Tokyo region:
We suspected the traffic was from Tune's analytics ingest fleet. These hosts receive event data from mobile apps instrumented with our SDK. They respond to each incoming request with a tiny (~20 byte) JSON payload:
{"accepted": true}
We divided the data transferred by request volume and it came out to several KB per request. Even allowing for a few hundred bytes of HTTP headers, the number just didn't make sense.
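To give a feel for that back-of-envelope math, here's a minimal Python sketch. The request count is invented purely for illustration (it is not Tune's real traffic); only the ratio matters.

```python
# Back-of-envelope check. The request volume below is hypothetical.
billed_bytes = 47 * 1024**4                          # ~47 TiB billed by AWS
response_body = len('{"accepted": true}'.encode())   # 18 bytes
headers_estimate = 300                               # a few hundred bytes of HTTP headers

requests = 10_000_000_000                            # made-up request volume
print(billed_bytes / requests)                       # ~5168 bytes per request billed
print(response_body + headers_estimate)              # ~318 bytes we could account for
```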
To verify where the traffic was coming from, we started digging into our Ganglia metrics. As expected, the only real traffic was coming from the analytics ingest hosts. That part of the system looked like this:
We reconciled the measured traffic between the ingest hosts and our internal routing tier (haproxy) and those numbers lined up. But they were over 40 TB less than what AWS was claiming!
It was looking like a problem on AWS's side. But we wanted more proof. Just in case our Ganglia metrics were wrong, we pulled the CloudWatch metrics for those hosts too. Again, they all lined up. We even pulled a full day of ELB access logs and summed up the data size across all requests. Again, a perfect match. So at this point we had:
- Ganglia metrics for haproxy showing just a few TB of outgoing data
- Ganglia metrics for the analytics ingest hosts showing the same few TB
- CloudWatch metrics for all those hosts verifying Ganglia was correct
- ELB access log analysis (sketched after this list) again confirming the smaller data transfer number
- An estimate for response size which, multiplied against request volume, matched the above
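The access log check in that list was nothing fancy: pull a day of ELB logs and sum the sent-bytes column. A minimal sketch, assuming the classic ELB access-log format of the time and a made-up local path:

```python
import glob

# Sum outgoing bytes across a day of classic ELB access logs.
# Classic ELB field layout: timestamp elb client:port backend:port
#   request_time backend_time response_time elb_status backend_status
#   received_bytes sent_bytes "request" "user_agent" ssl_cipher ssl_protocol
total_sent = 0
for path in glob.glob("elb-logs/*.log"):      # hypothetical local path
    with open(path) as f:
        for line in f:
            fields = line.split(" ")
            total_sent += int(fields[10])     # sent_bytes column

print(f"{total_sent / 1024**4:.2f} TiB sent at layer 7")
```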
We filed an AWS Support ticket. It bounced around the support organization for a while without a convincing response, but our excellent AWS Account Manager got it escalated to the ELB team.
A few days later AWS had an answer for us.
So the data transfer we measured from the ELB logs was all at OSI Layer 7 (HTTP). Of course, actual data transfer involves more than that. But what had we missed? SSL!
Those tiny 20-byte responses had over 4 KiB of SSL certificate chain attached. Since we terminated SSL connections at the ELB, we never saw the overhead on any internal bandwidth metrics. Once the ELB team pointed this out, it was easy to verify using a pcap.
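If you'd rather not dig through a pcap by hand, here's a rough Python approximation of the same check: it counts how many bytes the server sends during the TLS handshake alone, before any application data. The hostname is a placeholder, not our real ingest endpoint, and on TLS 1.3 the count may include a session ticket or two.

```python
import socket
import ssl

# Count bytes the server sends during the TLS handshake (certificate chain included).
HOST, PORT = "target.website.com", 443   # placeholder endpoint

ctx = ssl.create_default_context()
incoming, outgoing = ssl.MemoryBIO(), ssl.MemoryBIO()
tls = ctx.wrap_bio(incoming, outgoing, server_hostname=HOST)

server_bytes = 0
with socket.create_connection((HOST, PORT)) as sock:
    while True:
        try:
            tls.do_handshake()
            break
        except ssl.SSLWantReadError:
            sock.sendall(outgoing.read())    # flush our handshake messages
            chunk = sock.recv(65536)         # read the server's reply
            if not chunk:                    # connection closed unexpectedly
                break
            server_bytes += len(chunk)
            incoming.write(chunk)
    sock.sendall(outgoing.read())            # flush the final Finished message

print(f"server sent {server_bytes} bytes before any application data")
```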
The SSL certificate chain looked something like this:
| Certificate | Size |
| --- | --- |
| GoDaddy SHA1 Root Certificate | ~1.0 KiB (not actually needed) |
| GoDaddy SHA2 Root Certificate | ~1.1 KiB |
| GoDaddy SHA2 Intermediate Certificate | ~1.2 KiB |
| Tune SHA2 Certificate | ~1.3 KiB |
Older Android devices lack SHA2 root certificates in their trust store. That means we had to include the GoDaddy SHA2 root (+1.1 KiB) in our chain, since it's cross-signed by the legacy SHA1 root that those ancient phones do trust. The GoDaddy SHA1 root itself was also included at the time, but it shouldn't have been: clients rely on the copy in their own trust store. Sending it added another 1.0 KiB per message.
You can get this information using OpenSSL:
openssl s_client -connect target.website.com:443 -showcerts
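And if you want per-certificate sizes rather than eyeballing the PEM blocks, a small Python wrapper around that same command works. This is just a sketch: the hostname is the same placeholder as above, and the sizes reported are the DER bytes actually sent on the wire during the handshake.

```python
import base64
import re
import subprocess

host = "target.website.com"   # placeholder, same as the command above

# Run openssl s_client with stdin closed so it exits right after the handshake.
out = subprocess.run(
    ["openssl", "s_client", "-connect", f"{host}:443", "-showcerts"],
    input=b"", capture_output=True,
).stdout.decode(errors="replace")

# Pull out each PEM certificate the server presented.
pem_blocks = re.findall(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----", out, re.S
)

total = 0
for i, block in enumerate(pem_blocks):
    der = base64.b64decode("".join(block.split()))   # DER = what goes on the wire
    total += len(der)
    print(f"cert {i}: {len(der)} bytes ({len(der) / 1024:.1f} KiB)")
print(f"chain total: {total / 1024:.1f} KiB")
```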
The root cause is obvious in retrospect. We'd have caught it if we had packet-captured an actual request hitting our ingest endpoint. The size of the overhead still surprises me, given how often SSL is described as essentially "free". (People also gloss over the extra round-trip latency from SSL handshake negotiation.)
Back then AWS didn't expose any CloudWatch metric that would have shown you this bandwidth consumption. The omission was doubly annoying because it's what AWS actually bills on. They finally added that years later, in July 2017.
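These days you can pull that number straight from CloudWatch. A hedged boto3 sketch, assuming a Classic Load Balancer and the EstimatedProcessedBytes metric (double-check the metric name and namespace for your load balancer type; the load balancer name below is made up):

```python
from datetime import datetime, timedelta, timezone

import boto3

# Sum a day's worth of processed bytes for a Classic Load Balancer in Tokyo.
# Metric and namespace names are from memory -- verify them for your LB type.
cloudwatch = boto3.client("cloudwatch", region_name="ap-northeast-1")

end = datetime.now(timezone.utc)
resp = cloudwatch.get_metric_statistics(
    Namespace="AWS/ELB",
    MetricName="EstimatedProcessedBytes",
    Dimensions=[{"Name": "LoadBalancerName", "Value": "analytics-ingest"}],  # made-up name
    StartTime=end - timedelta(days=1),
    EndTime=end,
    Period=3600,
    Statistics=["Sum"],
)

total = sum(dp["Sum"] for dp in resp["Datapoints"])
print(f"{total / 1024**4:.2f} TiB processed in the last day")
```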
The "we" in this article refers to myself and a former Tune colleague, Chris Henry. Back then he was the Director of IT & Ops at Tune; these days he does Production Engineering at Facebook.