This is another post on weird bugs and mysteries. A few days ago I looked at contributing to Timely Dataflow. I dug into the repo's radix sort implementation and saw some odd behavior when running the benchmarks:
test rsort_lsb_u32_20 ... bench: 13,115,873 ns/iter (+/- 861,527) = 319 MB/s
test rsort_lsb_u32_20 ... bench: 22,135,310 ns/iter (+/- 772,498) = 189 MB/s <==
test rsort_lsb_u32_20 ... bench: 13,009,435 ns/iter (+/- 364,713) = 322 MB/s
test rsort_lsb_u32_20 ... bench: 13,008,352 ns/iter (+/- 713,115) = 322 MB/s
test rsort_lsb_u32_20 ... bench: 13,083,947 ns/iter (+/- 769,034) = 320 MB/s
test rsort_lsb_u32_20 ... bench: 21,942,109 ns/iter (+/- 978,852) = 191 MB/s <==
test rsort_lsb_u32_20 ... bench: 13,088,477 ns/iter (+/- 785,538) = 320 MB/s
test rsort_lsb_u32_20 ... bench: 13,064,438 ns/iter (+/- 762,363) = 321 MB/s
On my desktop, an Intel Core i5-7600, about a fifth of the runs slowed from 320 MB/s to 190 MB/s. This was particularly confounding because the benchmark is deterministic.
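For context, the benchmark exercises a least-significant-byte (LSB) radix sort over u32 keys. Below is a minimal sketch of the technique, just to show the kind of work being measured. This is my own illustrative version, not Timely Dataflow's implementation, and the 2^20 element count and the xorshift input generator are assumptions based on the benchmark name.

// Sketch of an LSB radix sort for u32 keys (illustrative, not Timely's code).
fn rsort_lsb_u32(data: &mut Vec<u32>) {
    let mut buffer = vec![0u32; data.len()];
    // Four passes, one per byte, least significant byte first.
    for pass in 0..4 {
        let shift = pass * 8;

        // Histogram the current byte of every key.
        let mut counts = [0usize; 256];
        for &x in data.iter() {
            counts[((x >> shift) & 0xFF) as usize] += 1;
        }

        // Exclusive prefix sum turns counts into per-bucket start offsets.
        let mut offsets = [0usize; 256];
        let mut sum = 0;
        for b in 0..256 {
            offsets[b] = sum;
            sum += counts[b];
        }

        // Stable scatter into the buffer, then swap buffers for the next pass.
        for &x in data.iter() {
            let b = ((x >> shift) & 0xFF) as usize;
            buffer[offsets[b]] = x;
            offsets[b] += 1;
        }
        std::mem::swap(data, &mut buffer);
    }
}

fn main() {
    // Deterministic input: a fixed xorshift32 sequence, so every run sorts
    // exactly the same data (2^20 keys assumed, matching the "_20" suffix).
    let mut seed = 0x12345678u32;
    let mut data: Vec<u32> = (0..(1u32 << 20))
        .map(|_| {
            seed ^= seed << 13;
            seed ^= seed >> 17;
            seed ^= seed << 5;
            seed
        })
        .collect();
    rsort_lsb_u32(&mut data);
    assert!(data.windows(2).all(|w| w[0] <= w[1]));
}

The scatter step is the interesting part for performance: each pass hops across 256 destination regions, so throughput is largely a question of how the memory system copes with that access pattern.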
I got a little obsessed with finding the root cause. After wrapping my head around the radix sort code, I profiled with perf_events using "perf record" and "perf report". But the profiles for fast and slow runs looked proportionally almost identical.
Next I inspected performance counters with "perf stat -d". The mystery deepened when comparing a good run against a bad one:
"good run"
test rsort_lsb_u32_20 ... bench: 13,082,895 ns/iter (+/- 802,286) = 320 MB/s
Performance counter stats for '/home/nieksand/.cargo/bin/cargo bench rsort_lsb_u32_20':
3966,057471 task-clock (msec) # 0,990 CPUs utilized
39 context-switches # 0,010 K/sec
2 cpu-migrations # 0,001 K/sec
189.429 page-faults # 0,048 M/sec
16.254.795.809 cycles # 4,098 GHz
29.488.536.562 instructions # 1,81 insn per cycle
2.799.011.967 branches # 705,742 M/sec
6.735.243 branch-misses # 0,24% of all branches
8.004.471.223 L1-dcache-loads # 2018,244 M/sec
1.168.268.824 L1-dcache-load-misses # 14,60% of all L1-dcache hits
9.072.863 LLC-loads # 2,288 M/sec
1.280.500 LLC-load-misses # 14,11% of all LL-cache hits
4,006246558 seconds time elapsed
"bad run"
test rsort_lsb_u32_20 ... bench: 22,093,918 ns/iter (+/- 9,656,989) = 189 MB/s
Performance counter stats for '/home/nieksand/.cargo/bin/cargo bench rsort_lsb_u32_20':
5055,207656 task-clock (msec) # 0,937 CPUs utilized
94 context-switches # 0,019 K/sec
1 cpu-migrations # 0,000 K/sec
189.418 page-faults # 0,037 M/sec
21.962.826.744 cycles # 4,345 GHz
29.493.221.138 instructions # 1,34 insn per cycle
2.799.928.144 branches # 553,870 M/sec
6.728.506 branch-misses # 0,24% of all branches
8.005.459.112 L1-dcache-loads # 1583,606 M/sec
1.130.853.381 L1-dcache-load-misses # 14,13% of all L1-dcache hits
8.311.800 LLC-loads # 1,644 M/sec
1.269.525 LLC-load-misses # 15,27% of all LL-cache hits
5,397361827 seconds time elapsed
Most counts are very close. But somehow the good runs manage ~1.8 instructions per cycle, whereas the bad runs putter along at ~1.3. The instruction counts are essentially identical, so the bad runs burn roughly 5.7 billion extra cycles to do the same work; the CPU is stalling, not executing more instructions.
Despite the similar L1 load/miss ratios, I guessed that alignment issues were coming into play. I hacked the Rust benchmark to dump the memory addresses of each vector's buffer, and they were all at least 64-byte aligned. So even vectorized AVX-512 instructions should have been fine. That idea was a bust.
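For reference, the check amounted to printing each buffer's address and its offset from a 64-byte boundary (an offset of 0 means the buffer starts on a cache-line boundary). This is a standalone sketch with hypothetical names, not the actual hack I made inside the benchmark:

// Sketch of the alignment dump (hypothetical helper, not the benchmark's code).
fn dump_alignment<T>(name: &str, v: &[T]) {
    let addr = v.as_ptr() as usize;
    println!("{}: {:p}, offset from 64-byte boundary = {}", name, v.as_ptr(), addr % 64);
}

fn main() {
    let src: Vec<u32> = vec![0; 1 << 20];
    let dst: Vec<u32> = vec![0; 1 << 20];
    dump_alignment("src", &src);
    dump_alignment("dst", &dst);
}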
Then I tried reproducing the behavior on different machines. I used three AWS instance types: c3.2xlarge, c4.2xlarge, and c5.large. All of them use Intel Xeon CPUs and showed steady throughput across runs. (I did 20 runs per machine, but only show 3 here for brevity.)
"c3.2xlarge"
test rsort_lsb_u32_20 ... bench: 24,280,929 ns/iter (+/- 2,554,741) = 172 MB/s
test rsort_lsb_u32_20 ... bench: 24,193,735 ns/iter (+/- 2,120,372) = 173 MB/s
test rsort_lsb_u32_20 ... bench: 24,119,255 ns/iter (+/- 2,040,252) = 173 MB/s
...
"c5.large"
test rsort_lsb_u32_20 ... bench: 14,928,288 ns/iter (+/- 555,767) = 280 MB/s
test rsort_lsb_u32_20 ... bench: 14,213,352 ns/iter (+/- 699,836) = 295 MB/s
test rsort_lsb_u32_20 ... bench: 14,908,216 ns/iter (+/- 425,255) = 281 MB/s
...
The c3 and c4 throughputs were about the same, so only one is shown. The c5 family shows a significant performance boost over the older generations. Interestingly, my consumer Intel i5-7600 significantly outperforms the c5's Intel Xeon Platinum 8124M on single-core performance: 320 MB/s vs. 280 MB/s.
So the AWS instances didn't have performance outliers. The problem was specific to my home setup. I spent a few hours wading through the immense number of different counters exposed by perf_events, trying to understand why the instructions/cycle would sometimes vary so drastically. None of my endless perf runs showed significant differences.
I was about to give up.
There was one last counter I decided to try... hw_interrupts.received.
"good run"
test rsort_lsb_u32_20 ... bench: 13,041,410 ns/iter (+/- 760,375) = 321 MB/s
Performance counter stats for '/home/nieksand/.cargo/bin/cargo bench rsort_lsb_u32_20':
1.161 hw_interrupts.received
"bad run"
test rsort_lsb_u32_20 ... bench: 21,963,625 ns/iter (+/- 878,996) = 190 MB/s
Performance counter stats for '/home/nieksand/.cargo/bin/cargo bench rsort_lsb_u32_20':
1.880 hw_interrupts.received
Success! I finally had a real difference between the two scenarios that I could reproduce easily.
I'd love to say that I used some advanced technique to definitively track down the source of the hardware interrupts. In reality, I just took a guess that the culprit was my little USB wireless NIC.
Sure enough, after disabling it the latency outliers went away:
test rsort_lsb_u32_20 ... bench: 13,078,868 ns/iter (+/- 805,109) = 320 MB/s
test rsort_lsb_u32_20 ... bench: 13,122,640 ns/iter (+/- 737,939) = 319 MB/s
test rsort_lsb_u32_20 ... bench: 13,057,573 ns/iter (+/- 736,634) = 321 MB/s
test rsort_lsb_u32_20 ... bench: 13,052,186 ns/iter (+/- 822,351) = 321 MB/s
test rsort_lsb_u32_20 ... bench: 13,107,309 ns/iter (+/- 772,251) = 319 MB/s
...
(snipped 55 more runs)
...
So I'm calling the mystery solved. I mentioned my obsession with tracking this down: when I started it was 10pm, and somehow it was 4:30am when I wrapped up.