This is another post on weird bugs and mysteries. A few days ago I looked at contributing to Timely Dataflow. I dug into the repo's radix sort implementation and saw some odd behavior when running the benchmarks:
test rsort_lsb_u32_20 ... bench: 13,115,873 ns/iter (+/- 861,527) = 319 MB/s
test rsort_lsb_u32_20 ... bench: 22,135,310 ns/iter (+/- 772,498) = 189 MB/s <==
test rsort_lsb_u32_20 ... bench: 13,009,435 ns/iter (+/- 364,713) = 322 MB/s
test rsort_lsb_u32_20 ... bench: 13,008,352 ns/iter (+/- 713,115) = 322 MB/s
test rsort_lsb_u32_20 ... bench: 13,083,947 ns/iter (+/- 769,034) = 320 MB/s
test rsort_lsb_u32_20 ... bench: 21,942,109 ns/iter (+/- 978,852) = 191 MB/s <==
test rsort_lsb_u32_20 ... bench: 13,088,477 ns/iter (+/- 785,538) = 320 MB/s
test rsort_lsb_u32_20 ... bench: 13,064,438 ns/iter (+/- 762,363) = 321 MB/s
On my desktop, an Intel Core i5-7600, about a fifth of the runs slowed from 320 MB/s to 190 MB/s. This was particularly confounding because the benchmark is deterministic.
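For context, the benchmark exercises a least-significant-byte (LSB) radix sort over u32 keys. Below is a minimal sketch of the technique, just to show the kind of work being measured. This is my own illustrative version, not Timely Dataflow's implementation, and the 2^20 element count and the xorshift input generator are assumptions based on the benchmark name.

// Sketch of an LSB radix sort for u32 keys (illustrative, not Timely's code).
fn rsort_lsb_u32(data: &mut Vec<u32>) {
    let mut buffer = vec![0u32; data.len()];
    // Four passes, one per byte, least significant byte first.
    for pass in 0..4 {
        let shift = pass * 8;

        // Histogram the current byte of every key.
        let mut counts = [0usize; 256];
        for &x in data.iter() {
            counts[((x >> shift) & 0xFF) as usize] += 1;
        }

        // Exclusive prefix sum turns counts into per-bucket start offsets.
        let mut offsets = [0usize; 256];
        let mut sum = 0;
        for b in 0..256 {
            offsets[b] = sum;
            sum += counts[b];
        }

        // Stable scatter into the buffer, then swap buffers for the next pass.
        for &x in data.iter() {
            let b = ((x >> shift) & 0xFF) as usize;
            buffer[offsets[b]] = x;
            offsets[b] += 1;
        }
        std::mem::swap(data, &mut buffer);
    }
}

fn main() {
    // Deterministic input: a fixed xorshift32 sequence, so every run sorts
    // exactly the same data (2^20 keys assumed, matching the "_20" suffix).
    let mut seed = 0x12345678u32;
    let mut data: Vec<u32> = (0..(1u32 << 20))
        .map(|_| {
            seed ^= seed << 13;
            seed ^= seed >> 17;
            seed ^= seed << 5;
            seed
        })
        .collect();
    rsort_lsb_u32(&mut data);
    assert!(data.windows(2).all(|w| w[0] <= w[1]));
}

The scatter step is the interesting part for performance: each pass hops across 256 destination regions, so throughput is largely a question of how the memory system copes with that access pattern.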
I got a little obsessed with finding the root cause. After wrapping my head around the radix sort code, I profiled with perf_events using "perf record" and "perf report". But the profiles for fast and slow runs looked proportionally almost identical.
Next I inspected performance counters with "perf stat -d". The mystery deepened when comparing a good run against a bad one:
"good run"
test rsort_lsb_u32_20 ... bench: 13,082,895 ns/iter (+/- 802,286) = 320 MB/s
Performance counter stats for '/home/nieksand/.cargo/bin/cargo bench rsort_lsb_u32_20':
3966,057471 task-clock (msec) # 0,990 CPUs utilized
39 context-switches # 0,010 K/sec
2 cpu-migrations # 0,001 K/sec
189.429 page-faults # 0,048 M/sec
16.254.795.809 cycles # 4,098 GHz
29.488.536.562 instructions # 1,81 insn per cycle
2.799.011.967 branches # 705,742 M/sec
6.735.243 branch-misses # 0,24% of all branches
8.004.471.223 L1-dcache-loads # 2018,244 M/sec
1.168.268.824 L1-dcache-load-misses # 14,60% of all L1-dcache hits
9.072.863 LLC-loads # 2,288 M/sec
1.280.500 LLC-load-misses # 14,11% of all LL-cache hits
4,006246558 seconds time elapsed
"bad run"
test rsort_lsb_u32_20 ... bench: 22,093,918 ns/iter (+/- 9,656,989) = 189 MB/s
Performance counter stats for '/home/nieksand/.cargo/bin/cargo bench rsort_lsb_u32_20':
5055,207656 task-clock (msec) # 0,937 CPUs utilized
94 context-switches # 0,019 K/sec
1 cpu-migrations # 0,000 K/sec
189.418 page-faults # 0,037 M/sec
21.962.826.744 cycles # 4,345 GHz
29.493.221.138 instructions # 1,34 insn per cycle
2.799.928.144 branches # 553,870 M/sec
6.728.506 branch-misses # 0,24% of all branches
8.005.459.112 L1-dcache-loads # 1583,606 M/sec
1.130.853.381 L1-dcache-load-misses # 14,13% of all L1-dcache hits
8.311.800 LLC-loads # 1,644 M/sec
1.269.525 LLC-load-misses # 15,27% of all LL-cache hits
5,397361827 seconds time elapsed
Most counts are very close. But somehow the good runs manage ~1.8 instructions per cycle, whereas the bad runs putter along at ~1.3. The instruction counts are essentially identical, so the bad runs burn roughly 5.7 billion extra cycles to do the same work; the CPU is stalling, not executing more instructions.
Despite the similar L1 load/miss ratios, I guessed that alignment issues were coming into play. I hacked the Rust benchmark to dump the memory addresses of each vector's buffer, and they were all at least 64-byte aligned. So even vectorized AVX-512 instructions should have been fine. That idea was a bust.
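For reference, the check amounted to printing each buffer's address and its offset from a 64-byte boundary (an offset of 0 means the buffer starts on a cache-line boundary). This is a standalone sketch with hypothetical names, not the actual hack I made inside the benchmark:

// Sketch of the alignment dump (hypothetical helper, not the benchmark's code).
fn dump_alignment<T>(name: &str, v: &[T]) {
    let addr = v.as_ptr() as usize;
    println!("{}: {:p}, offset from 64-byte boundary = {}", name, v.as_ptr(), addr % 64);
}

fn main() {
    let src: Vec<u32> = vec![0; 1 << 20];
    let dst: Vec<u32> = vec![0; 1 << 20];
    dump_alignment("src", &src);
    dump_alignment("dst", &dst);
}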
Then I tried reproducing the behavior on different machines. I used three AWS instance types: c3.2xlarge, c4.2xlarge, and c5.large. All of them use Intel Xeon CPUs and showed steady throughput across runs. (I did 20 runs per machine, but only show 3 here for brevity.)
"c3.2xlarge"
test rsort_lsb_u32_20 ... bench: 24,280,929 ns/iter (+/- 2,554,741) = 172 MB/s
test rsort_lsb_u32_20 ... bench: 24,193,735 ns/iter (+/- 2,120,372) = 173 MB/s
test rsort_lsb_u32_20 ... bench: 24,119,255 ns/iter (+/- 2,040,252) = 173 MB/s
...
"c5.large"
test rsort_lsb_u32_20 ... bench: 14,928,288 ns/iter (+/- 555,767) = 280 MB/s
test rsort_lsb_u32_20 ... bench: 14,213,352 ns/iter (+/- 699,836) = 295 MB/s
test rsort_lsb_u32_20 ... bench: 14,908,216 ns/iter (+/- 425,255) = 281 MB/s
...
The c3 and c4 throughputs were about the same, so only one is shown. The c5 family shows a significant performance boost over the older generations. Interestingly, my consumer Intel i5-7600 significantly outperforms the c5's Intel Xeon Platinum 8124M on single-core performance: 320 MB/s vs. 280 MB/s.
So the AWS instances didn't have performance outliers. The problem was specific to my home setup. I spent a few hours wading through the immense number of different counters exposed by perf_events, trying to understand why the instructions/cycle would sometimes vary so drastically. None of my endless perf runs showed significant differences.
I was about to give up.
There was one last counter I decided to try... hw_interrupts.received.
"good run"
test rsort_lsb_u32_20 ... bench: 13,041,410 ns/iter (+/- 760,375) = 321 MB/s
Performance counter stats for '/home/nieksand/.cargo/bin/cargo bench rsort_lsb_u32_20':
1.161 hw_interrupts.received
"bad run"
test rsort_lsb_u32_20 ... bench: 21,963,625 ns/iter (+/- 878,996) = 190 MB/s
Performance counter stats for '/home/nieksand/.cargo/bin/cargo bench rsort_lsb_u32_20':
1.880 hw_interrupts.received
Success! I finally had a real difference between the two scenarios that I could reproduce easily.
I'd love to say that I used some advanced technique to definitively track down the source of the hardware interrupts. In reality, I just took a guess that the culprit was my little USB wireless NIC.
Sure enough, after disabling it the latency outliers went away:
test rsort_lsb_u32_20 ... bench: 13,078,868 ns/iter (+/- 805,109) = 320 MB/s
test rsort_lsb_u32_20 ... bench: 13,122,640 ns/iter (+/- 737,939) = 319 MB/s
test rsort_lsb_u32_20 ... bench: 13,057,573 ns/iter (+/- 736,634) = 321 MB/s
test rsort_lsb_u32_20 ... bench: 13,052,186 ns/iter (+/- 822,351) = 321 MB/s
test rsort_lsb_u32_20 ... bench: 13,107,309 ns/iter (+/- 772,251) = 319 MB/s
...
(snipped 55 more runs)
...
So I'm calling the mystery solved. I mentioned my obsession with tracking this down: when I started it was 10pm, and somehow it was 4:30am when I wrapped up.