Microbenchmarks

Factors impacting benchmarks

Factors impacting Python benchmarks:

  • Linux Address Space Layout Randomization (ASLR), /proc/sys/kernel/randomize_va_space:
    • 0: No randomization
    • 1: Conservative randomization
    • 2: Full randomization
  • Python random hash function: PYTHONHASHSEED
  • Command line arguments and environment variables: their size shifts the process stack and so memory addresses; enabling ASLR helps to randomize this effect (?)
  • CPU power saving and performance features: disable Intel Turbo Boost and/or use a fixed CPU frequency.
  • Temperature: temperature has a limited impact on benchmarks. Intel CPUs run at full speed as long as they stay below 95°C. With a correct cooling system, temperature is not an issue.
  • Linux perf probes: /proc/sys/kernel/perf_event_max_sample_rate
  • Code locality, CPU L1 instruction cache (L1i): Profile-Guided Optimization (PGO) helps here
  • Other processes and the kernel, CPU isolation (CPU pinning) helps here: use isolcpus=cpu_list and rcu_nocbs=cpu_list on the Linux kernel command line
  • ... Reboot? Sadly, other unknown factors may still impact benchmarks. Sometimes, it helps to reboot to restore normal performance.
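As a concrete example of the PYTHONHASHSEED factor: hash() on str and bytes is randomized per interpreter process, which changes dict layout and can shift timings between runs. A minimal sketch to observe this (the string 'benchmark' is just an arbitrary test value):

```python
import subprocess, sys

# Run the same hash() twice with a fixed seed: results are reproducible.
cmd = [sys.executable, "-c", "print(hash('benchmark'))"]
fixed = [
    subprocess.run(cmd, env={"PYTHONHASHSEED": "0"},
                   capture_output=True, text=True).stdout
    for _ in range(2)
]
assert fixed[0] == fixed[1]  # same seed => same hash

# Without a fixed seed, each interpreter picks a random seed:
random_runs = [
    subprocess.run(cmd, env={"PYTHONHASHSEED": "random"},
                   capture_output=True, text=True).stdout
    for _ in range(2)
]
# random_runs[0] and random_runs[1] almost certainly differ
```

This is why benchmark runners either pin PYTHONHASHSEED or spawn many processes to average over random seeds.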

Commands to check these factors:

  • python3 -m perf system show to show the system state
  • python3 -m perf system tune tunes the system to run benchmarks
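python3 -m perf timeit runs the benchmark in multiple processes to average out the factors above. For contrast, a single-process baseline can be sketched with the stdlib timeit module (the attribute-lookup statement is only an illustrative workload):

```python
import timeit

# Illustrative micro-workload: attribute lookup on an instance.
setup = "class C: x = 1\no = C()"
timings = timeit.repeat("o.x", setup=setup, repeat=5, number=100_000)
best = min(timings) / 100_000  # best-of-5, per-loop cost in seconds
print(f"{best * 1e9:.1f} ns per loop")
```

A single process is exposed to all of the randomization effects listed above, which is exactly what the perf module works around.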

Random performance of modern Intel CPU

https://github.com/cyring/corefreq

Intel CPUs

https://en.wikipedia.org/wiki/List_of_Intel_CPU_microarchitectures

  • 2006: Intel Core
  • 2008: Nehalem (NHM)
    • 2010: Gulftown or Westmere-EP
  • 2008: Bonnell
  • 2011: Sandy Bridge (SNB)
    • 2012: Ivy Bridge (IVB)
  • 2013: Silvermont
  • 2013: Haswell (HSW)
  • 2015: Skylake (SKL)

CPU Pipeline

Modern Intel CPUs don’t execute CISC machine instructions (complex instructions) but decode them into RISC instructions (simple instructions). The RISC instructions are also reordered to reduce the latency of memory load and store instructions.

Branch prediction

The CPU pipeline must decode and reorder instructions faster than the units executing them. To keep the pipeline full, the CPU predicts branches (“if/else”). When a prediction is wrong, the pipeline must be flushed, which hurts performance.
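The classic demonstration is counting elements above a threshold in a sorted versus a shuffled array: sorted input makes the branch perfectly predictable. A sketch in Python (note: in CPython the interpreter overhead often dwarfs the branch-misprediction cost, so the gap is far smaller than in C):

```python
import random, timeit

data = list(range(65536))
shuffled = data[:]
random.seed(0)
random.shuffle(shuffled)

def count_big(xs):
    n = 0
    for x in xs:
        if x >= 32768:  # taken for exactly half of the values
            n += 1
    return n

# Sorted input: the branch is all-false then all-true, easy to predict.
t_sorted = timeit.timeit(lambda: count_big(data), number=10)
# Shuffled input: the branch outcome is effectively random.
t_shuffled = timeit.timeit(lambda: count_big(shuffled), number=10)
```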

Linux perf

The Linux perf program gives access to low-level events:

  • stalled-cycles-frontend: “The cycles stalled in the front-end are a waste because that means that the CPU does not feed the Back End with instructions. This can mean that you have misses in the Instruction cache, or complex instructions that are not already decoded in the micro-op cache.”
  • stalled-cycles-backend: “The cycles stalled in the back-end are a waste because the CPU has to wait for resources (usually memory) or to finish long latency instructions (e.g. transcendentals - sqrt, logs, etc.).”

“Another stall reason is branch prediction miss. That is called bad speculation. In that case uops are issued but they are discarded because the BP predicted wrong.”

Source: https://stackoverflow.com/questions/22165299/what-are-stalled-cycles-frontend-and-stalled-cycles-backend-in-perf-stat-resul

Memory caches, L1, L2, L3, L4, MMU, TLB

Memory accesses range from slow to very slow compared to the speed of the CPU. To be efficient, there are multiple levels of caches: L1 (fastest, on the CPU die), L2, L3, and sometimes even L4 (slowest, but also the largest).

Applications don’t handle physical memory addresses directly but use “virtual” addresses. The MMU (Memory Management Unit) is responsible for converting virtual addresses to physical addresses. When the Linux kernel switches to a different application, the TLB (Translation Lookaside Buffer) cache of the MMU must be flushed.
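Cache locality can be sketched by touching one byte per (assumed 64-byte) cache line in sequential versus random order over a buffer larger than L1/L2. Again, CPython's interpreter overhead may mask much of the effect that would be obvious in C:

```python
import random, timeit

N = 1 << 22                      # 4 MiB buffer, larger than typical L1/L2
buf = bytearray(N)
seq = list(range(0, N, 64))      # one index per assumed 64-byte cache line
rnd = seq[:]
random.seed(0)
random.shuffle(rnd)

def touch(indexes):
    s = 0
    for i in indexes:
        s += buf[i]
    return s

# Sequential order benefits from hardware prefetching and cache locality;
# random order forces many more cache misses.
t_seq = timeit.timeit(lambda: touch(seq), number=10)
t_rnd = timeit.timeit(lambda: touch(rnd), number=10)
```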

Reliable micro benchmarks

Linux setup: isolate CPUs

See http://bugs.python.org/issue26275#msg259556.

Identify physical CPU cores (required for Intel Hyper-Threading CPUs):

$ lscpu --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
0   0    0      0    0:0:0:0       yes    5900.0000 1600.0000
1   0    0      1    1:1:1:0       yes    5900.0000 1600.0000
2   0    0      2    2:2:2:0       yes    5900.0000 1600.0000
3   0    0      3    3:3:3:0       yes    5900.0000 1600.0000
4   0    0      0    0:0:0:0       yes    5900.0000 1600.0000
5   0    0      1    1:1:1:0       yes    5900.0000 1600.0000
6   0    0      2    2:2:2:0       yes    5900.0000 1600.0000
7   0    0      3    3:3:3:0       yes    5900.0000 1600.0000

I have a single CPU on a single socket. We will isolate physical cores 2 and 3, i.e. logical CPUs 2, 3, 6 and 7. Also be careful with NUMA: here, all physical cores are on the same NUMA node (0).
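The same core-to-CPU mapping can be read programmatically from sysfs on Linux; a sketch (cores_to_cpus is a hypothetical helper name):

```python
import glob

def cores_to_cpus():
    """Map each physical core_id to its logical CPUs (hyper-thread siblings)."""
    mapping = {}
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/topology/core_id"):
        cpu = int(path.split("/")[-3][3:])   # ".../cpuN/topology/core_id" -> N
        with open(path) as f:
            core = int(f.read())
        mapping.setdefault(core, []).append(cpu)
    return {core: sorted(cpus) for core, cpus in mapping.items()}

print(cores_to_cpus())   # on the machine above: {0: [0, 4], 1: [1, 5], ...}
```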

Reboot, enter GRUB and modify the Linux command line to add:

isolcpus=2,3,6,7

If your kernel supports it (CONFIG_NO_HZ=y and CONFIG_NO_HZ_FULL=y), you may also enable the tickless kernel on these CPUs. Add the following option to the command line:

nohz_full=2,3,6,7

Check that the Linux command line works:

$ cat /sys/devices/system/cpu/isolated
2-3,6-7
$ cat /sys/devices/system/cpu/nohz_full
2-3,6-7

Check stability of a benchmark

Download system_load.py: a script that simulates a busy system by spawning dummy workers until the system load is higher than the minimum specified on the command line.

  • Prefix benchmark command with taskset -c 2,3,6,7 to run the benchmark on isolated CPUs
  • Run the benchmark on an idle system
  • Run the benchmark with system_load.py 5 running in a different window

The two results must be close. Otherwise, CPU isolation doesn’t work.
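Instead of prefixing the command with taskset, a benchmark can also pin itself to the isolated CPUs from inside the process with os.sched_setaffinity. A sketch that falls back gracefully when CPUs 2, 3, 6, 7 don't exist on the current machine:

```python
import os

# Equivalent of "taskset -c 2,3,6,7", done from inside the process.
wanted = {2, 3, 6, 7}                      # the isolated CPUs in this setup
available = os.sched_getaffinity(0)
target = (wanted & available) or {min(available)}  # fall back gracefully
os.sched_setaffinity(0, target)
assert os.sched_getaffinity(0) == target
```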

You can also check the number of context switches by reading /proc/&lt;pid&gt;/status: look at voluntary_ctxt_switches and nonvoluntary_ctxt_switches. Both counters must stay low for a CPU-bound benchmark.
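Reading those two counters can be automated; a sketch on Linux (ctxt_switches is a hypothetical helper name):

```python
def ctxt_switches(pid="self"):
    """Return (voluntary, nonvoluntary) context switch counts from procfs."""
    voluntary = nonvoluntary = None
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("voluntary_ctxt_switches:"):
                voluntary = int(line.split()[1])
            elif line.startswith("nonvoluntary_ctxt_switches:"):
                nonvoluntary = int(line.split()[1])
    return voluntary, nonvoluntary

vol, nonvol = ctxt_switches()
print(f"voluntary={vol} nonvoluntary={nonvol}")
```

Sampling these counters before and after a benchmark run shows how often the kernel interrupted it.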

Micro optimisation

  • Linux kernel: The problem with prefetch: “So the conclusion is: prefetches are absolutely toxic, even if the NULL ones are excluded.”
  • Linux kernel likely() / unlikely() based on GCC __builtin_expect()

Memory

  • What Every Programmer Should Know About Memory

Help compiler to optimize

  • const keyword?
  • aliasing: -fno-strict-aliasing or __restrict__

Aliasing