Intel Alder Lake performance degradation after spin-wait


I'm tuning my program for low latency.

I have a tight calculation function, calc(), which makes heavy use of SIMD floating-point instructions.

I measured the performance of calc() with the perf command. It shows that calc() takes ~10k instructions and ~5k CPU cycles on average.

However, when I call calc() after a spin-wait like this:

while(true) {
  if (!flag.load(std::memory_order_acquire)) {
      continue;
  }

  calc();
}
 

calc() takes about 10k cycles, while the other perf counters (l1d-cache-misses, llc-misses, branch-misses, and instructions) remain the same.

Can anyone help explain why this happens and what I should do to avoid it? I want to keep calc() as fast as possible.

Also, I have two interesting findings:

  1. If the flag variable is set within a very short period (less than 1 ms), I do not notice any performance degradation in calc().

  2. If I add some garbage SIMD floating-point calculation in the middle of the spin-wait, I get the expected performance.
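
To make finding 2 concrete, here is a minimal sketch of what such a spin-wait could look like. It is not the original code: the helper name spin_until_set_with_simd and the particular dummy arithmetic are made up for illustration, and it assumes an x86-64 compiler that provides <immintrin.h>. The accumulator is returned only so that the compiler cannot discard the dummy work.

```cpp
#include <atomic>
#include <immintrin.h>

// Hypothetical helper (not from the original program): spin until `flag`
// becomes true, issuing a little throwaway SIMD floating-point work on
// every iteration. The dummy accumulator is returned so the compiler
// cannot optimize the arithmetic away.
inline float spin_until_set_with_simd(const std::atomic<bool>& flag) {
    __m128 acc = _mm_set1_ps(1.0f);
    const __m128 factor = _mm_set1_ps(1.0000001f);
    const __m128 cap = _mm_set1_ps(2.0f);
    while (!flag.load(std::memory_order_acquire)) {
        acc = _mm_mul_ps(acc, factor);  // garbage SIMD multiply
        acc = _mm_min_ps(acc, cap);     // keep the value bounded
    }
    return _mm_cvtss_f32(acc);          // lowest lane of the accumulator
}
```

The caller would run calc() right after this helper returns. Whether two SSE operations per iteration are enough to reproduce the effect is an assumption; the real "garbage" calculation may need to resemble the instruction mix of calc().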


My CPU is a 13900K. I also tested on a 12900K and on Ice Lake CPUs such as the Xeon 8368. They appear to show the same behaviour.

I noticed in the Optimization Reference Manual that there is a feature called Thread Director, which automatically detects thread classes at runtime, and that there is a special class for Pause (spin-wait) dominated code. I don't know whether this is related, but it looks as if, after some time, the CPU detects that the thread is in a spin-wait loop and then reduces the resources allocated to it?

Update

I'm testing on a Red Hat real-time kernel. I disabled the efficiency cores in the BIOS, set the CPU affinity to a specific core ID, and set the scheduling policy to FIFO with priority 99. I have also blocked as many interrupts as I can and reduced local timer interrupts to one per second.
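
For reference, the affinity and scheduling part of this setup can be done from inside the program roughly as follows. This is a sketch under assumptions, not the exact code in use: the helper names single_cpu_set and pin_and_set_fifo are made up, it targets Linux with glibc (g++ defines _GNU_SOURCE, which the CPU_* macros and pthread_setaffinity_np need), and SCHED_FIFO normally requires root or CAP_SYS_NICE.

```cpp
#include <pthread.h>
#include <sched.h>

// Hypothetical helper: build a CPU set that contains only `cpu`.
inline cpu_set_t single_cpu_set(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return set;
}

// Hypothetical helper: pin the calling thread to `cpu` and request
// SCHED_FIFO at `priority`. Returns 0 on success, otherwise the error
// number reported by the call that failed.
inline int pin_and_set_fifo(int cpu, int priority) {
    const cpu_set_t set = single_cpu_set(cpu);
    int rc = pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    if (rc != 0) {
        return rc;
    }
    sched_param param{};
    param.sched_priority = priority;
    return pthread_setschedparam(pthread_self(), SCHED_FIFO, &param);
}
```

A call such as pin_and_set_fifo(3, 99) at the start of the spinning thread would match the description above, where core ID 3 is only an example value.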

I also tried adding _mm_pause() in the middle of the spin loop (as suggested by the Optimization Reference Manual), but it did not help.
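
The _mm_pause() variant looks roughly like this. Again, this is an illustrative sketch: the helper name spin_until_set_with_pause and the iteration counter are made up, and the counter is there only to make the wait observable.

```cpp
#include <atomic>
#include <cstdint>
#include <immintrin.h>

// Hypothetical helper: spin on `flag` with one PAUSE hint per iteration.
// Returns the number of iterations spent waiting.
inline std::uint64_t spin_until_set_with_pause(const std::atomic<bool>& flag) {
    std::uint64_t spins = 0;
    while (!flag.load(std::memory_order_acquire)) {
        _mm_pause();  // spin-wait hint to the core
        ++spins;
    }
    return spins;
}
```

With this loop in place of the plain spin, calc() still shows the slowdown after a long wait.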

I bought the 13900K server from a specialist vendor. It uses a liquid cooling system, and all 8 performance cores are overclocked to 5.8 GHz. The boot command line of the system is:

# cat /proc/cmdline
BOOT_IMAGE=(hd0,gpt2)/vmlinuz-4.18.0-425.13.1.rt7.223.el8_7.x86_64 root=/dev/mapper/rhel-root ro crashkernel=auto rd.lvm.lv=rhel/root rhgb quiet isolcpus=0-6 rcu_nocbs=0-6 spectre_v2=off mitigations=off iommu=off intel_iommu=off tsc=reliable pcie_port_pm=off ipv6.disable=1 ipmi_si.force_kipmid=0 acpi_irq_nobalance rcu_nocb_poll clocksource=tsc selinux=0 intel_pstate=disable pcie_aspm=performance nosoftlockup audit=0 nmi_watchdog=0 mce=ignore_ce nohz=on intel_idle.max_cstate=0 processor.max_cstate=0 idle=poll transparent_hugepage=never hpet=disabled noht nohz_full=0-6 skew_tick=1