Consider the following valarray-like class:
#include <stdlib.h>

struct va
{
    void add1(const va& other);
    void add2(const va& other);

    size_t* data;
    size_t size;
};

void va::add1(const va& other) {
    for (size_t i = 0; i < size; ++i) {
        data[i] += other.data[i];
    }
}

void va::add2(const va& other) {
    for (size_t i = 0, s = size; i < s; ++i) {
        data[i] += other.data[i];
    }
}
The add2 function is vectorized by all of the compilers I tried (MSVC, Clang, GCC, ICC), whereas add1 is not. See https://godbolt.org/z/c61qvrrbv
This is explained by potential aliasing: the compilers cannot prove that none of the elements pointed to by data is the size member itself.
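For instance, a caller could (legally, as far as I can tell, even if it's contrived) make data point right at the size member, so the compiler has to assume the store in data[i] += ... may modify size:

    va v;
    v.data = &v.size;    // data[0] and v.size are the same object
    v.size = 1;

    size_t zero = 0;
    va other{&zero, 1};  // aggregate init: data = &zero, size = 1

    // data[0] += other.data[0] stores to v.size through data[0].
    // Adding 0 happens to leave the value (and the loop bound) unchanged,
    // so this stays well-defined, but the compiler can't know the addend is 0.
    v.add1(other);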
However, there's also potential overlap between the elements pointed to by data and by other.data. For MSVC there is additionally potential aliasing between those elements and the pointers themselves, since it does not take advantage of the strict aliasing rule; this applies to both add1 and add2.
The compilers emit checks for all the aliasing possibilities they suspect, and then execute the vectorized operation in add2.
Why don't they add more checks and enable this optimization for add1 as well?
It looks like compilers aren't able to realize (and don't add code to check) that data[i] never points to this->size. This would make the loop trip-count not calculable ahead of the first iteration, which is something no mainstream compiler except ICC can ever handle.

Hopefully compilers can learn to check for possible overlap before applying their vectorization logic, like (.data > this) || (.data + .size) < this; hopefully there's an efficient way to do that. They already invent code to check for overlap between the two arrays in add2.
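In source terms, the kind of check I mean might look roughly like this (a sketch with a made-up name, add1_checked; comparing unrelated pointers like this isn't portable C++, but it models the address check a compiler could emit internally before choosing the vectorized path):

    // Hypothetical: vectorize only if no element of [data, data + size)
    // can be the size member itself.
    static void add1_checked(va& self, const va& other) {
        const char* lo = (const char*)self.data;
        const char* hi = (const char*)(self.data + self.size);
        const char* sz = (const char*)&self.size;
        if (sz < lo || sz >= hi) {
            // The stores can't touch self.size: hoist it, loop can vectorize.
            for (size_t i = 0, s = self.size; i < s; ++i)
                self.data[i] += other.data[i];
        } else {
            // Possible aliasing: conservative loop reloading self.size each iteration.
            for (size_t i = 0; i < self.size; ++i)
                self.data[i] += other.data[i];
        }
    }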
(The more checking code is required at startup, the more profitable vectorization has to be to pay for itself; 64-bit scalar elements aren't that bad with baseline x86-64 compared to 128-bit vectors, especially when the compiler doesn't know from PGO that size is usually not small and that the loop is hot. I tried gcc -march=icelake-client and -march=znver4 to not only enable AVX2 but also set tuning heuristics for CPUs with very good vector throughput and cache/memory bandwidth. But still no joy, so this possible aliasing is probably a full roadblock, not a heuristic decision.)
Asm evidence for compilers being worried about aliasing this->size

Notice how GCC's loop branch is cmp rax, QWORD PTR [rdi+8], where rax holds i and [rdi+8] is this->size (x86-64 SysV calling convention), so it's reloading it every iteration. If we compile with -O3 -fno-tree-vectorize, we see that GCC's add2 loads the size into a register ahead of the loop and compares against that inside the loop, i.e. hoisting the load. The fact that GCC doesn't do that with add1 is a pretty clear sign that it thinks data[i] += ... can modify this->size.
Also, changing the type to unsigned *data; or anything that can't point to a size_t lets GCC, Clang, and ICC auto-vectorize add1. Using -fno-strict-aliasing disables vectorization again. (And makes compilers extra "paranoid", reloading this->data and other.data every iteration, like MSVC was always doing. It also breaks vectorization of add2 for those compilers.)
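That variant is just the same struct with a differently-typed element pointer (hypothetical name va_u, everything else unchanged); strict aliasing then says a store through data[i] can't touch the size_t member:

    struct va_u {
        void add1(const va_u& other);
        unsigned* data;   // an unsigned* store can't legally alias the size_t member
        size_t size;
    };

    void va_u::add1(const va_u& other) {
        for (size_t i = 0; i < size; ++i) {  // size can be treated as loop-invariant
            data[i] += other.data[i];
        }
    }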
Changing the pointer type doesn't help MSVC because it doesn't do type-based aliasing analysis; it always acts like gcc -fno-strict-aliasing. MSVC's add2 already does check for more than just overlap of the pointed-to arrays, I think; probably some of that extra cmp/jcc is checking that this->data[i] += ... won't change the .data pointer in this or other.

std::vector doesn't have size_t members, usually avoiding this

std::vector<size_t> wouldn't have this problem (at least in non-MSVC compilers) because type-based aliasing knows that size_t * can't point at another pointer. std::vector normally stores three pointers: .begin, .end, and end_of_capacity, so the size information is end - begin, not a member storing the size directly.

For iterating over one array, it's normally at least as efficient to increment a pointer, like for (... ; ptr < endp ; ptr++) *ptr, so you're not using indexed addressing modes. That's presumably why std::vector is normally laid out that way, rather than a pointer and two size_t members.
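A minimal sketch of that layout and loop style (made-up names, not the actual libstdc++/libc++ internals):

    #include <cstddef>

    // Vector-like layout: three pointers, no size_t member.
    struct vec_like {
        size_t* begin_;
        size_t* end_;              // size is end_ - begin_
        size_t* end_of_capacity_;

        void add(const vec_like& other) {
            // Pointer-increment loop: no indexed addressing mode is needed for
            // this array, and strict aliasing lets the compiler keep end_ in a
            // register, since a store through a size_t* can't modify a pointer member.
            const size_t* src = other.begin_;
            for (size_t* ptr = begin_; ptr != end_; ++ptr, ++src) {
                *ptr += *src;
            }
        }
    };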
Some RISC machines don't even have two-register indexed addressing modes. For iterating two arrays, some CPUs will do better with fewer instructions, just incrementing one index instead of two pointer increments, but it depends on the microarchitecture, e.g. some x86 CPUs un-laminate add reg, [reg + reg] into 2 uops in the back-end, not keeping it micro-fused, especially with 3-operand AVX instructions.

An efficient way (in asm) to loop over two arrays on CPUs with indexed addressing is to address one array relative to the other. It's UB to do this in C++, and it would obfuscate your code, so it's something compilers should do for you: sub rsi, rdi outside the loop (subtract pointers), then the loop body can be mov rax, [rsi + rdi] (second array = difference + first) / add [rdi], rax (first array) / add rdi, 8 (increment the pointer, which is also the index for the other array).

MSVC will actually do this optimization, which other compilers haven't picked up yet. (Godbolt)
Unfortunately, MSVC got it backwards and is using the two-register addressing mode as the memory source operand for vpaddq. It'll un-laminate at issue/rename into the ROB on Intel Sandybridge-family, including at least Skylake, probably some later. But vpaddq ymm1, ymm1, [rdx] would be 1 uop. The pure load vmovdqu would always be 1 uop even with an indexed addressing mode.

Indexed stores aren't ideal either (the store-address uop can't run on port 7 on Haswell / Skylake), and MSVC does avoid that. But it could get the best of both worlds by doing the pure load from the second array (other.data) with an indexed addressing mode, and then memory-source vpadd + store with the simple addressing mode like [rdx+32].

So MSVC did save some code-size, and it gets some benefit in back-end throughput by needing only one pointer increment of loop overhead, and in AGU port contention: with L1d cache hits this can run at close to 1 vector per clock, which requires doing 2 loads + 1 store every cycle (Intel's optimization manual suggests that Skylake can't fully sustain that for 32-byte loads/stores, for some unknown reason).
With an indexed addressing mode for the store like GCC and Clang use, it could only run at 1 vector per 1.5 cycles on Haswell / Skylake, since the store-address uop then competes with the loads for ports 2 and 3 instead of using port 7. (Ice Lake has two load AGUs and two separate store AGUs, avoiding this bottleneck, but I don't know if un-lamination of indexed addressing modes is still a thing on Ice Lake or Alder Lake.)
I don't know what's up with MSVC preferring lea for incrementing a pointer, or preferring sub ecx / jne on a separate counter instead of computing an end-pointer before the loop (with a lea instead of a mov) and comparing against that. Then the end of the loop could be cmp rax, r8 / jne or something, which can macro-fuse into a single uop on AMD, not just Intel.