The following code produces assembly that conditionally executes SIMD in GCC 12.3 when compiled with -O3. For completeness, the code always executes SIMD in GCC 13.2 and never executes SIMD in clang 17.0.1.
#include <array>
__attribute__((noinline)) void fn(std::array<int, 4>& lhs, const std::array<int, 4>& rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}
Here is the link in godbolt.
Here is the actual assembly from GCC 12.3 (with -O3):
fn(std::array<int, 4ul>&, std::array<int, 4ul> const&):
lea rdx, [rsi+4]
mov rax, rdi
sub rax, rdx
cmp rax, 8
jbe .L2
movdqu xmm0, XMMWORD PTR [rsi]
movdqu xmm1, XMMWORD PTR [rdi]
paddd xmm0, xmm1
movups XMMWORD PTR [rdi], xmm0
ret
.L2:
mov eax, DWORD PTR [rsi]
add DWORD PTR [rdi], eax
mov eax, DWORD PTR [rsi+4]
add DWORD PTR [rdi+4], eax
mov eax, DWORD PTR [rsi+8]
add DWORD PTR [rdi+8], eax
mov eax, DWORD PTR [rsi+12]
add DWORD PTR [rdi+12], eax
ret
I am very interested to know a) the purpose of the first 5 assembly instructions and b) if there is anything that can be done to cause GCC 12.3 to emit the code of GCC 13.2 (ideally, without manually writing SSE).
It seems GCC12 is treating the class reference like it would a simple int *, in terms of whether lhs and rhs could partially overlap.
Exact overlap would be fine: if lhs[idx] is the same int as rhs[idx], we read it twice before writing it. But with partial overlap, rhs[3] for example could have been updated by one of the lhs[0..2] additions, which wouldn't happen with SIMD if we did all the loads first before any of the stores.
GCC13 knows that class objects aren't allowed to partially overlap (except for common initial sequence stuff for different struct/class types, which I think doesn't apply here). That would be UB so it can assume it doesn't happen. GCC12's code-gen is a missed optimization.
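To make the partial-overlap case concrete, here is a small sketch using plain int pointers, where overlap is legal. The function names are made up for illustration. `add_in_order` is the same loop as in the question, and `add_loads_first` mimics what a single 128-bit load/add/store does. With lhs starting one int (4 bytes) past rhs, the two give different results, which appears to be the case the address comparison at the top of the GCC 12.3 output is guarding against.

```cpp
#include <array>
#include <cstddef>

// Same loop as the question, but on raw pointers that may legally overlap.
void add_in_order(int* lhs, const int* rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}

// Same arithmetic, but every load happens before any store,
// like a single SIMD load + paddd + store would do.
void add_loads_first(int* lhs, const int* rhs)
{
    int l[4];
    int r[4];
    for (std::size_t idx = 0; idx != 4; ++idx) {
        l[idx] = lhs[idx];
        r[idx] = rhs[idx];
    }
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = l[idx] + r[idx];
    }
}

// Run both versions with lhs = buf + 1 and rhs = buf (partial overlap).
std::array<int, 5> overlap_in_order()
{
    std::array<int, 5> buf{1, 1, 1, 1, 1};
    add_in_order(buf.data() + 1, buf.data());
    return buf;  // each store feeds the next rhs load: {1, 2, 3, 4, 5}
}

std::array<int, 5> overlap_loads_first()
{
    std::array<int, 5> buf{1, 1, 1, 1, 1};
    add_loads_first(buf.data() + 1, buf.data());
    return buf;  // all loads saw the original values: {1, 2, 2, 2, 2}
}
```

So for pointers that might partially overlap, the compiler can only use the SIMD version after checking the addresses at run time.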
So how do we help GCC12? The usual go-to is __restrict for removing overlap checks or enabling auto-vectorization at all when the compiler doesn't want to invent checks + a fallback. In C, restrict is part of the language, but in C++ it's only an extension. (Supported by the major mainstream compilers, and you can use the preprocessor to #define it to the empty string on others.) You can use __restrict with references as well as pointers. (At least GCC and Clang accept it with no warnings at -Wall; I didn't double-check the docs to be sure this is standard.)
Or manually read all of lhs before writing any of it.
Since your array is small enough to fit in one SIMD register, there's no inefficiency in copying. This would be bad for array<int, 1000> or something!
Both of these compile to the same auto-vectorized asm as GCC13, with no wasted instructions (Godbolt)
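A sketch of what the two workarounds might look like. The names fn_restrict and fn_copy are hypothetical, and this is not necessarily the exact code behind the Godbolt link. It assumes a compiler that accepts the __restrict extension on references (GCC and Clang do).

```cpp
#include <array>
#include <cstddef>

// Workaround 1: promise the compiler that lhs and rhs don't overlap.
// __restrict is a compiler extension in C++, not standard.
__attribute__((noinline))
void fn_restrict(std::array<int, 4>& __restrict lhs,
                 const std::array<int, 4>& __restrict rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs[idx] = lhs[idx] + rhs[idx];
    }
}

// Workaround 2: read all of lhs into a local before writing any of it.
// The local can't alias rhs, so no overlap check is needed.
__attribute__((noinline))
void fn_copy(std::array<int, 4>& lhs, const std::array<int, 4>& rhs)
{
    std::array<int, 4> tmp = lhs;  // all loads from lhs happen here
    for (std::size_t idx = 0; idx != 4; ++idx) {
        tmp[idx] += rhs[idx];
    }
    lhs = tmp;                     // all stores to lhs happen here
}
```

For distinct, non-overlapping arrays both functions give the same result as the original fn. Only the assumptions handed to the optimizer differ.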
Promising alignment (like alignas(16) on one of the types?) could let it use paddd xmm1, [rdi], a memory source operand, without AVX.
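One way to try the alignment idea, as an untested sketch: wrap the array in a hypothetical 16-byte-aligned struct (AlignedInts and fn_aligned are made-up names). Whether a given GCC version then actually folds the load into paddd is something to check on Godbolt, not something this code guarantees.

```cpp
#include <array>
#include <cstddef>

// Hypothetical wrapper: alignas(16) tells the compiler every object of
// this type starts on a 16-byte boundary.
struct alignas(16) AlignedInts {
    std::array<int, 4> v;
};

static_assert(alignof(AlignedInts) == 16, "wrapper must be 16-byte aligned");

// Same loop as before, on the aligned wrapper, with __restrict
// (a compiler extension) to rule out overlap as well.
__attribute__((noinline))
void fn_aligned(AlignedInts& __restrict lhs, const AlignedInts& __restrict rhs)
{
    for (std::size_t idx = 0; idx != 4; ++idx) {
        lhs.v[idx] = lhs.v[idx] + rhs.v[idx];
    }
}
```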