I have a __mmask64 as a result of a few AVX512 operations:
__mmask64 mboth = _kand_mask64(lres, hres);
I would like to count the number of nibbles in this that have all bits set (0xF).
The simple solution is to do this:
uint64_t imask = (uint64_t)mboth;
while (imask) {
    if ((imask & 0xf) == 0xf)   // all four bits of this nibble set
        ret++;
    imask = imask >> 4;
}
I wanted something better, but what I came up with doesn't feel elegant:
//outside the loop
__m512i b512_1s = _mm512_set1_epi32(0xffffffff);
__m512i b512_0s = _mm512_set1_epi32(0x00000000);
//then...
__m512i vboth = _mm512_mask_set1_epi8(b512_0s, mboth, 0xff);
__mmask16 bits = _mm512_cmpeq_epi32_mask(b512_1s, vboth);
ret += __builtin_popcount((unsigned int)bits);
The above puts a 0xff byte into the vector wherever a 1 bit exists in the mask, then sets a 1 bit in the bits mask where the blown-up 0xf nibbles are now found as 0xffffffff int32s.
I feel that two 512-bit operations are way overkill when the original data lives in a 64-bit number. This alternative is probably much worse; it's too many instructions and still operates on 128 bits:
//outside the loop
__m128i b128_1s = _mm_set1_epi32(0xffffffff);
//then...
uint64_t maskl = mboth & 0x0f0f0f0f0f0f0f0f;   // low nibble of each byte
uint64_t maskh = mboth & 0xf0f0f0f0f0f0f0f0;   // high nibble of each byte
uint64_t mask128[2] = { (maskl << 4) | maskl, (maskh >> 4) | maskh };   // widen each nibble to a full byte
__m128i bytes = _mm_cmpeq_epi8(b128_1s, _mm_loadu_si128((const __m128i*)mask128));
unsigned bits = _mm_movemask_epi8(bytes);
ret += __builtin_popcount(bits);
With just some scalar operations you can do this:
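A minimal sketch of that approach (assuming left shifts and the 0x8888888888888888 mask; right shifts or rotates work too, as noted below):
uint64_t imask = (uint64_t)mboth;
imask &= imask << 2;                 // bit 3 of each nibble = bit3 & bit1, bit 2 = bit2 & bit0
imask &= imask << 1;                 // bit 3 of each nibble = AND of all four of its bits
imask &= 0x8888888888888888;         // keep only the most significant bit of each nibble
ret += __builtin_popcountll(imask);  // one set bit per all-ones nibble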
The first two steps put, for every nibble, the horizontal AND of the bits of that nibble in the most significant bit of that nibble. The other bits of the nibble become something that we don't want here so we just mask them out. Then popcount the result.
The shifts could go to the right (as in an earlier version of this answer) or they could be rotates, whichever works out best.
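For example, a right-shift variant could be (my choice of shift order and the 0x1111111111111111 mask, not necessarily what the earlier revision used):
uint64_t imask = (uint64_t)mboth;
imask &= imask >> 1;                 // bit 0 of each nibble = bit1 & bit0, bit 2 = bit3 & bit2
imask &= imask >> 2;                 // bit 0 of each nibble = AND of all four of its bits
imask &= 0x1111111111111111;         // keep only the low bit of each nibble
ret += __builtin_popcountll(imask);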
Clang makes efficient asm from the left-shift version, with no wasted instructions other than xor-zeroing ahead of popcnt, which should go away with inlining since it can popcnt same,same even without planning ahead to have the result in EAX for the calling convention.

GCC does OK, but reorders the &= mask last so it's part of the critical-path latency, not in parallel with the shifts, despite our best efforts to make the source look like single asm operations to try to hand-hold it into making better asm. MSVC is weird with this, turning it into right shifts, as well as doing the &= mask last like GCC.

The left-shift version also prevents clang from deoptimizing a shift or rotate into lea reg, [0 + reg*4], which is 8 bytes long and has 2-cycle latency on Alder Lake / Sapphire Rapids (https://uops.info/).

Godbolt for this and several other versions (including a portable version of chtz's ADD/ADC trick).

Using asm("" : "+r"(imask)) at a certain point in the function can force GCC not to deoptimize the order of operations, but that could stop it from optimizing this as part of a larger loop.

Writing this with multiple operations on the same source line doesn't hurt anything for Clang, and doing it this way still didn't stop GCC from screwing it up, but it does illustrate what optimal asm should be like. You might prefer to compact it back up into fewer C statements.
GCC's reordering to group shift-and-AND together is useful in general for AArch64, where and x1, x0, x0, lsr 2 is possible. But even then, instruction-level parallelism would be possible while still only using 3 AND instructions, two with shifted operands. GCC/Clang/MSVC miss that optimization. AArch64 repeated-pattern immediates for bitwise instructions do allow 0x2222222222222222 or 0x8888888888888888, so no separate constant setup is needed.