Collapse __mmask64 aka 64-bit integer value, counting nibbles that have all bits set?


I have a __mmask64 as a result of a few AVX512 operations:

__mmask64 mboth = _kand_mask64(lres, hres);

I would like to count the number of nibbles in this that have all bits set (0xF).

The simple solution is to do this:

uint64_t imask = (uint64_t)mboth;
while (imask) {
    if ((imask & 0xf) == 0xf)
        ret++;
    imask = imask >> 4;
} 

I wanted something better, but what I came up with doesn't feel elegant:

    //outside the loop
    __m512i b512_1s = _mm512_set1_epi32(0xffffffff);
    __m512i b512_0s = _mm512_set1_epi32(0x00000000);

    //then...
    __m512i vboth = _mm512_mask_set1_epi8(b512_0s, mboth, 0xff);
    __mmask16 bits = _mm512_cmpeq_epi32_mask(b512_1s, vboth);
    ret += __builtin_popcount((unsigned int)bits);

The above puts a 0xff byte into the vector wherever a 1 bit exists in the mask, then gets a 1 bit in the bits mask wherever a blown-up 0xf nibble is now found as a 0xffffffff int32.
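The byte expansion can also be written with the single intrinsic _mm512_movm_epi8 (AVX512BW, which _mm512_mask_set1_epi8 already requires, so compile with AVX512BW enabled); a minimal sketch of that equivalent formulation, with a hypothetical helper name:

    #include <immintrin.h>

    // Sketch: same expand-and-compare idea, with the byte expansion done by vpmovm2b
    static inline int count_f_nibbles_mask(__mmask64 mboth)
    {
        __m512i vexp = _mm512_movm_epi8(mboth);               // 0xFF byte wherever a mask bit is 1
        __mmask16 eq = _mm512_cmpeq_epi32_mask(vexp, _mm512_set1_epi32(-1));
        return __builtin_popcount((unsigned int)eq);          // one bit per all-ones dword, i.e. per 0xF nibble
    }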

I feel that two 512-bit operations are way overkill when the original data lives in a 64-bit number. This alternative is probably much worse; it's too many instructions and still operates on 128 bits:

    //outside the loop
    __m128i b128_1s = _mm_set1_epi32(0xffffffff);

    //then...
    uint64_t maskl = mboth & 0x0f0f0f0f0f0f0f0f;
    uint64_t maskh = mboth & 0xf0f0f0f0f0f0f0f0;
    uint64_t mask128[2] = { (maskl << 4) | maskl, (maskh >> 4) | maskh };
    __m128i bytes = _mm_cmpeq_epi8(b128_1s, _mm_loadu_si128((const __m128i*)mask128));
    unsigned bits = _mm_movemask_epi8(bytes);
    ret += __builtin_popcount(bits);

There are 3 answers

Answer by harold (accepted, score 7)

With just some scalar operations you can do this:

imask &= imask << 2;
imask &= imask << 1;
ret += std::popcount(imask & 0x8888888888888888);

The first two steps put, for every nibble, the horizontal AND of the bits of that nibble in the most significant bit of that nibble. The other bits of the nibble become something that we don't want here so we just mask them out. Then popcount the result.
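As a minimal standalone check (assuming C++20 for std::popcount), tracing imask = 0xF2FE, whose nibbles from the low end are E, F, 2, F, so exactly two of them are 0xF:

#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
    uint64_t imask = 0xF2FE;       // nibbles from the low end: E, F, 2, F
    imask &= imask << 2;           // each bit ANDed with the bit two places below it
    imask &= imask << 1;           // MSB of each nibble is now the AND of all 4 of its bits
    // only the MSBs of the two 0xF nibbles survive the mask, so this prints 2
    std::printf("%d\n", std::popcount(imask & 0x8888888888888888));
}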

The shifts could go to the right (as in an earlier version of this answer) or they could be rotates, whichever works out best.
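Mirrored to the right, the AND of each nibble collects in its least significant bit instead, so the mask becomes 0x1111...1111. A minimal sketch of that variant (the function name is just for illustration):

#include <bit>
#include <cstdint>

// Right-shift variant: the LSB of each nibble ends up holding the AND of its 4 bits
int count_nibbles_rshift(uint64_t imask)
{
    imask &= imask >> 2;
    imask &= imask >> 1;
    return std::popcount(imask & 0x1111111111111111);
}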


Clang makes efficient asm from the version below, with no wasted instructions other than an xor-zeroing ahead of the popcnt; that should go away with inlining, since it can then popcnt same,same without having to plan ahead for the result ending up in EAX for the calling convention.

GCC does OK, but it reorders the &= mask to come last, so it's part of the critical-path latency instead of running in parallel with the shifts, despite our best efforts to make the source look like single asm operations to hand-hold it into making better asm.

MSVC is weird with this, turning it into right shifts, as well as doing the &= mask last like GCC.

// Clang compiles this optimally, hopefully also when inlining
// GCC still does the & mask last, on the critical path
// MSVC mangles this, with two right shifts despite the source going left, and deoptimizes latency like GCC
int count_nibbles(uint64_t imask)
{
    uint64_t mask = 0x2222222222222222;  // movabs, or hoisted out of a loop
    uint64_t pairs = imask << 1;         // LEA dst, [src+src] into a new reg
    pairs &= imask;                      // AND: bit n = bit n AND bit n-1
    uint64_t shifted = pairs >> 2;       // SHR (on a copy): upper pair of each nibble moves down to bit 1
    pairs &= mask;                       // AND with the mask, in parallel with the SHR
    shifted &= pairs;                    // AND: bit 1 of each nibble = AND of all 4 of its bits
    return std::popcount(shifted);       // POPCNT
}

This version also prevents clang from deoptimizing a shift or rotate into lea reg, [0 + reg*4] which is 8 bytes long and has 2-cycle latency on Alder Lake / Sapphire Rapids. (https://uops.info/).

Godbolt for this and several other versions (including a portable version of chtz's ADD/ADC trick). Using asm("" : "+r"(imask)) at a certain point in the function can force GCC not to deoptimize the order of operations, but that could stop it optimizing this as part of a larger loop.

Writing this with multiple operations on the same source line doesn't hurt anything for Clang, and spelling it out one operation per line still didn't stop GCC from screwing it up, but it does illustrate what the optimal asm should look like. You might prefer to compact it back up into fewer C statements.
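Compacted back up, the same computation could look like the sketch below (equivalent to the function above, just fewer statements; the name is for illustration):

#include <bit>
#include <cstdint>

int count_nibbles_compact(uint64_t imask)
{
    uint64_t pairs = imask & (imask << 1);     // bit n = bit n AND bit n-1
    return std::popcount((pairs >> 2) & pairs & 0x2222222222222222);
}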


GCC's reordering to group shift-and-AND together is useful in general for AArch64, where and x1, x0, x0, lsr #2 is possible. But even then, instruction-level parallelism would be possible while still using only 3 AND instructions, two of them with shifted operands; GCC/Clang/MSVC miss that optimization. AArch64's repeated-pattern immediates for bitwise instructions do allow 0x2222222222222222 or 0x8888888888888888, so no separate constant setup is needed.

Answer by njuffa (score 8)

[I would like to thank chtz for spotting a critical error in my original answer.]

We can adapt Alan Mycroft's null-byte detection algorithm, which generates a byte 0x80 for every null byte found and a byte 0x00 otherwise. We need to make three modifications. Instead of byte-wise masks we use nibble-wide masks. Instead of null nibbles we look for 0xF nibbles, by inverting all the nibbles first. And we need to prevent carry-out from one nibble into the next higher nibble, since Mycroft's algorithm was only designed to flag the least significant null byte. Compiled with gcc 13.2 and the flags -O3 -march=core-avx2, this results in fairly efficient code (check it out on Compiler Explorer); the generated assembly and the full test program follow:

count_f_nibbles(unsigned long):
        movabs  rax, 8608480567731124087
        movabs  rdx, 1229782938247303441
        and     rax, rdi
        add     rax, rdx
        movabs  rdx, -8608480567731124088
        and     rax, rdi
        and     rax, rdx
        popcnt  rax, rax
        ret
#include <cstdio>
#include <cstdlib>
#include <cstdint>

int popcount64 (uint64_t x) 
{
    int r = 0;
    while (x) {
        x = x & (x - 1);
        r++;
    }
    return r;
}

/* Adapted from Alan Mycroft's null-byte detection algorithm 
   newsgroup comp.lang.c, 1987/04/08,
   https://groups.google.com/forum/#!original/comp.lang.c/2HtQXvg7iKc/xOJeipH6KLMJ
*/
int count_f_nibbles (uint64_t a)
{
    const uint64_t nibble_lsb = 0x1111111111111111ULL;
    const uint64_t nibble_msb = 0x8888888888888888ULL; 
    uint64_t t;
    a = ~a ;               // not-nibble: 0x0 where a has 0xf
    t = a |  nibble_msb;   // set nibble msbs to catch carry out from nibbles
    t = t -  nibble_lsb;   // msb = 0, if not-nibble was 0x0 or 0x8
    a = ~a & nibble_msb;   // extract msb, msb = 1 if not-nibble < 0x8
    t = a & ~t;            // msb = 1, if not-nibble was 0x0
    return popcount64 (t); // use std::popcount() where available!
}

int count_f_nibbles_ref (uint64_t imask)
{
    int ret = 0;
    while (imask) {
        if ((imask & 0xf) == 0xf) {
            ret++;
        }
        imask = imask >> 4;
    } 
    return ret;
}

/*
  https://groups.google.com/forum/#!original/comp.lang.c/qFv18ql_WlU/IK8KGZZFJx4J
  From: geo <[email protected]>
  Newsgroups: sci.math,comp.lang.c,comp.lang.fortran
  Subject: 64-bit KISS RNGs
  Date: Sat, 28 Feb 2009 04:30:48 -0800 (PST)

  This 64-bit KISS RNG has three components, each nearly
  good enough to serve alone.    The components are:
  Multiply-With-Carry (MWC), period (2^121+2^63-1)
  Xorshift (XSH), period 2^64-1
  Congruential (CNG), period 2^64
*/
static uint64_t kiss64_x = 1234567890987654321ULL;
static uint64_t kiss64_c = 123456123456123456ULL;
static uint64_t kiss64_y = 362436362436362436ULL;
static uint64_t kiss64_z = 1066149217761810ULL;
static uint64_t kiss64_t;
#define MWC64  (kiss64_t = (kiss64_x << 58) + kiss64_c, \
                kiss64_c = (kiss64_x >> 6), kiss64_x += kiss64_t, \
                kiss64_c += (kiss64_x < kiss64_t), kiss64_x)
#define XSH64  (kiss64_y ^= (kiss64_y << 13), kiss64_y ^= (kiss64_y >> 17), \
                kiss64_y ^= (kiss64_y << 43))
#define CNG64  (kiss64_z = 6906969069ULL * kiss64_z + 1234567ULL)
#define KISS64 (MWC64 + XSH64 + CNG64)

int main (void)
{
    uint64_t a;
    int res, ref;
    for (uint64_t i = 0; i < 100000000; i++) {
        a = KISS64;
        res = count_f_nibbles (a);
        ref = count_f_nibbles_ref (a);
        if (res != ref) {
            printf ("error: a=%016llx res=%d ref=%d\n", a, res, ref);
            return EXIT_FAILURE;
        }
    }
    return EXIT_SUCCESS;
}

An alternative to Mycroft's algorithm is yet another classical SIMD-in-a-register technique: a per-subunit unsigned greater-than-or-equal comparison, here checking whether each nibble is >= 0xF. This is ultimately based on the sum and carry bit vectors that occur during addition; see the comment on vcmpgeu_nibble(). Per nibble, the expression in vcmpgeu_nibble() computes ceil((a + ~b) / 2) without carries or borrows crossing nibble boundaries, and its msb is set exactly when a + ~b >= 15, i.e. when a >= b.

/* Set per-nibble msb when per-nibble unsigned a >= b is true. Based on
   Peter L. Montgomery's observation (newsgroup comp.arch, 2000/02/11,
   https://groups.google.com/d/msg/comp.arch/gXFuGZtZKag/_5yrz2zDbe4J):
   (A+B)/2 = (A AND B) + (A XOR B)/2.
*/
uint64_t vcmpgeu_nibble (uint64_t a, uint64_t b)
{
    const uint64_t nibble_msb = 0x8888888888888888ULL;
    return ((a | ~b) - (((a ^ ~b) >> 1) & ~nibble_msb));
}

int count_f_nibbles_vcmpgeu (uint64_t a)
{
    const uint64_t nibbles_ff = 0xffffffffffffffffULL;
    const uint64_t nibble_msb = 0x8888888888888888ULL;
    a = vcmpgeu_nibble (a, nibbles_ff) & nibble_msb;
    return popcount64 (a);
}

The code gcc 13.2 generates for this is (see it at Compiler Explorer):

count_f_nibbles_vcmpgeu(unsigned long):
        movabs  rdx, 8608480567731124087
        mov     rax, rdi
        shr     rax
        and     rax, rdx
        sub     rdi, rax
        movabs  rax, -8608480567731124088
        and     rdi, rax
        xor     eax, eax
        popcnt  rax, rdi
        ret
Answer by chtz (score 3)

Alternative solution which uses add-with-carry:

int count_nibbles_carry(uint64_t imask)
{
    uint64_t ones = 0x1111111111111111;
    uint64_t odd = imask & ones;
#if 1
    // use this to force inline assembly (gcc/clang syntax)
    asm ("add %2, %0\n\t"
         "adc $0, %0" 
    : "=r" (odd) 
    : "0" (odd), "r" (imask));
#else
    // backup code (but also incompatible with MSVC, I guess)
    __uint128_t sum_with_carry = __uint128_t(imask) + odd;
    uint64_t carry = sum_with_carry >> 64;
    odd = sum_with_carry + carry;
#endif

    return std::popcount(odd & ones);
}

The idea is to first isolate, in odd, the lowest bit of every nibble that has it set. If we add odd + imask, the lowest bit of each result nibble becomes zero unless there is a carry in from the previous nibble, and a nibble produces a carry into the next one exactly when it was 0xF (only then does its value plus its own lowest bit reach 16). Only the carry out of the highest nibble ends up in the CPU's carry flag; it can be added back with an adc $0, %rax into the (otherwise always unset) lowest bit. Then we have to mask away all except the lowest bits (re-using the first ones mask) and do a popcount.
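As a minimal standalone trace of that carry behaviour, using the portable fallback path (gcc/clang only, because of __uint128_t):

#include <bit>
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t ones = 0x1111111111111111;
    uint64_t imask = 0xF5F;                        // nibbles from the low end: F, 5, F
    uint64_t odd   = imask & ones;                 // 0x111: lowest bit of every nibble that has it set
    __uint128_t sum = (__uint128_t)imask + odd;    // 0x1070: each 0xF nibble carried into the next one
    uint64_t folded = (uint64_t)sum + (uint64_t)(sum >> 64);  // fold the top carry back into bit 0
    std::printf("%d\n", std::popcount(folded & ones));        // prints 2: two nibbles were 0xF
}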

Clang actually figures out how to produce just an add and an adc from the #else part, but it emits another xor rax,rax before the popcnt.

I added the source to Peter's godbolt link (from a comment to harold's answer): https://godbolt.org/z/G9q3dqoon

And here is some test code using code by njuffa: https://godbolt.org/z/1oPM9f1f9