I was trying to count how many 1 bits there are in 512 MB of memory, and I found two possible methods: _mm_popcnt_u64() and __builtin_popcountll() from the GCC builtins.
_mm_popcnt_u64() is said to use the SSE4.2 CPU instruction, which seems to be the fastest option, and __builtin_popcountll() is expected to use a table lookup.
So I think __builtin_popcountll() should be a little slower than _mm_popcnt_u64().
However, I got a result like this:
The two methods took almost the same time. I strongly suspect that they work the same way.
I also found this in popcntintrin.h:
/* Calculate a number of bits set to 1. */
extern __inline int __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_popcnt_u32 (unsigned int __X)
{
  return __builtin_popcount (__X);
}
#ifdef __x86_64__
extern __inline long long __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_popcnt_u64 (unsigned long long __X)
{
  return __builtin_popcountll (__X);
}
#endif
So I'm confused about how on earth __builtin_popcountll() actually works.

_mm_popcnt_u64 is part of <nmmintrin.h>, a header devised by Intel that provides utility functions for accessing SSE 4.2 instructions. __builtin_popcountll is a GCC extension. _mm_popcnt_u64 is portable to non-GNU compilers, and __builtin_popcountll is portable to non-SSE-4.2 CPUs. But on systems where both are available, both should compile to the exact same code.
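To make the comparison concrete, here is a minimal sketch of counting bits in a buffer both ways. The function names count_bits_builtin and count_bits_intrinsic and the HAVE_POPCNT_INTRINSIC macro are made up for illustration. It assumes a reasonably recent GCC or Clang, and it only uses the intrinsic when the compiler reports POPCNT support (for example when building with -mpopcnt or -march=native on x86-64); otherwise the second function simply calls the first.

```c
#include <stddef.h>
#include <stdint.h>

/* Only pull in the Intel header when the target actually has POPCNT.
   HAVE_POPCNT_INTRINSIC is a hypothetical macro used just in this sketch. */
#if defined(__x86_64__) && defined(__POPCNT__)
#include <nmmintrin.h>   /* declares _mm_popcnt_u64 */
#define HAVE_POPCNT_INTRINSIC 1
#else
#define HAVE_POPCNT_INTRINSIC 0
#endif

/* Count set bits with the GCC builtin; compiles for any target. */
uint64_t count_bits_builtin(const uint64_t *words, size_t n)
{
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)__builtin_popcountll(words[i]);
    return total;
}

/* Count set bits with the Intel intrinsic when it is available,
   otherwise fall back to the builtin version above. */
uint64_t count_bits_intrinsic(const uint64_t *words, size_t n)
{
#if HAVE_POPCNT_INTRINSIC
    uint64_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += (uint64_t)_mm_popcnt_u64(words[i]);
    return total;
#else
    return count_bits_builtin(words, n);
#endif
}
```

When POPCNT is enabled at compile time, I would expect both loops to produce the same popcnt instruction, which matches the timings in the question. Without it, GCC generally has to fall back to a software routine for the builtin, and that is where a slower path could show up. Comparing the output of gcc -S with and without -mpopcnt is the easiest way to confirm this on a given setup.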