I have a peculiar requirement that needs to be fulfilled efficiently. (SIMD, perhaps?)
src is an array of bytes. Every group of 4 bytes in the array needs to be processed as:
- Multiply low nibble of src[0] by a constant number A.
- Multiply low nibble of src[1] by a constant number B.
- Multiply low nibble of src[2] by a constant number C.
- Multiply low nibble of src[3] by a constant number D.
Sum the four parts above to give result.
Move on to the next set of 4 bytes and re-compute result (rinse & repeat till end of byte array).
result is guaranteed to be small (it even fits in a byte) owing to all numbers involved being very small. However, the data type for result can be flexible to support an efficient algorithm.
Any suggestions / tips / tricks to go faster than the following pseudo-code?:
for (int i = 0; i < length; i += 4)
{
    result = (src[i] & 0x0f) * A + (src[i+1] & 0x0f) * B + (src[i+2] & 0x0f) * C + (src[i+3] & 0x0f) * D;
}
BTW, result then forms an index into a higher-order array.
This particular loop is so crucial that implementation language is no barrier. I can choose any of C#, C or MASM64.
Here’s an example of how to do that efficiently with SSE intrinsics.
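For reference, a minimal sketch of this approach in C, assuming A, B, C and D are unsigned bytes and the input length is a multiple of 16 bytes (4 groups). The function name weighted_nibble_sums, the TARGET_SSSE3 macro and the choice of one uint32_t output per group are illustrative assumptions, not something the question prescribes.

```c
#include <stddef.h>
#include <stdint.h>
#include <tmmintrin.h> /* SSSE3: _mm_maddubs_epi16 */

/* Lets GCC/Clang build this function without -mssse3 (assumption: x86 target). */
#if defined(__GNUC__)
#define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
#define TARGET_SSSE3
#endif

/* Hypothetical helper: writes one result per 4-byte group into dst.
   Assumes length is a multiple of 16 and dst has room for length / 4 entries. */
TARGET_SSSE3
static void weighted_nibble_sums(const uint8_t *src, size_t length, uint32_t *dst,
                                 uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    const __m128i low_nibbles = _mm_set1_epi8(0x0F);
    const __m128i low_words = _mm_set1_epi32(0xFFFF);
    /* A, B, C, D repeated in every 32-bit lane; A is the lowest byte. */
    const uint32_t packed = (uint32_t)a | ((uint32_t)b << 8) |
                            ((uint32_t)c << 16) | ((uint32_t)d << 24);
    const __m128i mul = _mm_set1_epi32((int)packed);

    for (size_t i = 0; i < length; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(src + i));
        v = _mm_and_si128(v, low_nibbles); /* keep low nibbles only */
        /* pmaddubsw: first argument is unsigned bytes (the constants), second is
           signed bytes (nibbles, 0..15). Each 16-bit lane holds A*n0 + B*n1 or
           C*n2 + D*n3; the maximum 2 * 255 * 15 = 7650 cannot saturate. */
        v = _mm_maddubs_epi16(mul, v);
        /* Second reduction step: add the odd and even 16-bit lanes of each group. */
        v = _mm_add_epi16(_mm_srli_epi32(v, 16), _mm_and_si128(v, low_words));
        _mm_storeu_si128((__m128i *)(dst + i / 4), v);
    }
}
```

Each 32-bit lane of the stored vector ends up holding the result for one group, with the upper 16 bits zero.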
The code uses the pmaddubsw SSSE3 instruction for the multiplication and the first step of the reduction, then adds even/odd uint16_t lanes in the vector.

The above code assumes your ABCD numbers are unsigned bytes. If they are signed, you'll need to flip the order of arguments of the _mm_maddubs_epi16 intrinsic and use different code for the second reduction step: _mm_slli_epi32( v, 16 ), _mm_add_epi16, _mm_srai_epi32( v, 16 ).

If you have AVX2 the upgrade is trivial: replace __m128i with __m256i, and _mm_something with _mm256_something.

If the length of your input is not necessarily a multiple of 4 groups, note you'll need special handling for the final incomplete batch of numbers. Without the _mm_maskload_epi32 AVX2 instruction, here's one possible way to load an incomplete vector of 4-byte groups:

P.S. Since you'll then use these integers as indices, note it only takes a few cycles of latency to extract integers from SSE vectors with the _mm_extract_epi32 instruction.
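For the incomplete final batch and the extraction step mentioned above, here is a minimal sketch in C. It is one simple option, not necessarily the fastest: it copies the remaining groups into a zeroed 16-byte buffer and loads that, so the missing groups contribute zero. The names load_partial_groups and extract_indices and the TARGET_SSE41 macro are hypothetical, and _mm_extract_epi32 requires SSE4.1.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <emmintrin.h> /* SSE2: _mm_loadu_si128, _mm_cvtsi128_si32 */
#include <smmintrin.h> /* SSE4.1: _mm_extract_epi32 */

/* Lets GCC/Clang build the function without -msse4.1 (assumption: x86 target). */
#if defined(__GNUC__)
#define TARGET_SSE41 __attribute__((target("sse4.1")))
#else
#define TARGET_SSE41
#endif

/* Hypothetical helper: loads `groups` (1 to 3) complete 4-byte groups and
   zero-fills the remaining lanes, without reading past the end of src. */
static __m128i load_partial_groups(const uint8_t *src, size_t groups)
{
    uint8_t tmp[16] = {0};
    memcpy(tmp, src, groups * 4);
    return _mm_loadu_si128((const __m128i *)tmp);
}

/* Hypothetical helper: pulls the four 32-bit lanes out into scalars. */
TARGET_SSE41
static void extract_indices(__m128i v, uint32_t idx[4])
{
    idx[0] = (uint32_t)_mm_cvtsi128_si32(v); /* lane 0 needs only SSE2 */
    idx[1] = (uint32_t)_mm_extract_epi32(v, 1);
    idx[2] = (uint32_t)_mm_extract_epi32(v, 2);
    idx[3] = (uint32_t)_mm_extract_epi32(v, 3);
}
```

The lane index passed to _mm_extract_epi32 has to be a compile-time constant, which is why the four extractions are written out rather than looped.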