I've been playing with the example from this presentation (slide 41).
As far as I can tell, it performs alpha blending.
MOVQ mm0, alpha //4 16-b zero-padding α
MOVD mm1, A //move 4 pixels of image A
MOVD mm2, B //move 4 pixels of image B
PXOR mm3, mm3 //clear mm3 to all zeroes
//unpack 4 pixels to 4 words
PUNPCKLBW mm1, mm3 // Because B -A could be
PUNPCKLBW mm2, mm3 // negative, need 16 bits
PSUBW mm1, mm2 //(B-A)
PMULHW mm1, mm0 //(B-A)*fade/256
PADDW mm1, mm2 //(B-A)*fade + B
//pack four words back to four bytes
PACKUSWB mm1, mm3
I want to rewrite it in C with inline assembler.
For now, I have something like this:
void fade_mmx(SDL_Surface* im1,SDL_Surface* im2,Uint8 alpha, SDL_Surface* imOut)
{
int pixelsCount = imOut->w * im1->h;
Uint32 *A = (Uint32*) im1->pixels;
Uint32 *B = (Uint32*) im2->pixels;
Uint32 *out = (Uint32*) imOut->pixels;
Uint32 *end = out + pixelsCount;
__asm__ __volatile__ (
"\n\t movd (%0), %%mm0"
"\n\t movd (%1), %%mm1"
"\n\t movd (%2), %%mm2"
"\n\t pxor %%mm3, %%mm3"
"\n\t punpcklbw %%mm3, %%mm1"
"\n\t punpcklbw %%mm3, %%mm2"
"\n\t psubw %%mm2, %%mm1"
"\n\t pmulhw %%mm0, %%mm1"
"\n\t paddw %%mm2, %%mm1"
"\n\t packuswb %%mm3, %%mm1"
: : "r" (alpha), "r" (A), "r" (B), "r" (out), "r" (end)
);
__asm__("emms" : : );
}
When compiling, I get this message: `Error: (%dl) is not a valid base/index expression`, which refers to the first line of the assembler code.
I suspect it's because `alpha` is a `Uint8`. I tried casting it, but then I get a segmentation fault. The example mentions "4 16-b zero-padding α", which is not really clear to me.
Your problem is that you're trying to use your `alpha` value as an address instead of as a value. The `movd (%0), %%mm0` instruction says to use `%0` as a location in memory, so you're asking it to load the value pointed to by `alpha` instead of the value of `alpha` itself. Using `movd %0, %%mm0` would solve that problem, but then you'd run into the problem that your `alpha` value only has an 8-bit type, and it needs to be a 32-bit type to work with the MOVD instruction. You can solve that problem, along with the fact that the `alpha` value needs to be multiplied by 256 and broadcast to all four 16-bit words of the destination register for your algorithm to work, by multiplying it by `0x0100010001000100ULL` and using the MOVQ instruction.

However, you don't need the MOVD/MOVQ instructions at all. You can let the compiler load the values into MMX registers itself by specifying the `y` constraint on the operands of the asm statement. Written that way, there's no need for a clobber list, because no registers are used that weren't allocated by the compiler, and there are no other side effects that the compiler needs to know about. The necessary EMMS instruction should stay out of the per-pixel code, as you wouldn't want to execute it on every pixel. You'll want to insert an `asm("emms");` statement after your loop that blends the two surfaces.

Better yet, you don't need to use inline assembly at all. You can use intrinsics instead, and not have to worry about all the pitfalls of using inline assembly:
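The original listings for these two variants are not reproduced here, so the following is only a minimal sketch of what they could look like, assuming GCC or Clang on x86/x86-64 and 32-bit pixels held in plain arrays rather than SDL surfaces. The names `alpha_words`, `blend_pixel_asm`, `blend_pixel_intrin`, `blend_row_asm` and `blend_row_intrin` are made up for illustration. One caveat: `pmulhw` is a signed multiply, so with the `alpha * 256` scaling the words only stay positive for `alpha` below 128, and the sketch asserts that range instead of trying to handle larger values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <mmintrin.h>  /* MMX intrinsics, x86 and x86-64 only */

/* alpha * 256 copied into each of the four 16-bit words. */
static inline uint64_t alpha_words(uint8_t alpha)
{
    return alpha * 0x0100010001000100ULL;
}

/* Inline-assembly variant: the "y" constraint asks the compiler to put
   each operand in an MMX register, so no MOVD/MOVQ is written by hand.
   "+&y" marks pa and pb as read-write and not shareable with inputs. */
static inline uint32_t blend_pixel_asm(uint32_t a, uint32_t b, uint8_t alpha)
{
    __m64 pa = _mm_cvtsi32_si64((int)a);
    __m64 pb = _mm_cvtsi32_si64((int)b);
    __m64 fade = (__m64)alpha_words(alpha); /* GNU C same-size vector cast */
    __m64 zero = _mm_setzero_si64();

    __asm__("punpcklbw %[zero], %[pa]\n\t" /* bytes of A -> 4 words */
            "punpcklbw %[zero], %[pb]\n\t" /* bytes of B -> 4 words */
            "psubw     %[pb], %[pa]\n\t"   /* A - B                 */
            "pmulhw    %[fade], %[pa]\n\t" /* (A - B) * alpha / 256 */
            "paddw     %[pb], %[pa]\n\t"   /* ... + B               */
            "packuswb  %[zero], %[pa]"     /* 4 words -> 4 bytes    */
            : [pa] "+&y"(pa), [pb] "+&y"(pb)
            : [fade] "y"(fade), [zero] "y"(zero));
    return (uint32_t)_mm_cvtsi64_si32(pa);
}

/* Intrinsics variant of the same instruction sequence. */
static inline uint32_t blend_pixel_intrin(uint32_t a, uint32_t b, uint8_t alpha)
{
    __m64 zero = _mm_setzero_si64();
    __m64 fade = (__m64)alpha_words(alpha);
    __m64 pa = _mm_unpacklo_pi8(_mm_cvtsi32_si64((int)a), zero);
    __m64 pb = _mm_unpacklo_pi8(_mm_cvtsi32_si64((int)b), zero);
    __m64 r = _mm_sub_pi16(pa, pb);
    r = _mm_mulhi_pi16(r, fade);
    r = _mm_add_pi16(r, pb);
    r = _mm_packs_pu16(r, zero);
    return (uint32_t)_mm_cvtsi64_si32(r);
}

/* Blend n pixels, then issue EMMS once after the loop. */
void blend_row_asm(uint32_t *out, const uint32_t *a, const uint32_t *b,
                   size_t n, uint8_t alpha)
{
    assert(alpha < 128); /* keeps alpha * 256 positive for signed pmulhw */
    for (size_t i = 0; i < n; i++)
        out[i] = blend_pixel_asm(a[i], b[i], alpha);
    __asm__ __volatile__("emms");
}

void blend_row_intrin(uint32_t *out, const uint32_t *a, const uint32_t *b,
                      size_t n, uint8_t alpha)
{
    assert(alpha < 128);
    for (size_t i = 0; i < n; i++)
        out[i] = blend_pixel_intrin(a[i], b[i], alpha);
    _m_empty(); /* generates EMMS */
}
```

For example, `alpha_words(64)` is `0x4000400040004000`, which is the "4 16-b zero-padding α" layout from the slide: one 16-bit copy of the scaled alpha per colour channel.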
Similarly to the previous example, you need to call `_m_empty()` after your loop to generate the necessary EMMS instruction.

You should also seriously consider just writing the routine in plain C. Autovectorizers are pretty good these days, and it's likely the compiler can generate better code using modern SIMD instructions than what you're trying to do with ancient MMX instructions. For example, this code:
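The original listing is missing at this point. As a stand-in, here is one way a plain C version could be sketched, assuming 32-bit pixels with four 8-bit channels stored in plain arrays. The names `blend_pixel` and `fade_c` are hypothetical. The per-channel formula is the same `(A - B) * alpha / 256 + B` blend, rearranged as `(A * alpha + B * (256 - alpha)) >> 8` so that everything stays unsigned.

```c
#include <stddef.h>
#include <stdint.h>

/* Blend the four 8-bit channels of two packed 32-bit pixels. */
static inline uint32_t blend_pixel(uint32_t p1, uint32_t p2, uint32_t alpha)
{
    uint32_t out = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        uint32_t c1 = (p1 >> shift) & 0xFF;
        uint32_t c2 = (p2 >> shift) & 0xFF;
        /* Same value as c2 + (c1 - c2) * alpha / 256, rounded down. */
        uint32_t c = (c1 * alpha + c2 * (256 - alpha)) >> 8;
        out |= c << shift;
    }
    return out;
}

/* Blend n pixels of a over b; alpha = 0 yields b unchanged. */
void fade_c(uint32_t *out, const uint32_t *a, const uint32_t *b,
            size_t n, uint8_t alpha)
{
    for (size_t i = 0; i < n; i++)
        out[i] = blend_pixel(a[i], b[i], alpha);
}
```

Since this is not the answer's own listing, the compiler output described next was not measured on this sketch. Whether a given compiler vectorizes it depends on the compiler version and flags.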
With GCC 10.2 and `-O3`, the above code results in assembly code that uses 128-bit XMM registers and blends 4 pixels at a time in its inner loop.

Finally, even an unvectorized version of the C code may be near optimal, as the code is simple enough that you're probably going to be memory bound regardless of how exactly the blend is implemented.