I am trying to write a simple game, and I need to learn some x86 assembly for vector operations. When using an xmm register as 4 packed single-precision floating-point values, are there any aggregate operations? Such as:
"MAXPS" to calculate the max of the 4 fp32 values (used for Chebyshev distance and so on).
"SUMPS" to calculate the sum of the 4 fp32 values (used for a dot product or vector magnitude).
One non-looping, non-branching way to get the maximum float value of an SSE vector would be something like the following.
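A minimal sketch of that idea in C with SSE intrinsics, assuming an x86 target with SSE enabled (the default on x86-64). The function names `hmax_ps` and `hsum_ps` are hypothetical, chosen for illustration. The reduction uses two shuffle-and-combine steps, so it has no loop and no branch. The same pattern with `_mm_add_ps` gives the horizontal sum asked about in the question.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Horizontal max of the 4 floats in v (hypothetical helper name). */
static inline float hmax_ps(__m128 v)
{
    /* Swap neighbours within each pair: [v1, v0, v3, v2] */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* Pairwise max: [max01, max01, max23, max23] */
    __m128 m = _mm_max_ps(v, shuf);
    /* Move the high pair of m into the low lanes: [max23, max23, ...] */
    shuf = _mm_movehl_ps(shuf, m);
    /* Lane 0 now holds max(max01, max23) */
    m = _mm_max_ss(m, shuf);
    return _mm_cvtss_f32(m);
}

/* Horizontal sum of the 4 floats in v, same shuffle pattern with adds. */
static inline float hsum_ps(__m128 v)
{
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    __m128 s = _mm_add_ps(v, shuf);   /* [v0+v1, v0+v1, v2+v3, v2+v3] */
    shuf = _mm_movehl_ps(shuf, s);    /* low lane = v2+v3 */
    s = _mm_add_ss(s, shuf);          /* lane 0 = v0+v1+v2+v3 */
    return _mm_cvtss_f32(s);
}
```

For a Chebyshev distance, one would take the absolute differences of the two vectors first and pass the result to `hmax_ps`; for a dot product, multiply with `_mm_mul_ps` and pass the product to `hsum_ps`.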
...and an AVX version is as follows.
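A hedged sketch of one possible AVX variant, again in C intrinsics. The name `hmax256_ps` and the pointer-based signature are assumptions made for illustration. The `target("avx")` attribute is GCC/Clang-specific and lets this one function use AVX without building the whole file with `-mavx`; with other compilers, drop the attribute and enable AVX through the build flags. The approach folds the upper 128-bit half onto the lower half and then reuses the SSE reduction.

```c
#include <immintrin.h>  /* AVX intrinsics */

/* Horizontal max of 8 floats at p (hypothetical helper name).
   Assumes the CPU supports AVX; p does not need to be aligned. */
__attribute__((target("avx")))
static float hmax256_ps(const float *p)
{
    __m256 v  = _mm256_loadu_ps(p);            /* unaligned load of 8 floats */
    __m128 lo = _mm256_castps256_ps128(v);     /* lanes 0..3 */
    __m128 hi = _mm256_extractf128_ps(v, 1);   /* lanes 4..7 */
    __m128 m  = _mm_max_ps(lo, hi);            /* 8 candidates reduced to 4 */

    /* Same two-step reduction as the 128-bit case */
    __m128 shuf = _mm_shuffle_ps(m, m, _MM_SHUFFLE(2, 3, 0, 1));
    m    = _mm_max_ps(m, shuf);
    shuf = _mm_movehl_ps(shuf, m);
    m    = _mm_max_ss(m, shuf);
    return _mm_cvtss_f32(m);
}
```

Taking a pointer rather than a `__m256` argument keeps callers free of AVX types, so only this function needs AVX code generation.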