I'm implementing conversions between SSE types and I found that implementing int8->int64 widening conversion for pre-SSE4.1 targets is cumbersome.
The straightforward implementation would be:
inline __m128i convert_i8_i64(__m128i a)
{
#ifdef __SSE4_1__
    return _mm_cvtepi8_epi64(a);
#else
    a = _mm_unpacklo_epi8(a, a);
    a = _mm_unpacklo_epi16(a, a);
    a = _mm_unpacklo_epi32(a, a);
    return _mm_srai_epi64(a, 56); // missing intrinsic!
#endif
}
But since _mm_srai_epi64 doesn't exist until AVX-512, there are two options at this point:
- implementing _mm_srai_epi64, or
- implementing convert_i8_i64 in a different way.
I'm not sure which one would be the most efficient solution. Any idea?
                        
The unpacking intrinsics are used here in a funny way. They "duplicate" the data, instead of adding sign-extension, as one would expect. For example, before the first iteration you have in your register the following
If you convert a and b to 16 bits, you should get this:
Here A and B are sign-extensions of a and b, that is, both of them are either 0 or -1.
Instead of this, your code gives
And then you convert it to the proper result by shifting right.
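To make the "duplicate, then shift" behaviour concrete, here is a minimal sketch of just the first stage (8 to 16 bits) using only SSE2. The function name widen_i8_i16_dup is made up for illustration and is not part of the question's code.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: widen the low 8 int8 lanes to int16 the way the
   question's code does it, by duplicating each byte and then shifting. */
static inline __m128i widen_i8_i16_dup(__m128i a)
{
    /* Bytes become a0 a0 a1 a1 ..., so each 16-bit lane holds a_i twice. */
    a = _mm_unpacklo_epi8(a, a);
    /* The arithmetic shift keeps the upper copy and fills in the sign. */
    return _mm_srai_epi16(a, 8);
}
```

This works because _mm_srai_epi16 exists in SSE2; the trouble in the original code only starts at the last stage, where the matching 64-bit shift is missing.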
However, you are not obliged to use the same operand twice in the "unpack" intrinsics. You could get the desired result if you "unpacked" the following two registers:
That is:
(if that _mm_srai_epi8 intrinsic actually existed)
You can apply the same idea to the last stage of your conversion. You want to "unpack" the following two registers:
To get them, right-shift the 32-bit data:
So the last "unpack" is
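Putting the stages together, one way to read the steps above is the following SSE2-only sketch. It is my own assembly of the idea, not necessarily the exact code intended here, and the name convert_i8_i64_sse2 is hypothetical. It keeps the first two "duplicating" unpacks from the question, sign-extends at 32 bits where an arithmetic shift is available, and then unpacks the values with their sign words.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical SSE2 fallback: sign-extend the two lowest int8 lanes of a
   into two int64 lanes, without needing _mm_srai_epi64. */
static inline __m128i convert_i8_i64_sse2(__m128i a)
{
    a = _mm_unpacklo_epi8(a, a);           /* bytes: a0 a0 a1 a1 ...           */
    a = _mm_unpacklo_epi16(a, a);          /* each 32-bit lane: a_i four times */
    a = _mm_srai_epi32(a, 24);             /* 32-bit lanes: sign-extended a_i  */
    __m128i sign = _mm_srai_epi32(a, 31);  /* 0 or -1 per 32-bit lane          */
    return _mm_unpacklo_epi32(a, sign);    /* low dword: value, high: sign     */
}
```

For the first stage, where _mm_srai_epi8 does not exist, a byte sign mask can also be obtained with _mm_cmpgt_epi8(_mm_setzero_si128(), a), which gives 0xFF for negative bytes and 0x00 otherwise. Whether this beats emulating _mm_srai_epi64 on a given CPU is something I would measure rather than assume.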