I have recently done quite a lot of assembler for Rawstudio, and when I found out that GCC also has support for SSE intrinsics, I finally set out to learn how to use them.
I had done quite a lot of inline assembler using GCC's AT&T syntax, and it works OK, though the syntax is pretty horrible. It also has some serious input restrictions: on x86-32 you only have 5 general purpose registers available. This wouldn't normally be a problem, but when you do simultaneous 32 and 64 bit versions you don't know the size of a pointer, so passing an array of pointers becomes very tedious.
So for the Rawstudio vertical resampler, I decided to take the plunge and look into assembler intrinsics. The first version was just to learn the basic syntax, and involved a rather naive int -> float -> int conversion. The generated assembler on x86-64 was decent, and matched the reference (integer) performance. The second version was strictly integer, with 24 elements per pixel, and it far outperformed the C implementation.
One specific issue I encountered was doing SSE2 operations on 16 bit unsigned data, since there is no way of multiplying anything with more precision than 16 bit _signed_ data, but I will touch on that in a separate post.
Back to intrinsics. Even though I have been very sceptical about the concept, I must admit that it allows for much greater complexity with very little effort. While you have to let go of control over the exact assembler being generated, it does make C/C++ integration much easier, and you spend a lot less time chasing pointer errors and writing tedious loop code.
My next project was a much more ambitious DNG Color Profile processor. It involves an RGB -> HSV conversion, applying a trilinear interpolated 3D lookup table to the HSV data, processing Whitebalance, Exposure, Hue and Saturation, HSV -> RGB, so a very complex task. The reference implementation was completely done in float, so for starters, I thought I'd do the same.
The implementation processes four pixels in parallel, using one XMM register for each component. This proved to work very well, since you get both the advantages of planar (doing the same operations on all 4 components at the same time), and interleaved processing (have all components 'nearby').
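To illustrate the layout described above, here is a minimal sketch, not the actual Rawstudio code. It assumes the input is interleaved RGBA floats (an assumption on my part), and the helper name value_of_4_pixels is made up for this example. It transposes four pixels so that each XMM register holds one component, and then computes V = max(R, G, B), the first step of an RGB -> HSV conversion, for all four pixels at once.

```c
#include <emmintrin.h>  /* SSE2 intrinsics, also pulls in the SSE ones */

/* Hypothetical helper: V = max(R, G, B) for four pixels at once.
   rgba points to 16 floats laid out as R G B A, R G B A, ... (assumed layout). */
static void value_of_4_pixels(const float *rgba, float *v_out)
{
    /* After these loads, each register holds R G B A of a single pixel. */
    __m128 p0 = _mm_loadu_ps(rgba + 0);
    __m128 p1 = _mm_loadu_ps(rgba + 4);
    __m128 p2 = _mm_loadu_ps(rgba + 8);
    __m128 p3 = _mm_loadu_ps(rgba + 12);

    /* Transpose in place: p0 = R0..R3, p1 = G0..G3, p2 = B0..B3, p3 = A0..A3. */
    _MM_TRANSPOSE4_PS(p0, p1, p2, p3);

    /* One instruction now handles the same component of all four pixels. */
    __m128 v = _mm_max_ps(p0, _mm_max_ps(p1, p2));
    _mm_storeu_ps(v_out, v);
}
```

From here on, every step of the conversion works on whole registers, while R, G and B of the same pixel always sit in the same lane of neighbouring registers.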
I did however notice a few gotchas:
1) Use _mm_set_X(a,b,c,d) sparingly.
GCC tends to use a "movss" combined with "pshuf" if a = b = c = d, and a combination of "mov" + unpack if they are not. If you are using constants, write them to an aligned variable and use _mm_load_X(ptr) instead, since that has a much shorter dependency chain.
The only case where I found _mm_set to be faster was to transfer lookup values to xmm registers.
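As a rough sketch of the two variants (the names kScale, scale_with_set and scale_with_load are invented for this example), the same multiplication can be written with the constant built in registers or loaded from an aligned variable. Whether the generated code actually differs depends on the GCC version and optimisation flags, so it is worth comparing the assembler output of both.

```c
#include <xmmintrin.h>  /* SSE intrinsics */

/* Hypothetical constant; _mm_load_ps requires 16-byte alignment. */
static const float kScale[4] __attribute__((aligned(16))) =
    { 0.5f, 0.5f, 0.5f, 0.5f };

/* Variant A: let the compiler decide how to build the constant. */
static __m128 scale_with_set(__m128 x)
{
    return _mm_mul_ps(x, _mm_set1_ps(0.5f));
}

/* Variant B: a single aligned load from memory. */
static __m128 scale_with_load(__m128 x)
{
    return _mm_mul_ps(x, _mm_load_ps(kScale));
}
```

Both functions return the same values; only the way the constant reaches the register differs.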
2) GCC intrinsics on i386.
A rather silly thing about intrinsics in GCC is that they require the "-msse2" switch to be present when compiling on i386 machines. The problem with this is that this switch also allows GCC to emit SSE2 code from ordinary C code, which will obviously crash on machines without SSE2. My good friend Anders suggested that we should put the SSE2 code in a separate C file, compile only that file with the switch, and link the two together. While this workaround should be able to do it, it seems quite silly that you cannot do runtime detection of SSE2, and just go from there.
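For what it is worth, newer GCC releases offer a different route than the separate-file workaround. The sketch below is not what was done for Rawstudio. It assumes GCC 4.9 or later, where the intrinsic headers can be included without "-msse2", and it uses the target function attribute together with __builtin_cpu_supports. The function names are hypothetical, and the kernel is a deliberately trivial 16 bit add.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Only this function is compiled with SSE2 enabled.
   Assumes n is a multiple of 8. */
__attribute__((target("sse2")))
static void add_u16_sse2(unsigned short *dst, const unsigned short *src, int n)
{
    for (int i = 0; i < n; i += 8) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi16(a, b));
    }
}

/* Plain C fallback for machines without SSE2. */
static void add_u16_c(unsigned short *dst, const unsigned short *src, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = (unsigned short)(dst[i] + src[i]);
}

/* Runtime dispatch: pick the SSE2 version only if the CPU reports support. */
static void add_u16(unsigned short *dst, const unsigned short *src, int n)
{
    if (__builtin_cpu_supports("sse2"))
        add_u16_sse2(dst, src, n);
    else
        add_u16_c(dst, src, n);
}
```

The rest of the file is still compiled for plain i386, so GCC will not emit SSE2 instructions for the ordinary C code.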
3) Debugging intrinsics
Coming from Visual Studio, debugging in GDB is a real pain in the ***. Furthermore, its support for intrinsics, or any assembler for that matter, is virtually non-existent. Breakpoints on intrinsics are largely ignored, you get no intrinsic name -> register map, etc. I had to resort to using printf's most of the time, though that was actually quite a bit easier with intrinsics than with inline assembler.
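A printf helper for this kind of debugging can be as small as the sketch below. The names format_ps and dump_ps are made up for this example; the idea is simply to store the register to a temporary array and print the four lanes, lowest lane first.

```c
#include <stdio.h>
#include <emmintrin.h>  /* SSE2 intrinsics, also pulls in the SSE ones */

/* Hypothetical helper: formats the four float lanes of v into buf,
   lowest lane first. Returns what snprintf returns. */
static int format_ps(char *buf, size_t len, const char *name, __m128 v)
{
    float f[4];
    _mm_storeu_ps(f, v);  /* unaligned store, so f needs no special alignment */
    return snprintf(buf, len, "%s = [%g, %g, %g, %g]",
                    name, f[0], f[1], f[2], f[3]);
}

/* Hypothetical helper: prints one register on its own line. */
static void dump_ps(const char *name, __m128 v)
{
    char buf[128];
    format_ps(buf, sizeof buf, name, v);
    puts(buf);
}
```

Calling dump_ps("hue", h) between two intrinsics is enough to see what a value looks like at that point, which is much harder to do in the middle of an inline assembler block.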
The generated 64 bit code looks quite nice, with good instruction pairing - the irony being that the only processor that doesn't operate out-of-order is the Intel Atom, which doesn't run 64 bit code.
Other than that, the 32 bit SSE2 code obviously looks hideous, with frequent spills to the stack, but to be honest the code wasn't designed for 8 XMM registers, so that's to be expected.
In the end, the assembler ended up at about twice the speed of the regular C code. The remaining time is probably mostly down to the large number of table lookups, which don't get faster by doing SSE. I can't really see how I could have done this assembler in this time without intrinsics, because of the sheer complexity.