sh0dan // VoxPod

Tuesday, October 13, 2009

Intrinsics in GCC

I have recently done quite a lot of assembler for Rawstudio, and when I found out that GCC also has support for SSE intrinsics, I finally set out to learn how to use them.

I had done quite a lot of inline assembler using GCC's AT&T syntax, and it works OK, though the syntax is pretty horrible. It also has some serious input restrictions, with only 5 general purpose registers available when you do x86-32 versions. This wouldn't normally be a problem, but when you do simultaneous 32 and 64 bit versions you don't know the size of a pointer, so passing an array of pointers becomes very tedious.

So for the Rawstudio vertical resampler, I decided to take the plunge and look into assembler intrinsics. The first version was just to learn the basic syntax, and involved a rather naive int -> float -> int conversion. The generated assembler on x86-64 was decent, and matched the reference (integer) performance. The second version was strictly integer, with 24 elements per pixel, and it far outperformed the C implementation.

One specific issue I encountered was a problem with doing SSE2 operations on 16 bit unsigned data, since there is no way of multiplying anything with more precision than 16 bit _signed_ data, but I will touch on that in a separate post.

Back to intrinsics. Even though I have been very sceptical about them as a concept, I must admit that they allow for much greater complexity with very little effort. While you have to let go of the exact assembler generation, it does make C/C++ integration much easier, and you spend a lot less time chasing pointer errors and writing tedious loop code.

My next project was a much more ambitious DNG Color Profile processor. It involves an RGB -> HSV conversion, applying a trilinear interpolated 3D lookup table to the HSV data, processing Whitebalance, Exposure, Hue and Saturation, HSV -> RGB, so a very complex task. The reference implementation was completely done in float, so for starters, I thought I'd do the same.

The implementation processes four pixels in parallel, using one XMM register for each component. This proved to work very well, since you get both the advantages of planar (doing the same operations on all 4 components at the same time), and interleaved processing (have all components 'nearby').
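The layout described above can be sketched as follows. This is only an illustration, not the actual Rawstudio code: the function name value_chroma_x4 and the choice of operation (the max and max - min step at the start of an RGB -> HSV conversion) are assumptions made for the example. Each __m128 holds the same component of four different pixels.

```c
#include <emmintrin.h>  /* SSE2 header; also provides the SSE float intrinsics */

/* Hypothetical helper: for four pixels at once, compute
 * value = max(r, g, b) and chroma = max - min.
 * r, g and b each point to 4 floats, one per pixel.
 * Unaligned loads/stores are used so the caller has no alignment duties. */
void value_chroma_x4(const float *r, const float *g, const float *b,
                     float *value, float *chroma)
{
    __m128 vr = _mm_loadu_ps(r);   /* R of pixels 0..3 */
    __m128 vg = _mm_loadu_ps(g);   /* G of pixels 0..3 */
    __m128 vb = _mm_loadu_ps(b);   /* B of pixels 0..3 */

    __m128 vmax = _mm_max_ps(vr, _mm_max_ps(vg, vb));
    __m128 vmin = _mm_min_ps(vr, _mm_min_ps(vg, vb));

    _mm_storeu_ps(value, vmax);
    _mm_storeu_ps(chroma, _mm_sub_ps(vmax, vmin));
}
```

Every instruction does useful work on all four pixels, and the three components of a pixel are still at hand in neighbouring registers when a later step needs them together.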

I did however notice a few gotchas:

1) Use _mm_set_X(a,b,c,d) sparingly.
GCC tends to use a "movss" combined with "pshuf" if a = b = c = d, and a combination of "mov" + unpack if they are not. If you are using constants, write them to an aligned variable and use _mm_load_X(ptr) instead, which has a much shorter dependency chain.

The only case where I found _mm_set to be faster was to transfer lookup values to xmm registers.
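A minimal sketch of the two ways of getting a constant into a register. The names kHalf, scale_set and scale_load are made up for the example, and the aligned attribute is GCC syntax. Both functions compute the same result; the difference is only in how the constant may be materialised, and newer compilers may well generate identical code for both.

```c
#include <emmintrin.h>

/* Constant kept in 16-byte aligned memory so one aligned load fetches it. */
static const float kHalf[4] __attribute__((aligned(16))) =
    { 0.5f, 0.5f, 0.5f, 0.5f };

/* Variant 1: constant built with _mm_set1_ps (may become movss + shuffle). */
void scale_set(float *px)
{
    __m128 v = _mm_loadu_ps(px);
    _mm_storeu_ps(px, _mm_mul_ps(v, _mm_set1_ps(0.5f)));
}

/* Variant 2: constant fetched with a single aligned load. */
void scale_load(float *px)
{
    __m128 v = _mm_loadu_ps(px);
    _mm_storeu_ps(px, _mm_mul_ps(v, _mm_load_ps(kHalf)));
}
```

Comparing the output of "gcc -O2 -S" for the two functions is the easiest way to see which one your compiler version handles better.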

2) GCC intrinsics on i386.
A rather silly thing about intrinsics in GCC is that they require the "-msse2" switch to be present when compiling on i386 machines. The problem with this is that this switch also allows GCC to emit SSE2 code from ordinary C code, which will obviously crash on non SSE2 capable machines. My good friend Anders suggested that we should put the SSE2 code in a separate C-file and link them together. While this workaround should be able to do it, it seems quite silly that you cannot do runtime detection of SSE2, and just go from there.
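For reference, here is a sketch of what runtime detection can look like in a single file. It is not what was available for the setup described above: it assumes a newer GCC (roughly 4.9 or later), where the target function attribute can enable SSE2 for one function only and __builtin_cpu_supports can query the CPU. The function names and the 16 bit add are invented for the example.

```c
#include <emmintrin.h>

/* Plain C fallback, safe on any CPU. */
static void add_c(const short *a, const short *b, short *out, int n)
{
    for (int i = 0; i < n; i++)
        out[i] = (short)(a[i] + b[i]);
}

/* SSE2 is enabled for this function only, not for the rest of the file. */
__attribute__((target("sse2")))
static void add_sse2(const short *a, const short *b, short *out, int n)
{
    int i = 0;
    for (; i + 8 <= n; i += 8) {           /* 8 x 16 bit per XMM register */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(out + i), _mm_add_epi16(va, vb));
    }
    for (; i < n; i++)                     /* leftover elements */
        out[i] = (short)(a[i] + b[i]);
}

/* Pick the implementation at runtime. */
void add_dispatch(const short *a, const short *b, short *out, int n)
{
    if (__builtin_cpu_supports("sse2"))
        add_sse2(a, b, out, n);
    else
        add_c(a, b, out, n);
}
```

With an older compiler, the separate C file built with "-msse2" remains the way to go, combined with a CPU check before calling into it.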

3) Debugging intrinsics
Coming from Visual Studio, debugging in GDB is a real pain in the ***. Furthermore, its support for intrinsics, or any assembler for that matter, is virtually non-existent. Breakpoints on intrinsics are largely ignored, you get no intrinsic name -> register map, etc. I had to resort to using printf calls most of the time, though that was actually quite a bit easier with intrinsics, compared to inline assembler.
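A sketch of the kind of printf helper that makes this bearable. The names format_ps and print_ps are hypothetical; the idea is simply to store the register to memory and print the four lanes, lowest lane first.

```c
#include <stdio.h>
#include <emmintrin.h>

/* Hypothetical debug helper: write the four float lanes of v into buf,
 * lowest lane first. Returns what snprintf returns. */
int format_ps(char *buf, size_t size, const char *name, __m128 v)
{
    float f[4];
    _mm_storeu_ps(f, v);   /* copy the XMM register to memory */
    return snprintf(buf, size, "%s = [%g %g %g %g]",
                    name, f[0], f[1], f[2], f[3]);
}

/* Convenience wrapper that prints straight to stdout. */
void print_ps(const char *name, __m128 v)
{
    char buf[128];
    format_ps(buf, sizeof buf, name, v);
    puts(buf);
}
```

Calling print_ps("hue", _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f)) prints "hue = [1 2 3 4]", since _mm_set_ps takes its arguments from the highest lane down to the lowest.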

The generated 64 bit code looks quite nice, with good instruction pairing - the irony being that the only processor that doesn't operate out-of-order is the Intel Atom, which doesn't run 64 bit code.

Other than that, the 32 bit SSE2 code obviously looks hideous, with frequent spills to the stack, but to be honest the code wasn't designed for 8 XMM registers, so that's to be expected.

In the end, the assembler ended up at about twice the speed of regular C code. The remaining time is probably mostly spent on the large number of table lookups, which don't get faster by doing SSE. I can't really see how I could have done this assembler in this time without intrinsics, because of the sheer complexity.

Sunday, May 31, 2009

MDX WikiParser v1.1

Here is a revised version of the Wikipedia to mDict converter.

Download: Wikiparser v1.1.

Changes from v1.0 to 1.1:
  • New DB design. Faster, takes a bit more space.
  • Better CPU scaling.
  • Fixed locked schema name - now you truly can call your DB schema what you like.
  • Fixed table formatting bug.
  • Result is 200% faster indexing and 50% faster processing on Quad-Core.

Monday, May 25, 2009

Wikipedia for mDict - May 2009

Here are the updated Wikipedias for mDict.

Languages:
  • English
  • German
  • Spanish
  • Portuguese
  • Russian
  • Chinese
  • Japanese
  • Polish
  • Italian
  • Danish (direct download below)

Download from LegalTorrents.

Changes:
  • Title added to all pages.
  • Spanish version added.
  • Chinese version added.
  • English "Jumbo" edition added, with 2.3 million full articles.

Since the Danish wikipedia is so small, here is a direct download link for the Danish Wikipedia.

If you do not have access to BitTorrent downloads, here is a mini version of the English Wikipedia (155,000 articles, very abbreviated).

If you are interested in helping with further updates, please contact me.

Thursday, February 12, 2009

Introducing RawSpeed

I have spent quite a lot of my spare time in the last two months creating a file loader for RAW files. My primary motivation, besides general interest, was that I was frustrated that there weren't any fast open source loaders out there.

So my main objectives were to make a very fast loader that worked for 75% of the cameras out there, and was able to decode a RAW file at close to the optimal speed. The last 25% of the cameras out there could be serviced by a more generic loader, or convert their images to DNG - which as a sidenote usually compresses better than your camera.

So far, I have support for the latest cameras from Canon, Nikon, Sony, Pentax and Olympus, besides DNG support. I plan to add Panasonic cameras to the mix, but that's about it. RawSpeed has been integrated into the upcoming version of Rawstudio, which is being maintained by my friend Anders Brander.

My natural choice of language was C++, and while I considered adding assembler, I haven't seen any obvious spots where it would help more than a few percent. Staying with plain C++ also makes it fairly easy to port, as a side bonus. My C++ style is "object-oriented C", with carefully chosen STL use, inspired by AviSynth, so I knew it wouldn't hinder performance much.

Compared to other raw decoders out there, I decided from the start to completely forego file streams. The image must be fully loaded into memory before decoding can start. This first of all avoids many system calls and non-serial IO, and furthermore makes it possible to have threaded IO and decoding, by having an IO thread.
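The "load everything first" idea can be sketched like this. It is written in C to keep the example short and is not RawSpeed's actual interface; load_file is a hypothetical helper. The whole file ends up in one buffer, and a decoder would then work on plain memory with no further IO calls.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read a whole file into a malloc'ed buffer.
 * Returns NULL on any error; on success *size_out holds the byte count
 * and the caller must free() the buffer. */
unsigned char *load_file(const char *path, size_t *size_out)
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return NULL;

    long size = -1;
    if (fseek(f, 0, SEEK_END) == 0)
        size = ftell(f);
    if (size < 0 || fseek(f, 0, SEEK_SET) != 0) {
        fclose(f);
        return NULL;
    }

    /* Allocate at least one byte so an empty file is not reported as an error. */
    unsigned char *buf = malloc(size > 0 ? (size_t)size : 1);
    if (buf && fread(buf, 1, (size_t)size, f) != (size_t)size) {
        free(buf);
        buf = NULL;
    }
    fclose(f);

    if (buf)
        *size_out = (size_t)size;
    return buf;
}
```

Since the read is a single large sequential request, it is also easy to move onto a separate IO thread that hands finished buffers to the decoder.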

I feel like I must mention the inspiration. Obviously the closest relative is libopenraw. It seems like a nice project. I considered contributing to the project, but since I had already made a TIFF parser, and making it "streamless" would make it almost a complete rewrite, I decided I might as well write my own. At least then I cannot blame bugs on other people.

The other well-known project out there is Dave Coffin's dcraw. David has put an incredible amount of work into reverse-engineering the strangest formats out there and creating a solid application that works for just about any camera out there. While he has a radically different coding style than most of what you find out there, the amount of work and dedication put into this project is very admirable.

So far the only thing you can see is the Subversion repository, but I will make a formal release when the time is right. If you want to follow the development, you can subscribe to the rawstudio commit mailing list, where my updates are also posted.

That's it for now. Next time I hope to be able to tell you a bit about what the library actually contains.

Thursday, October 16, 2008

MDX Wikiparser 1.0

Here is version 1.0 of my wiki parser. It is still rather rough, but since you can download the most popular wikipedias below, it is only here for reference.

Requires JRE or JDK 1.5 or later to be installed.
Requires MySQL 5.0 or later installed.

Create a new schema in your database called wikindex or something similar.

To run the project from the command line, go to the dist folder and
type the following:

java -jar "WikiParser.jar" [parameters] "InputFile.xml.bz2" "OutputFile.txt"

I have also included a sample bat file you can use as a basis for your own conversions. Be sure to adjust --databaseurl --user: and --password, if needed.

For fast indexing, you should have your database on a ramdisk, or an SSD, if you have one. You can find a good free ramdisk for 2000, XP and Vista here. 500MB should be enough for the English Wikipedia.

Download WikiParser v1.0.