Every couple of years I have an encounter with assembly programming. It's funny how rules that applied years ago are useless now. The most recent encounter lasted about two weeks and resulted in a 600x speedup in a critical function. But, all wasn't nice and rosy: it was more difficult than I initially planned, took more time and provided a few surprises.
Key takeaway points, so that I can remember them and so that people googling for answers may find them:
- If you're looking for the PSRLB (parallel shift right logical bytes) SSE instruction, it isn't there. But there are two ways around it: you can either shift words using PSRLW and then mask out the higher bits, or for shifts with a count of one, use (xmm14 contains 1 in every byte and xmm15 is 0):
psubusb xmm0, xmm14 pavgb xmm0, xmm15
- If you need to "horizontally" sum 16 bytes in an XMM register, you will find that the PHADDB instruction doesn't exist, either. There is PHADDW and you could use that in combination with PMADDUBSW (multiply-add bytes to words), but the resulting sequence of instructions is far from optimal. Fortunately, there is a trick: use PSADBW. This computes the sum of absolute differences, which if you use 0 as the source parameter will correspond to your sum, and stores it in two quadwords, which gets you halfway there. In my case, I simply accumulated the results using two quadwords per register, and combined them at the end.
- There is a nice PMOVMSKB instruction which converts a byte mask to a bit mask. But why, oh why isn't there an instruction which does the opposite? Extracting a 16-bit mask to a 16-byte register turns out to be painful.
- The last time I programmed in x86 assembly was using a Pentium 4 with the infamous NetBurst architecture. It was an ugly, unpredictable beast, where a mispredicted branch could cost you a fortune in performance terms. It seems that with the newer Nehalem chips Intel really got things right -- latencies for most instructions are small and predictable and overall performance is more consistent across the board. There are fewer traps. And unaligned data accesses aren't penalized as badly as before!
- LOOP is slower than
sub rcx, 1 jnz .loop
- Thank God and AMD for FINALLY adding registers. Back in the P4 days it was ridiculous: having a 3GHz processor with only 6 usable general-purpose registers and 8 SIMD ones sounded like a joke.
And the final observation: just as several years ago, the state of x86 assemblers is a sad, sad affair. To use a construction industry metaphor, an average x86 assembler has the complexity and usefulness of a hammer, while the DSP world is using high-speed mag-rail blast-o-matic nail guns with automatic feeders and superconducting magnets. I mean, seriously, do I really have to manually track register allocations?! Manually reorder instructions and measure performance to see which arrangement is faster (hoping not to break any dependencies)? Manually update stack pointer offsets after pushing something onto the stack? Write prologs and epilogs for C-linkable functions myself?
If anybody is thinking about writing or improving an x86 assembler, take a look at what Texas Instruments provides for their DSPs. See how you can write "linear assembly" and have your compiler schedule VLIW execution units for you. See how you don't need a piece of paper with a huge table detailing which registers are used in which part of your code.
I find it ridiculous that the most popular computing platform in the world does not have a decent assembler. What's even worse, from the discussions I've seen on the net, people are mostly interested in how fast the assembler is (?!) rather than how much time it saves the programmer.
Anyway, the net result of this encounter is a function that is about 600x faster than the original C implementation. It is about 4x slower than the theoretical limit (calculated assuming only arithmetic ops, no overhead, no memory accesses, and 16 ops per cycle), which I'm very happy with.
x86 assembly, see you in several years!
UPDATE (22.12.2009): I wrote this post hoping that it will help people searching for the non-existing PSRLB instruction -- and it worked -- I can already see it in the logs!