Jan Rychter: blog (electronics, programming, technology)

x86 assembly encounter


Every couple of years I have an encounter with assembly programming. It's funny how rules that applied years ago are useless now. The most recent encounter lasted about two weeks and resulted in a 600x speedup in a critical function. But all wasn't nice and rosy: it was more difficult than I had initially expected, took more time, and provided a few surprises.

Key takeaway points, so that I can remember them and so that people googling for answers may find them:

  • If you're looking for the PSRLB (parallel shift right logical bytes) SSE instruction, it isn't there. But there are two ways around it: you can either shift words using PSRLW and then mask out the higher bits, or, for shifts with a count of one, use the following (assuming xmm14 contains 1 in every byte and xmm15 is all zeros):
   psubusb xmm0, xmm14
   pavgb xmm0, xmm15
  • If you need to "horizontally" sum 16 bytes in an XMM register, you will find that the PHADDB instruction doesn't exist, either. There is PHADDW, and you could use that in combination with PMADDUBSW (multiply-add bytes to words), but the resulting sequence of instructions is far from optimal. Fortunately, there is a trick: use PSADBW. This computes the sum of absolute differences, which, if you use 0 as the source operand, is simply the sum of your bytes. The result is stored as two quadwords, which gets you halfway there. In my case, I simply accumulated the results using two quadwords per register and combined them at the end.
  • There is a nice PMOVMSKB instruction which converts a byte mask to a bit mask. But why, oh why isn't there an instruction which does the opposite? Extracting a 16-bit mask to a 16-byte register turns out to be painful.
  • The last time I programmed in x86 assembly was using a Pentium 4 with the infamous NetBurst architecture. It was an ugly, unpredictable beast, where a mispredicted branch could cost you a fortune in performance terms. It seems that with the newer Nehalem chips Intel really got things right -- latencies for most instructions are small and predictable and overall performance is more consistent across the board. There are fewer traps. And unaligned data accesses aren't penalized as badly as before!
  • LOOP is slower than
   sub rcx, 1
   jnz .loop

Go figure.

  • Thank God and AMD for FINALLY adding more registers. Back in the P4 days it was ridiculous: a 3GHz processor with only 6 usable general-purpose registers and 8 SIMD registers sounded like a joke.
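Both PSRLB workarounds above are easy to sanity-check outside of assembly. Here is a small Python model of the two tricks (the function names are mine, and this only mirrors the byte-wise semantics of PSRLW/PAND and PSUBUSB/PAVGB; it is not the SSE code itself):

```python
def psrlb_via_psrlw(data, count):
    """Emulate a byte-wise logical right shift (the missing PSRLB):
    treat bytes pairwise as little-endian 16-bit words (PSRLW), shift,
    then mask out bits that leaked in from the neighbouring byte
    (PAND with 0xFF >> count repeated in every byte)."""
    out = []
    for i in range(0, len(data), 2):
        word = data[i] | (data[i + 1] << 8)  # pack two bytes into a word
        word >>= count                       # PSRLW
        out.append(word & (0xFF >> count))         # low byte, leaked bits cleared
        out.append((word >> 8) & (0xFF >> count))  # high byte
    return out

def psrlb_by_one(data):
    """Emulate a byte shift right by exactly 1 with PSUBUSB/PAVGB.
    PAVGB computes (a + b + 1) >> 1 (average rounded up), so after a
    saturating subtract of 1, averaging with zero yields x >> 1."""
    out = []
    for b in data:
        t = max(b - 1, 0)         # PSUBUSB xmm0, xmm14 (xmm14 = 0x01 in every byte)
        out.append((t + 1) >> 1)  # PAVGB xmm0, xmm15 (xmm15 = 0)
    return out
```

Note why the shift-by-one trick works even at the edges: for x = 0 the saturating subtract leaves 0 and the rounded average of (0, 0) is 0, while for any x >= 1 we get ((x - 1) + 1) >> 1 = x >> 1.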
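The PSADBW horizontal-sum trick can be modelled the same way. This is a sketch of the semantics only (in real code the final combine of the two quadwords would be done with something like a shuffle plus PADDQ, or deferred to the end of the loop as described above):

```python
def horizontal_byte_sum(data):
    """Emulate summing 16 bytes with PSADBW against an all-zero source:
    the sum of absolute differences |b - 0| is just the sum of the bytes.
    PSADBW delivers two partial sums, one 64-bit lane per 8-byte half."""
    lo = sum(abs(b - 0) for b in data[:8])  # low quadword of the PSADBW result
    hi = sum(abs(b - 0) for b in data[8:])  # high quadword
    return lo + hi                          # final combine of the two lanes
```

Each 8-byte half sums to at most 8 * 255 = 2040, so the per-lane partial sums fit comfortably in 64 bits, which is what makes accumulating two quadwords per register across many iterations safe.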

And the final observation: just as several years ago, the state of x86 assemblers is a sad, sad affair. To use a construction industry metaphor, an average x86 assembler has the complexity and usefulness of a hammer, while the DSP world is using high-speed mag-rail blast-o-matic nail guns with automatic feeders and superconducting magnets. I mean, seriously, do I really have to manually track register allocations?! Manually reorder instructions and measure performance to see which arrangement is faster (hoping not to break any dependencies)? Manually update stack pointer offsets after pushing something onto the stack? Write prologs and epilogs for C-linkable functions myself?

If anybody is thinking about writing or improving an x86 assembler, take a look at what Texas Instruments provides for their DSPs. See how you can write "linear assembly" and have your compiler schedule VLIW execution units for you. See how you don't need a piece of paper with a huge table detailing which registers are used in which part of your code.

I find it ridiculous that the most popular computing platform in the world does not have a decent assembler. What's even worse, from the discussions I've seen on the net, people are mostly interested in how fast the assembler is (?!) rather than how much time it saves the programmer.

Anyway, the net result of this encounter is a function that is about 600x faster than the original C implementation. It is about 4x slower than the theoretical limit (calculated assuming only arithmetic ops, no overhead, no memory accesses, and 16 ops per cycle), which I'm very happy with.

x86 assembly, see you in several years!

UPDATE (22.12.2009): I wrote this post hoping that it would help people searching for the non-existent PSRLB instruction -- and it worked -- I can already see it in the logs!


The x86 architecture is completely different from the TI DSPs' architecture. x86 grew from a simple 16-bit processor 30 years ago into what it is today while maintaining backward compatibility, and it is a general-purpose processor. TI DSPs are built for digital signal processing, with a much more thought-out architecture. Beyond x86 being an ugly hackjob of an architecture, there are lots of high-level languages that people use for x86, whereas a TI DSP is likely to be programmed in assembly. This means there isn't much demand for a powerful x86 assembler except when people need the best performance possible, and when that's the goal you need to do things like reorder instructions and benchmark, since each CPU performs differently. So while there may be some improvements possible in an x86 assembler, there isn't much demand, and comparisons to TI's DSP assemblers aren't very useful.


I agree that the state of assemblers is fairly sad, but why are you not using intrinsics? They give you all the scheduling and register allocation goodness that you're looking for, and you get to mix in C code for bits that the compiler handles well.


You may want to check out the ASM abstraction layer that x264 uses: http://x264dev.multimedia.cx/?p=191

This has a bunch of nifty features to take some of the pain out of x86 ASM programming. Plus, it aims for easy portability between 32- and 64-bit ASM on Linux, Mac, and Windows. It solves some (but not all) of the problems you outlined.

Multimedia Mike, 2009-12-04

Alan: you made many valid points, but I disagree with the conclusion. As I wrote, I don't see any reason why I should do things like register allocation, stack pointer management, or function epilog/prolog. I gave TI tools as an example of something that is assembly, but on a higher level.

sdt: intrinsics are fine if you're writing something relatively simple, but they are much less useful if your function is complex. In my case it was. Plus, I find it really annoying that they have different names than assembly mnemonics.

Multimedia Mike: oh, I do use that abstraction layer, and I was dismayed to find how thin it is and how little it does! One of my first debugging problems had to do with me not saving rbx in my function (rbx is callee-saved on 64-bit UNIX) -- I fully expected the abstraction layer to do that, given that it requires me to specify how many registers I intend to use. As for portability, I didn't much care, as my only target is 64-bit MacOS and 64-bit Linux anyway, and these systems use the same calling conventions.

It's a step in the right direction, but a really tiny one, unfortunately.

Jan Rychter, 2009-12-04

"LOOP is slower than..."
I think part of the reason is that CPU vendors had to work around a timing loop in Windows 95.

Yuhong Bao, 2009-12-11
