This version is only slightly faster than the code generated by gcc on
my i486, but it is almost half the size. My i386 timing chart indicates
that it should be significantly faster than the gcc code on an i386.
Surprisingly, none of the code in the source tree actually uses this routine.
But I optimized this routine for some image processing programs I wrote, and
I see no reason why everyone else shouldn't share the (admittedly) modest
benefits.