Changed the non-accelerated rgb to ycbcr encoder (rfx_encode.c) to use 32-bit
integer multiplication with shifted factors: 2 times faster
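
A minimal sketch of what "32-bit integer multiplication with shifted factors"
can look like. This is not the code from rfx_encode.c: the coefficients are the
common BT.601/JPEG ones (an assumption, the RemoteFX matrix may differ
slightly), and the names rgb_to_ycbcr_fixed and ycbcr_t are hypothetical. Each
factor is pre-multiplied by 2^16, so one integer multiply and one shift replace
a float multiply.

```c
#include <stdint.h>

/* Hypothetical result type: y in 0..255, cb/cr centered on zero. */
typedef struct {
    int32_t y;
    int32_t cb;
    int32_t cr;
} ycbcr_t;

/* Factors are BT.601 coefficients scaled by 2^16 (assumed, not taken
 * from rfx_encode.c). Each row sums to 65536 (luma) or 0 (chroma). */
static ycbcr_t rgb_to_ycbcr_fixed(uint8_t r, uint8_t g, uint8_t b)
{
    const int32_t half = 1 << 15;    /* rounding term: 0.5 in 16.16 */
    const int32_t bias = 128 << 16;  /* keeps chroma sums non-negative,
                                        so the right shift is well-defined */
    ycbcr_t out;

    out.y  = (19595 * r + 38470 * g + 7471 * b + half) >> 16;
    out.cb = ((-11058 * r - 21710 * g + 32768 * b + bias + half) >> 16) - 128;
    out.cr = ((32768 * r - 27439 * g - 5329 * b + bias + half) >> 16) - 128;
    return out;
}
```

For pure red (255, 0, 0) this yields y = 76, cb = -43, cr = 128, which matches
the rounded floating point result.
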
The accelerated SSE2 rgb to ycbcr encoder (rfx_sse2.c) was completely changed
and simplified in order to make use of the SSE2 signed 16-bit integer
multiplication: 2 times faster
Also modified the non-accelerated ycbcr to rgb decoder (rfx_decode.c) to use
32-bit integer multiplications with shifted factors instead of floating point
multiplications: 3 times faster
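
The decoder direction can be sketched the same way. Again this is only an
illustration under assumptions: the factors are the BT.601/JPEG inverse
coefficients scaled by 2^16, the inputs are taken as y in 0..255 and cb/cr
centered on zero, and the names descale_clamp and ycbcr_to_rgb_fixed are
hypothetical, not the actual decoder functions.

```c
#include <stdint.h>

/* Assumed BT.601 inverse factors, scaled by 2^16. */
enum {
    CR_TO_R = 91881,   /* 1.402    * 65536 */
    CB_TO_G = 22553,   /* 0.344136 * 65536 */
    CR_TO_G = 46802,   /* 0.714136 * 65536 */
    CB_TO_B = 116130   /* 1.772    * 65536 */
};

/* Round a 16.16 fixed-point value to an integer and clamp it to 0..255.
 * The temporary bias of 256 keeps the shifted operand non-negative. */
static uint8_t descale_clamp(int32_t v)
{
    int32_t r = ((v + (256 << 16) + (1 << 15)) >> 16) - 256;
    if (r < 0)
        return 0;
    if (r > 255)
        return 255;
    return (uint8_t)r;
}

/* Hypothetical helper: cb and cr are expected in -128..128. */
static void ycbcr_to_rgb_fixed(uint8_t y, int cb, int cr, uint8_t rgb[3])
{
    int32_t base = (int32_t)y << 16;  /* luma in 16.16 fixed point */

    rgb[0] = descale_clamp(base + CR_TO_R * cr);
    rgb[1] = descale_clamp(base - CB_TO_G * cb - CR_TO_G * cr);
    rgb[2] = descale_clamp(base + CB_TO_B * cb);
}
```

Note that the multiplication happens at full 32-bit precision and the shift is
applied only once, to the final sum.
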
The current ycbcr decoder was losing some bits because cr/cb was multiplied by
the shifted factors.
Instead one should multiply by the non-shifted factors and shift the result.
The effects of these lost bits are easily seen by comparing the colors of a
RemoteFX session with the colors of a plain RDP session - they are just wrong ;)
I've removed the bit-magic from the non-accelerated version (rfx_decode.c)
and replaced it with simple float multiplications using the compiler's implicit
integer conversions. On several test machines this was even a little bit faster.
The accelerated SSE2 ycbcr decoder (rfx_sse2.c) was completely changed in order
to make use of the SSE2 signed 16-bit integer multiplication.
Fortunately the factors in the conversion matrix are so small that we can
easily shift them to the maximum possible 16-bit signed integer value without
losing any information and use _mm_mulhi_epi16 which takes the upper 16 bits
of the 32-bit result.
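
A rough sketch of the _mm_mulhi_epi16 idea, not the code from rfx_sse2.c. It
assumes the factor 1.402 (BT.601 Cr to R; the real matrix may differ), which
fits into a signed 16-bit value when shifted left by 14 bits (22970). Since
_mm_mulhi_epi16 drops the lower 16 bits, the input is pre-shifted by 2 bits to
compensate. The helper names mulhi16 and scale_cr_x8 are hypothetical; mulhi16
is a scalar model of one lane and serves as a fallback where SSE2 is not
available.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

#define CR_TO_R_Q14 22970  /* 1.402 * 2^14, assumed factor */

/* Scalar model of one _mm_mulhi_epi16 lane: upper 16 bits of the signed
 * 32-bit product, i.e. floor(a * b / 65536). */
static int16_t mulhi16(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * (int32_t)b;
    if (p >= 0)
        return (int16_t)(p / 65536);
    return (int16_t)(-((-p + 65535) / 65536));  /* floor for negatives */
}

/* Multiply eight cr values (expected range -128..127) by about 1.402. */
static void scale_cr_x8(const int16_t cr[8], int16_t out[8])
{
#if defined(__SSE2__)
    __m128i v = _mm_loadu_si128((const __m128i *)cr);
    v = _mm_slli_epi16(v, 2);                        /* 2 + 14 = 16 bits */
    v = _mm_mulhi_epi16(v, _mm_set1_epi16(CR_TO_R_Q14));
    _mm_storeu_si128((__m128i *)out, v);
#else
    int i;
    for (i = 0; i < 8; i++)
        out[i] = mulhi16((int16_t)(cr[i] * 4), CR_TO_R_Q14);
#endif
}
```

The result is rounded toward negative infinity, so -1 maps to -2 rather than
-1; keeping a few fractional bits in the input reduces that error.
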
The SSE2 ycbcr decoder is now much simpler and about 40 percent faster.