o combine color loss recovery, chroma supersampling and color space conversion
in one step
o define private struct and hide the internal buffer
o make internal buffer reusable in the same session
o the decoded argb buffer can be reused to enhance performance
o pass width, height and bpp through nsc_process_message() call
o rename nsc_context_destroy to nsc_context_free and make it actually free the context
Changed the non-accelerated rgb to ycbcr encoder (rfx_encode.c) to use 32-bit
integer multiplication with shifted factors: 2 times faster
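For illustration, a minimal sketch of integer multiplication with shifted
factors. This is not the code in rfx_encode.c: the function and type names are
hypothetical, and the BT.601-style coefficients, the 16-bit shift and the
8-bit offset output are assumptions chosen for the example.

```c
#include <stdint.h>

/* Hypothetical 8-bit result type; Cb and Cr are stored with a +128 offset. */
typedef struct { uint8_t y, cb, cr; } ycbcr8_t;

static uint8_t clamp_u8(int32_t v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* Each factor is the real coefficient multiplied by 2^16 and rounded,
 * e.g. 0.299 * 65536 = 19595. The sum is shifted back down at the end. */
static ycbcr8_t rgb_to_ycbcr_fixed(uint8_t r, uint8_t g, uint8_t b)
{
    const int32_t half = 1 << 15;   /* rounds to nearest before the shift */
    const int32_t bias = 128 << 16; /* keeps the chroma sums non-negative */
    ycbcr8_t out;

    out.y  = clamp_u8(( 19595 * r + 38470 * g +  7471 * b + half) >> 16);
    out.cb = clamp_u8((-11058 * r - 21710 * g + 32768 * b + bias + half) >> 16);
    out.cr = clamp_u8(( 32768 * r - 27439 * g -  5329 * b + bias + half) >> 16);
    return out;
}
```

All intermediate sums fit comfortably in 32 bits, so no floating point is
needed anywhere in the conversion.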
The accelerated SSE2 rgb to ycbcr encoder (rfx_sse2.c) was completely changed
and simplified in order to make use of the SSE2 signed 16-bit integer
multiplication: 2 times faster
Also modified the non-accelerated ycbcr to rgb decoder (rfx_decode.c) to use
32-bit integer multiplications with shifted factors instead of floating point
multiplications: 3 times faster
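The same idea in the decoding direction, again only as a sketch: the names,
the BT.601-style coefficients and the 8-bit offset input format are
assumptions, not what the decoder source actually contains.

```c
#include <stdint.h>

/* Hypothetical 8-bit RGB result type. */
typedef struct { uint8_t r, g, b; } rgb8_t;

/* Converts a 16.16 fixed-point value to 0..255. Negative values are clamped
 * before shifting so that no negative number is ever right-shifted. */
static uint8_t fixed16_to_u8(int32_t v)
{
    if (v < 0)
        return 0;
    v >>= 16;
    return v > 255 ? 255 : (uint8_t)v;
}

/* Factors are the real coefficients times 2^16, e.g. 1.402 * 65536 = 91881. */
static rgb8_t ycbcr_to_rgb_fixed(uint8_t y, uint8_t cb, uint8_t cr)
{
    const int32_t yy = (int32_t)y * 65536 + 32768; /* luma plus rounding term */
    const int32_t b0 = (int32_t)cb - 128;          /* remove chroma offset */
    const int32_t r0 = (int32_t)cr - 128;
    rgb8_t out;

    out.r = fixed16_to_u8(yy + 91881 * r0);
    out.g = fixed16_to_u8(yy - 22553 * b0 - 46802 * r0);
    out.b = fixed16_to_u8(yy + 116130 * b0);
    return out;
}
```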
The current ycbcr decoder was losing some bits because cr/cb was multiplied by
the shifted factors.
Instead one should multiply by the non-shifted factors and shift the result.
The effects of these lost bits are easily seen by comparing the colors of a
RemoteFX session with the colors of a plain RDP session - they are just wrong ;)
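A small sketch of how bits get lost when the shifting happens before the
terms are combined. It assumes a shift-and-add approximation of the factor
1.402 (1 + 1/4 + 1/8 + 1/32 = 1.40625); whether this matches the previous
decoder exactly is an assumption, and both function names are hypothetical.
Inputs are taken to be non-negative to keep the shifts well defined.

```c
#include <stdint.h>

/* Shift first: every term is truncated on its own, and the errors add up. */
static int32_t scale_cr_shift_first(int32_t cr)
{
    return cr + (cr >> 2) + (cr >> 3) + (cr >> 5);
}

/* Multiply by the full factor (1.402 * 2^16 = 91881), shift once at the end:
 * only a single truncation happens. */
static int32_t scale_cr_shift_last(int32_t cr)
{
    return (cr * 91881) >> 16;
}
```

For cr = 127 the exact product is about 178.05. The shift-last version
returns 178, while the shift-first version returns 127 + 31 + 15 + 3 = 176.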
I've replaced the bit-magic in the non-accelerated version (rfx_decode.c)
with simple float multiplications using the compiler's implicit integer
conversions. On several test machines this was even a little bit faster.
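Roughly what "float multiplications with implicit integer conversions" can
look like. This is a sketch under assumptions: the function name, the
BT.601-style coefficients and the signed, zero-centered chroma inputs are
illustrative and not taken from rfx_decode.c.

```c
#include <stdint.h>

static uint8_t clamp_to_u8(int v)
{
    return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

/* y is 0..255, cb and cr are centered on zero (-128..127). Assigning the
 * float expression to an int lets the compiler insert the conversion, which
 * truncates toward zero. */
static void ycbcr_to_rgb_float(int y, int cb, int cr, uint8_t rgb[3])
{
    int r = y + 1.402f * cr;
    int g = y - 0.344136f * cb - 0.714136f * cr;
    int b = y + 1.772f * cb;

    rgb[0] = clamp_to_u8(r);
    rgb[1] = clamp_to_u8(g);
    rgb[2] = clamp_to_u8(b);
}
```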
The accelerated SSE2 ycbcr decoder (rfx_sse2.c) was completely changed in order
to make use of the SSE2 signed 16-bit integer multiplication.
Fortunately the factors in the conversion matrix are so small that we can
easily shift them to the maximum possible 16-bit signed integer value without
losing any information and use _mm_mulhi_epi16 which takes the upper 16 bits
of the 32-bit result.
The SSE2 ycbcr decoder is now much simpler and about 40 percent faster.
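To illustrate the _mm_mulhi_epi16 approach, here is a sketch for the red
channel only, eight samples at a time. It is not the code in rfx_sse2.c: the
function name, the factor 1.402 scaled by 2^14, the pre-shift by 2 and the
input ranges (y in 0..255, cr in -128..127, both as 16-bit values) are
assumptions made for the example.

```c
#include <emmintrin.h>
#include <stdint.h>

/* Computes r = clamp(y + 1.402 * cr) for 8 samples. */
static void ycr_to_red_sse2(const int16_t *y, const int16_t *cr, uint8_t *r_out)
{
    /* 1.402 * 2^14 = 22970, which still fits a signed 16-bit lane. */
    const __m128i factor = _mm_set1_epi16(22970);

    __m128i vy  = _mm_loadu_si128((const __m128i *)y);
    __m128i vcr = _mm_loadu_si128((const __m128i *)cr);

    /* mulhi keeps the upper 16 bits, i.e. (a * b) >> 16. Pre-shifting cr by 2
     * together with the 2^14 factor makes the result equal cr * 1.402. */
    __m128i scaled = _mm_mulhi_epi16(_mm_slli_epi16(vcr, 2), factor);
    __m128i vr     = _mm_add_epi16(vy, scaled);

    /* Packing with unsigned saturation clamps every lane to 0..255. */
    __m128i packed = _mm_packus_epi16(vr, vr);
    _mm_storel_epi64((__m128i *)r_out, packed);
}
```

The green and blue channels would follow the same pattern with their own
factors and pre-shifts.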