diff --git a/src/external/stb_image_resize2.h b/src/external/stb_image_resize2.h index e0c42824..ae8d730b 100644 --- a/src/external/stb_image_resize2.h +++ b/src/external/stb_image_resize2.h @@ -1,9 +1,9 @@ -/* stb_image_resize2 - v2.01 - public domain image resizing - - by Jeff Roberts (v2) and Jorge L Rodriguez +/* stb_image_resize2 - v2.10 - public domain image resizing + + by Jeff Roberts (v2) and Jorge L Rodriguez http://github.com/nothings/stb - Can be threaded with the extended API. SSE2, AVX, Neon and WASM SIMD support. Only + Can be threaded with the extended API. SSE2, AVX, Neon and WASM SIMD support. Only scaling and translation is supported, no rotations or shears. COMPILING & LINKING @@ -67,60 +67,60 @@ ADDITIONAL DOCUMENTATION MEMORY ALLOCATION - By default, we use malloc and free for memory allocation. To override the + By default, we use malloc and free for memory allocation. To override the memory allocation, before the implementation #include, add a: #define STBIR_MALLOC(size,user_data) ... #define STBIR_FREE(ptr,user_data) ... - Each resize makes exactly one call to malloc/free (unless you use the + Each resize makes exactly one call to malloc/free (unless you use the extended API where you can do one allocation for many resizes). Under address sanitizer, we do separate allocations to find overread/writes. PERFORMANCE This library was written with an emphasis on performance. When testing - stb_image_resize with RGBA, the fastest mode is STBIR_4CHANNEL with + stb_image_resize with RGBA, the fastest mode is STBIR_4CHANNEL with STBIR_TYPE_UINT8 pixels and CLAMPed edges (which is what many other resize - libs do by default). Also, make sure SIMD is turned on of course (default + libs do by default). Also, make sure SIMD is turned on of course (default for 64-bit targets). Avoid WRAP edge mode if you want the fastest speed. This library also comes with profiling built-in. If you define STBIR_PROFILE, - you can use the advanced API and get low-level profiling information by + you can use the advanced API and get low-level profiling information by calling stbir_resize_extended_profile_info() or stbir_resize_split_profile_info() after a resize. SIMD - Most of the routines have optimized SSE2, AVX, NEON and WASM versions. + Most of the routines have optimized SSE2, AVX, NEON and WASM versions. - On Microsoft compilers, we automatically turn on SIMD for 64-bit x64 and - ARM; for 32-bit x86 and ARM, you select SIMD mode by defining STBIR_SSE2 or + On Microsoft compilers, we automatically turn on SIMD for 64-bit x64 and + ARM; for 32-bit x86 and ARM, you select SIMD mode by defining STBIR_SSE2 or STBIR_NEON. For AVX and AVX2, we auto-select it by detecting the /arch:AVX - or /arch:AVX2 switches. You can also always manually turn SSE2, AVX or AVX2 + or /arch:AVX2 switches. You can also always manually turn SSE2, AVX or AVX2 support on by defining STBIR_SSE2, STBIR_AVX or STBIR_AVX2. On Linux, SSE2 and Neon is on by default for 64-bit x64 or ARM64. For 32-bit, we select x86 SIMD mode by whether you have -msse2, -mavx or -mavx2 enabled on the command line. For 32-bit ARM, you must pass -mfpu=neon-vfpv4 for both - clang and GCC, but GCC also requires an additional -mfp16-format=ieee to + clang and GCC, but GCC also requires an additional -mfp16-format=ieee to automatically enable NEON. On x86 platforms, you can also define STBIR_FP16C to turn on FP16C instructions for converting back and forth to half-floats. This is autoselected when we - are using AVX2. Clang and GCC also require the -mf16c switch. ARM always uses - the built-in half float hardware NEON instructions. + are using AVX2. Clang and GCC also require the -mf16c switch. ARM always uses + the built-in half float hardware NEON instructions. - You can also tell us to use multiply-add instructions with STBIR_USE_FMA. + You can also tell us to use multiply-add instructions with STBIR_USE_FMA. Because x86 doesn't always have fma, we turn it off by default to maintain determinism across all platforms. If you don't care about non-FMA determinism - and are willing to restrict yourself to more recent x86 CPUs (around the AVX + and are willing to restrict yourself to more recent x86 CPUs (around the AVX timeframe), then fma will give you around a 15% speedup. You can force off SIMD in all cases by defining STBIR_NO_SIMD. You can turn off AVX or AVX2 specifically with STBIR_NO_AVX or STBIR_NO_AVX2. AVX is 10% to 40% faster, and AVX2 is generally another 12%. - + ALPHA CHANNEL - Most of the resizing functions provide the ability to control how the alpha + Most of the resizing functions provide the ability to control how the alpha channel of an image is processed. When alpha represents transparency, it is important that when combining @@ -167,33 +167,33 @@ stb_image_resize expects case #1 by default, applying alpha weighting to images, expecting the input images to be unpremultiplied. This is what the - COLOR+ALPHA buffer types tell the resizer to do. + COLOR+ALPHA buffer types tell the resizer to do. - When you use the pixel layouts STBIR_RGBA, STBIR_BGRA, STBIR_ARGB, - STBIR_ABGR, STBIR_RX, or STBIR_XR you are telling us that the pixels are - non-premultiplied. In these cases, the resizer will alpha weight the colors - (effectively creating the premultiplied image), do the filtering, and then + When you use the pixel layouts STBIR_RGBA, STBIR_BGRA, STBIR_ARGB, + STBIR_ABGR, STBIR_RX, or STBIR_XR you are telling us that the pixels are + non-premultiplied. In these cases, the resizer will alpha weight the colors + (effectively creating the premultiplied image), do the filtering, and then convert back to non-premult on exit. When you use the pixel layouts STBIR_RGBA_PM, STBIR_RGBA_PM, STBIR_RGBA_PM, - STBIR_RGBA_PM, STBIR_RX_PM or STBIR_XR_PM, you are telling that the pixels - ARE premultiplied. In this case, the resizer doesn't have to do the - premultipling - it can filter directly on the input. This about twice as - fast as the non-premultiplied case, so it's the right option if your data is + STBIR_RGBA_PM, STBIR_RX_PM or STBIR_XR_PM, you are telling that the pixels + ARE premultiplied. In this case, the resizer doesn't have to do the + premultipling - it can filter directly on the input. This about twice as + fast as the non-premultiplied case, so it's the right option if your data is already setup correctly. - When you use the pixel layout STBIR_4CHANNEL or STBIR_2CHANNEL, you are - telling us that there is no channel that represents transparency; it may be - RGB and some unrelated fourth channel that has been stored in the alpha - channel, but it is actually not alpha. No special processing will be - performed. + When you use the pixel layout STBIR_4CHANNEL or STBIR_2CHANNEL, you are + telling us that there is no channel that represents transparency; it may be + RGB and some unrelated fourth channel that has been stored in the alpha + channel, but it is actually not alpha. No special processing will be + performed. - The difference between the generic 4 or 2 channel layouts, and the + The difference between the generic 4 or 2 channel layouts, and the specialized _PM versions is with the _PM versions you are telling us that the data *is* alpha, just don't premultiply it. That's important when using SRGB pixel formats, we need to know where the alpha is, because it is converted linearly (rather than with the SRGB converters). - + Because alpha weighting produces the same effect as premultiplying, you even have the option with non-premultiplied inputs to let the resizer produce a premultiplied output. Because the intially computed alpha-weighted @@ -201,10 +201,10 @@ than the normal path which un-premultiplies the output image as a final step. Finally, when converting both in and out of non-premulitplied space (for - example, when using STBIR_RGBA), we go to somewhat heroic measures to - ensure that areas with zero alpha value pixels get something reasonable - in the RGB values. If you don't care about the RGB values of zero alpha - pixels, you can call the stbir_set_non_pm_alpha_speed_over_quality() + example, when using STBIR_RGBA), we go to somewhat heroic measures to + ensure that areas with zero alpha value pixels get something reasonable + in the RGB values. If you don't care about the RGB values of zero alpha + pixels, you can call the stbir_set_non_pm_alpha_speed_over_quality() function - this runs a premultiplied resize about 25% faster. That said, when you really care about speed, using premultiplied pixels for both in and out (STBIR_RGBA_PM, etc) much faster than both of these premultiplied @@ -218,38 +218,38 @@ layouts with the same number of channels. DETERMINISM - We commit to being deterministic (from x64 to ARM to scalar to SIMD, etc). - This requires compiling with fast-math off (using at least /fp:precise). + We commit to being deterministic (from x64 to ARM to scalar to SIMD, etc). + This requires compiling with fast-math off (using at least /fp:precise). Also, you must turn off fp-contracting (which turns mult+adds into fmas)! - We attempt to do this with pragmas, but with Clang, you usually want to add + We attempt to do this with pragmas, but with Clang, you usually want to add -ffp-contract=off to the command line as well. - For 32-bit x86, you must use SSE and SSE2 codegen for determinism. That is, - if the scalar x87 unit gets used at all, we immediately lose determinism. + For 32-bit x86, you must use SSE and SSE2 codegen for determinism. That is, + if the scalar x87 unit gets used at all, we immediately lose determinism. On Microsoft Visual Studio 2008 and earlier, from what we can tell there is - no way to be deterministic in 32-bit x86 (some x87 always leaks in, even - with fp:strict). On 32-bit x86 GCC, determinism requires both -msse2 and + no way to be deterministic in 32-bit x86 (some x87 always leaks in, even + with fp:strict). On 32-bit x86 GCC, determinism requires both -msse2 and -fpmath=sse. Note that we will not be deterministic with float data containing NaNs - - the NaNs will propagate differently on different SIMD and platforms. + the NaNs will propagate differently on different SIMD and platforms. - If you turn on STBIR_USE_FMA, then we will be deterministic with other - fma targets, but we will differ from non-fma targets (this is unavoidable, - because a fma isn't simply an add with a mult - it also introduces a - rounding difference compared to non-fma instruction sequences. + If you turn on STBIR_USE_FMA, then we will be deterministic with other + fma targets, but we will differ from non-fma targets (this is unavoidable, + because a fma isn't simply an add with a mult - it also introduces a + rounding difference compared to non-fma instruction sequences. FLOAT PIXEL FORMAT RANGE - Any range of values can be used for the non-alpha float data that you pass - in (0 to 1, -1 to 1, whatever). However, if you are inputting float values - but *outputting* bytes or shorts, you must use a range of 0 to 1 so that we - scale back properly. The alpha channel must also be 0 to 1 for any format - that does premultiplication prior to resizing. + Any range of values can be used for the non-alpha float data that you pass + in (0 to 1, -1 to 1, whatever). However, if you are inputting float values + but *outputting* bytes or shorts, you must use a range of 0 to 1 so that we + scale back properly. The alpha channel must also be 0 to 1 for any format + that does premultiplication prior to resizing. - Note also that with float output, using filters with negative lobes, the - output filtered values might go slightly out of range. You can define - STBIR_FLOAT_LOW_CLAMP and/or STBIR_FLOAT_HIGH_CLAMP to specify the range - to clamp to on output, if that's important. + Note also that with float output, using filters with negative lobes, the + output filtered values might go slightly out of range. You can define + STBIR_FLOAT_LOW_CLAMP and/or STBIR_FLOAT_HIGH_CLAMP to specify the range + to clamp to on output, if that's important. MAX/MIN SCALE FACTORS The input pixel resolutions are in integers, and we do the internal pointer @@ -263,13 +263,13 @@ buffers). FLIPPED IMAGES - Stride is just the delta from one scanline to the next. This means you can - use a negative stride to handle inverted images (point to the final + Stride is just the delta from one scanline to the next. This means you can + use a negative stride to handle inverted images (point to the final scanline and use a negative stride). You can invert the input or output, using negative strides. DEFAULT FILTERS - For functions which don't provide explicit control over what filters to + For functions which don't provide explicit control over what filters to use, you can change the compile-time defaults with: #define STBIR_DEFAULT_FILTER_UPSAMPLE STBIR_FILTER_something @@ -278,18 +278,18 @@ See stbir_filter in the header-file section for the list of filters. NEW FILTERS - A number of 1D filter kernels are supplied. For a list of supported - filters, see the stbir_filter enum. You can install your own filters by + A number of 1D filter kernels are supplied. For a list of supported + filters, see the stbir_filter enum. You can install your own filters by using the stbir_set_filter_callbacks function. PROGRESS - For interactive use with slow resize operations, you can use the the - scanline callbacks in the extended API. It would have to be a *very* large + For interactive use with slow resize operations, you can use the the + scanline callbacks in the extended API. It would have to be a *very* large image resample to need progress though - we're very fast. CEIL and FLOOR - In scalar mode, the only functions we use from math.h are ceilf and floorf, - but if you have your own versions, you can define the STBIR_CEILF(v) and + In scalar mode, the only functions we use from math.h are ceilf and floorf, + but if you have your own versions, you can define the STBIR_CEILF(v) and STBIR_FLOORF(v) macros and we'll use them instead. In SIMD, we just use our own versions. @@ -304,7 +304,7 @@ * For SIMD encode and decode scanline routines, do any pre-aligning for bad input/output buffer alignments and pitch? * For very wide scanlines, we should we do vertical strips to stay within - L2 cache. Maybe do chunks of 1K pixels at a time. There would be + L2 cache. Maybe do chunks of 1K pixels at a time. There would be some pixel reconversion, but probably dwarfed by things falling out of cache. Probably also something possible with alternating between scattering and gathering at high resize scales? @@ -316,21 +316,38 @@ the pivot cost and the extra memory touches). Need to buffer the whole image so have to balance memory use. * Most of our code is internally function pointers, should we compile - all the SIMD stuff always and dynamically dispatch? + all the SIMD stuff always and dynamically dispatch? CONTRIBUTORS Jeff Roberts: 2.0 implementation, optimizations, SIMD - Martins Mozeiko: NEON simd, WASM simd, clang and GCC whisperer. + Martins Mozeiko: NEON simd, WASM simd, clang and GCC whisperer Fabian Giesen: half float and srgb converters Sean Barrett: API design, optimizations Jorge L Rodriguez: Original 1.0 implementation - Aras Pranckevicius: bugfixes for 1.0 + Aras Pranckevicius: bugfixes Nathan Reed: warning fixes for 1.0 REVISIONS - 2.00 (2022-02-20) mostly new source: new api, optimizations, simd, vertical-first, etc - (2x-5x faster without simd, 4x-12x faster with simd) - (in some cases, 20x to 40x faster - resizing to very small for example) + 2.10 (2024-07-27) fix the defines GCC and mingw for loop unroll control, + fix MSVC 32-bit arm half float routines. + 2.09 (2024-06-19) fix the defines for 32-bit ARM GCC builds (was selecting + hardware half floats). + 2.08 (2024-06-10) fix for RGB->BGR three channel flips and add SIMD (thanks + to Ryan Salsbury), fix for sub-rect resizes, use the + pragmas to control unrolling when they are available. + 2.07 (2024-05-24) fix for slow final split during threaded conversions of very + wide scanlines when downsampling (caused by extra input + converting), fix for wide scanline resamples with many + splits (int overflow), fix GCC warning. + 2.06 (2024-02-10) fix for identical width/height 3x or more down-scaling + undersampling a single row on rare resize ratios (about 1%). + 2.05 (2024-02-07) fix for 2 pixel to 1 pixel resizes with wrap (thanks Aras), + fix for output callback (thanks Julien Koenen). + 2.04 (2023-11-17) fix for rare AVX bug, shadowed symbol (thanks Nikola Smiljanic). + 2.03 (2023-11-01) ASAN and TSAN warnings fixed, minor tweaks. + 2.00 (2023-10-10) mostly new source: new api, optimizations, simd, vertical-first, etc + 2x-5x faster without simd, 4x-12x faster with simd, + in some cases, 20x to 40x faster esp resizing large to very small. 0.96 (2019-03-04) fixed warnings 0.95 (2017-07-23) fixed warnings 0.94 (2017-03-18) fixed warnings @@ -368,7 +385,7 @@ typedef uint64_t stbir_uint64; #define STBIR_SSE #endif #endif -#endif +#endif #if defined(_x86_64) || defined( __x86_64__ ) || defined( _M_X64 ) || defined(__x86_64) || defined(_M_AMD64) || defined(__SSE2__) || defined(STBIR_SSE) || defined(STBIR_SSE2) #ifndef STBIR_SSE2 @@ -383,7 +400,7 @@ typedef uint64_t stbir_uint64; #endif #if defined(__AVX2__) || defined(STBIR_AVX2) #ifndef STBIR_NO_AVX2 - #ifndef STBIR_AVX2 + #ifndef STBIR_AVX2 #define STBIR_AVX2 #endif #if defined( _MSC_VER ) && !defined(__clang__) @@ -400,15 +417,15 @@ typedef uint64_t stbir_uint64; #endif #endif -#if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || defined(_M_ARM) || (__ARM_NEON_FP & 4) != 0 && __ARM_FP16_FORMAT_IEEE != 0 +#if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || ((__ARM_NEON_FP & 4) != 0) || defined(__ARM_NEON__) #ifndef STBIR_NEON #define STBIR_NEON #endif #endif -#if defined(_M_ARM) +#if defined(_M_ARM) || defined(__arm__) #ifdef STBIR_USE_FMA -#undef STBIR_USE_FMA // no FMA for 32-bit arm on MSVC +#undef STBIR_USE_FMA // no FMA for 32-bit arm on MSVC #endif #endif @@ -435,7 +452,7 @@ typedef uint64_t stbir_uint64; // // Easy-to-use API: // -// * stride is the offset between successive rows of image data +// * stride is the offset between successive rows of image data // in memory, in bytes. specify 0 for packed continuously in memory // * colorspace is linear or sRGB as specified by function name // * Uses the default filters @@ -448,27 +465,35 @@ typedef uint64_t stbir_uint64; // order of channels // whether color is premultiplied by alpha // for back compatibility, you can cast the old channel count to an stbir_pixel_layout -typedef enum +typedef enum { - STBIR_BGR = 0, // 3-chan, with order specified (for channel flipping) - STBIR_1CHANNEL = 1, + STBIR_1CHANNEL = 1, STBIR_2CHANNEL = 2, - STBIR_RGB = 3, // 3-chan, with order specified (for channel flipping) - STBIR_RGBA = 4, // alpha formats, alpha is NOT premultiplied into color channels - + STBIR_RGB = 3, // 3-chan, with order specified (for channel flipping) + STBIR_BGR = 0, // 3-chan, with order specified (for channel flipping) STBIR_4CHANNEL = 5, + + STBIR_RGBA = 4, // alpha formats, where alpha is NOT premultiplied into color channels STBIR_BGRA = 6, STBIR_ARGB = 7, STBIR_ABGR = 8, STBIR_RA = 9, STBIR_AR = 10, - STBIR_RGBA_PM = 11, // alpha formats, alpha is premultiplied into color channels + STBIR_RGBA_PM = 11, // alpha formats, where alpha is premultiplied into color channels STBIR_BGRA_PM = 12, STBIR_ARGB_PM = 13, STBIR_ABGR_PM = 14, STBIR_RA_PM = 15, STBIR_AR_PM = 16, + + STBIR_RGBA_NO_AW = 11, // alpha formats, where NO alpha weighting is applied at all! + STBIR_BGRA_NO_AW = 12, // these are just synonyms for the _PM flags (which also do + STBIR_ARGB_NO_AW = 13, // no alpha weighting). These names just make it more clear + STBIR_ABGR_NO_AW = 14, // for some folks). + STBIR_RA_NO_AW = 15, + STBIR_AR_NO_AW = 16, + } stbir_pixel_layout; //=============================================================== @@ -549,8 +574,8 @@ STBIRDEF void * stbir_resize( const void *input_pixels , int input_w , int inpu // * Separate input and output data types // * Can specify regions with subpixel correctness // * Can specify alpha flags -// * Can specify a memory callback -// * Can specify a callback data type for pixel input and output +// * Can specify a memory callback +// * Can specify a callback data type for pixel input and output // * Can be threaded for a single resize // * Can be used to resize many frames without recalculating the sampler info // @@ -577,7 +602,7 @@ typedef float stbir__kernel_callback( float x, float scale, void * user_data ); typedef float stbir__support_callback( float scale, void * user_data ); // internal structure with precomputed scaling -typedef struct stbir__info stbir__info; +typedef struct stbir__info stbir__info; typedef struct STBIR_RESIZE // use the stbir_resize_init and stbir_override functions to set these values for future compatibility { @@ -604,7 +629,7 @@ typedef struct STBIR_RESIZE // use the stbir_resize_init and stbir_override fun stbir_edge horizontal_edge, vertical_edge; stbir__kernel_callback * horizontal_filter_kernel; stbir__support_callback * horizontal_filter_support; stbir__kernel_callback * vertical_filter_kernel; stbir__support_callback * vertical_filter_support; - stbir__info * samplers; + stbir__info * samplers; } STBIR_RESIZE; // extended complexity api @@ -620,7 +645,7 @@ STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize, // You can update these parameters any time after resize_init and there is no cost //-------------------------------- -STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type ); +STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type ); STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb ); // no callbacks by default STBIRDEF void stbir_set_user_data( STBIR_RESIZE * resize, void * user_data ); // pass back STBIR_RESIZE* by default STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_pixels, int input_stride_in_bytes, void * output_pixels, int output_stride_in_bytes ); @@ -636,7 +661,7 @@ STBIRDEF int stbir_set_pixel_layouts( STBIR_RESIZE * resize, stbir_pixel_layout STBIRDEF int stbir_set_edgemodes( STBIR_RESIZE * resize, stbir_edge horizontal_edge, stbir_edge vertical_edge ); // CLAMP by default STBIRDEF int stbir_set_filters( STBIR_RESIZE * resize, stbir_filter horizontal_filter, stbir_filter vertical_filter ); // STBIR_DEFAULT_FILTER_UPSAMPLE/DOWNSAMPLE by default -STBIRDEF int stbir_set_filter_callbacks( STBIR_RESIZE * resize, stbir__kernel_callback * horizontal_filter, stbir__support_callback * horizontal_support, stbir__kernel_callback * vertical_filter, stbir__support_callback * vertical_support ); +STBIRDEF int stbir_set_filter_callbacks( STBIR_RESIZE * resize, stbir__kernel_callback * horizontal_filter, stbir__support_callback * horizontal_support, stbir__kernel_callback * vertical_filter, stbir__support_callback * vertical_support ); STBIRDEF int stbir_set_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh ); // sets both sub-regions (full regions by default) STBIRDEF int stbir_set_input_subrect( STBIR_RESIZE * resize, double s0, double t0, double s1, double t1 ); // sets input sub-region (full region by default) @@ -658,7 +683,7 @@ STBIRDEF int stbir_set_non_pm_alpha_speed_over_quality( STBIR_RESIZE * resize, i //-------------------------------- // This builds the samplers and does one allocation -STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize ); +STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize ); // You MUST call this, if you call stbir_build_samplers or stbir_build_samplers_with_splits STBIRDEF void stbir_free_samplers( STBIR_RESIZE * resize ); @@ -681,7 +706,7 @@ STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize ); // It returns the number of splits (threads) that you can call it with. /// It might be less if the image resize can't be split up that many ways. -STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int try_splits ); +STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int try_splits ); // This function does a split of the resizing (you call this fuction for each // split, on multiple threads). A split is a piece of the output resize pixel space. @@ -691,10 +716,10 @@ STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int try_sp // Usually, you will always call stbir_resize_split with split_start as the thread_index // and "1" for the split_count. // But, if you have a weird situation where you MIGHT want 8 threads, but sometimes -// only 4 threads, you can use 0,2,4,6 for the split_start's and use "2" for the +// only 4 threads, you can use 0,2,4,6 for the split_start's and use "2" for the // split_count each time to turn in into a 4 thread resize. (This is unusual). -STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start, int split_count ); +STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start, int split_count ); //=============================================================== @@ -705,10 +730,10 @@ STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start // The input callback is super flexible - it calls you with the input address // (based on the stride and base pointer), it gives you an optional_output // pointer that you can fill, or you can just return your own pointer into -// your own data. +// your own data. // -// You can also do conversion from non-supported data types if necessary - in -// this case, you ignore the input_ptr and just use the x and y parameters to +// You can also do conversion from non-supported data types if necessary - in +// this case, you ignore the input_ptr and just use the x and y parameters to // calculate your own input_ptr based on the size of each non-supported pixel. // (Something like the third example below.) // @@ -722,14 +747,14 @@ STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start // return input_ptr; // use buffer from call // } // -// Next example, copying: (copy from some other buffer or stream): +// Next example, copying: (copy from some other buffer or stream): // void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context ) // { // CopyOrStreamData( optional_output, other_data_src, num_pixels * pixel_width_in_bytes ); // return optional_output; // return the optional buffer that we filled // } // -// Third example, input another buffer without copying: (zero-copy from other buffer): +// Third example, input another buffer without copying: (zero-copy from other buffer): // void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context ) // { // void * pixels = ( (char*) other_image_base ) + ( y * other_image_stride ) + ( x * other_pixel_width_in_bytes ); @@ -758,7 +783,7 @@ STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start #ifdef STBIR_PROFILE -typedef struct STBIR_PROFILE_INFO +typedef struct STBIR_PROFILE_INFO { stbir_uint64 total_clocks; @@ -766,7 +791,7 @@ typedef struct STBIR_PROFILE_INFO // there are "resize_count" number of zones stbir_uint64 clocks[ 8 ]; char const ** descriptions; - + // count of clocks and descriptions stbir_uint32 count; } STBIR_PROFILE_INFO; @@ -865,15 +890,15 @@ STBIRDEF void stbir_resize_split_profile_info( STBIR_PROFILE_INFO * out_info, ST #endif // the internal pixel layout enums are in a different order, so we can easily do range comparisons of types -// the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible -typedef enum +// the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible +typedef enum { STBIRI_1CHANNEL = 0, STBIRI_2CHANNEL = 1, STBIRI_RGB = 2, STBIRI_BGR = 3, STBIRI_4CHANNEL = 4, - + STBIRI_RGBA = 5, STBIRI_BGRA = 6, STBIRI_ARGB = 7, @@ -979,7 +1004,7 @@ typedef struct stbir__span spans[2]; // can be two spans, if doing input subrect with clamp mode WRAP } stbir__extents; -typedef struct +typedef struct { #ifdef STBIR_PROFILE union @@ -1010,7 +1035,7 @@ typedef struct typedef void stbir__decode_pixels_func( float * decode, int width_times_channels, void const * input ); typedef void stbir__alpha_weight_func( float * decode_buffer, int width_times_channels ); -typedef void stbir__horizontal_gather_channels_func( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, +typedef void stbir__horizontal_gather_channels_func( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width ); typedef void stbir__alpha_unweight_func(float * encode_buffer, int width_times_channels ); typedef void stbir__encode_pixels_func( void * output, int width_times_channels, float const * encode ); @@ -1053,10 +1078,10 @@ struct stbir__info stbir__horizontal_gather_channels_func * horizontal_gather_channels; stbir__alpha_unweight_func * alpha_unweight; stbir__encode_pixels_func * encode_pixels; - - int alloced_total; + + int alloc_ring_buffer_num_entries; // Number of entries in the ring buffer that will be allocated int splits; // count of splits - + stbir_internal_pixel_layout input_pixel_layout_internal; stbir_internal_pixel_layout output_pixel_layout_internal; @@ -1065,7 +1090,7 @@ struct stbir__info int vertical_first; int channels; int effective_channels; // same as channels, except on RGBA/ARGB (7), or XA/AX (3) - int alloc_ring_buffer_num_entries; // Number of entries in the ring buffer that will be allocated + size_t alloced_total; }; @@ -1076,10 +1101,11 @@ struct stbir__info #define stbir__small_float ((float)1 / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20)) // min/max friendly -#define STBIR_CLAMP(x, xmin, xmax) do { \ +#define STBIR_CLAMP(x, xmin, xmax) for(;;) { \ if ( (x) < (xmin) ) (x) = (xmin); \ if ( (x) > (xmax) ) (x) = (xmax); \ -} while (0) + break; \ +} static stbir__inline int stbir__min(int a, int b) { @@ -1141,7 +1167,7 @@ static const stbir_uint32 fp32_to_srgb8_tab4[104] = { 0x44c20798, 0x488e071e, 0x4c1c06b6, 0x4f76065d, 0x52a50610, 0x55ac05cc, 0x5892058f, 0x5b590559, 0x5e0c0a23, 0x631c0980, 0x67db08f6, 0x6c55087f, 0x70940818, 0x74a007bd, 0x787d076c, 0x7c330723, }; - + static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) { static const stbir__FP32 almostone = { 0x3f7fffff }; // 1-eps @@ -1172,19 +1198,44 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) #define STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT 32 // when downsampling and <= 32 scanlines of buffering, use gather. gather used down to 1/8th scaling for 25% win. #endif -// restrict pointers for the output pointers +#ifndef STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS +#define STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS 4 // when threading, what is the minimum number of scanlines for a split? +#endif + +// restrict pointers for the output pointers, other loop and unroll control #if defined( _MSC_VER ) && !defined(__clang__) #define STBIR_STREAMOUT_PTR( star ) star __restrict #define STBIR_NO_UNROLL( ptr ) __assume(ptr) // this oddly keeps msvc from unrolling a loop -#elif defined( __clang__ ) - #define STBIR_STREAMOUT_PTR( star ) star __restrict__ - #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr)) -#elif defined( __GNUC__ ) + #if _MSC_VER >= 1900 + #define STBIR_NO_UNROLL_LOOP_START __pragma(loop( no_vector )) + #else + #define STBIR_NO_UNROLL_LOOP_START + #endif +#elif defined( __clang__ ) + #define STBIR_STREAMOUT_PTR( star ) star __restrict__ + #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr)) + #if ( __clang_major__ >= 4 ) || ( ( __clang_major__ >= 3 ) && ( __clang_minor__ >= 5 ) ) + #define STBIR_NO_UNROLL_LOOP_START _Pragma("clang loop unroll(disable)") _Pragma("clang loop vectorize(disable)") + #else + #define STBIR_NO_UNROLL_LOOP_START + #endif +#elif defined( __GNUC__ ) #define STBIR_STREAMOUT_PTR( star ) star __restrict__ #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr)) + #if __GNUC__ >= 14 + #define STBIR_NO_UNROLL_LOOP_START _Pragma("GCC unroll 0") _Pragma("GCC novector") + #else + #define STBIR_NO_UNROLL_LOOP_START + #endif + #define STBIR_NO_UNROLL_LOOP_START_INF_FOR #else #define STBIR_STREAMOUT_PTR( star ) star #define STBIR_NO_UNROLL( ptr ) + #define STBIR_NO_UNROLL_LOOP_START +#endif + +#ifndef STBIR_NO_UNROLL_LOOP_START_INF_FOR +#define STBIR_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START #endif #ifdef STBIR_NO_SIMD // force simd off for whatever reason @@ -1223,7 +1274,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) #ifdef STBIR_SSE2 #include - + #define stbir__simdf __m128 #define stbir__simdi __m128i @@ -1254,7 +1305,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) #define stbir__simdi_store2( ptr, reg ) _mm_storel_epi64( (__m128i*)(ptr), (reg) ) #define stbir__prefetch( ptr ) _mm_prefetch((char*)(ptr), _MM_HINT_T0 ) - + #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \ { \ stbir__simdi zero = _mm_setzero_si128(); \ @@ -1285,7 +1336,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)_mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(f,STBIR__CONSTF(STBIR_max_uint8_as_float)),_mm_setzero_ps())))) #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)_mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(f,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps())))) - #define stbir__simdi_to_int( i ) _mm_cvtsi128_si32(i) + #define stbir__simdi_to_int( i ) _mm_cvtsi128_si32(i) #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = _mm_cvtepi32_ps( ireg ) #define stbir__simdf_add( out, reg0, reg1 ) (out) = _mm_add_ps( reg0, reg1 ) #define stbir__simdf_mult( out, reg0, reg1 ) (out) = _mm_mul_ps( reg0, reg1 ) @@ -1440,10 +1491,10 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) #define stbir__simdi8_convert_i32_to_float(out, ireg) (out) = _mm256_cvtepi32_ps( ireg ) #define stbir__simdf8_convert_float_to_i32( i, f ) (i) = _mm256_cvttps_epi32(f) - + #define stbir__simdf8_bot4s( out, a, b ) (out) = _mm256_permute2f128_ps(a,b, (0<<0)+(2<<4) ) #define stbir__simdf8_top4s( out, a, b ) (out) = _mm256_permute2f128_ps(a,b, (1<<0)+(3<<4) ) - + #define stbir__simdf8_gettop4( reg ) _mm256_extractf128_ps(reg,1) #ifdef STBIR_AVX2 @@ -1471,8 +1522,8 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) out = _mm256_castsi256_si128( _mm256_permute4x64_epi64( _mm256_packus_epi16( t, t ), (0<<0)+(2<<2)+(1<<4)+(3<<6) ) ); \ } - #define stbir__simdi8_expand_u16_to_u32(out,ireg) out = _mm256_unpacklo_epi16( _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),(0<<0)+(2<<2)+(1<<4)+(3<<6)), _mm256_setzero_si256() ); - + #define stbir__simdi8_expand_u16_to_u32(out,ireg) out = _mm256_unpacklo_epi16( _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),(0<<0)+(2<<2)+(1<<4)+(3<<6)), _mm256_setzero_si256() ); + #define stbir__simdf8_pack_to_16words(out,aa,bb) \ { \ stbir__simdf8 af,bf; \ @@ -1496,7 +1547,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) a = _mm_unpackhi_epi8( ireg, zero ); \ out1 = _mm256_setr_m128i( _mm_unpacklo_epi16( a, zero ), _mm_unpackhi_epi16( a, zero ) ); \ } - + #define stbir__simdf8_pack_to_16bytes(out,aa,bb) \ { \ stbir__simdi t; \ @@ -1514,7 +1565,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) t = _mm_packus_epi16( t, t ); \ out = _mm_castps_si128( _mm_shuffle_ps( _mm_castsi128_ps(out), _mm_castsi128_ps(t), (0<<0)+(1<<2)+(0<<4)+(1<<6) ) ); \ } - + #define stbir__simdi8_expand_u16_to_u32(out,ireg) \ { \ stbir__simdi a,b,zero = _mm_setzero_si128(); \ @@ -1549,7 +1600,6 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) #define stbir__simdf8_0123to2222( out, in ) (out) = stbir__simdf_swiz(_mm256_castps256_ps128(in), 2,2,2,2 ) - #define stbir__simdf8_load2( out, ptr ) (out) = _mm256_castsi256_ps(_mm256_castsi128_si256( _mm_loadl_epi64( (__m128i*)(ptr)) )) // top values can be random (not denormal or nan for perf) #define stbir__simdf8_load4b( out, ptr ) (out) = _mm256_broadcast_ps( (__m128 const *)(ptr) ) static __m256i stbir_00112233 = { STBIR__CONST_4d_32i( 0, 0, 1, 1 ), STBIR__CONST_4d_32i( 2, 2, 3, 3 ) }; @@ -1582,11 +1632,11 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) #ifdef STBIR_USE_FMA // not on by default to maintain bit identical simd to non-simd #define stbir__simdf8_madd( out, add, mul1, mul2 ) (out) = _mm256_fmadd_ps( mul1, mul2, add ) #define stbir__simdf8_madd_mem( out, add, mul, ptr ) (out) = _mm256_fmadd_ps( mul, _mm256_loadu_ps( (float const*)(ptr) ), add ) - #define stbir__simdf8_madd_mem4( out, add, mul, ptr ) (out) = _mm256_fmadd_ps( _mm256_castps128_ps256( mul ), _mm256_castps128_ps256( _mm_loadu_ps( (float const*)(ptr) ) ), add ) + #define stbir__simdf8_madd_mem4( out, add, mul, ptr )(out) = _mm256_fmadd_ps( _mm256_setr_m128( mul, _mm_setzero_ps() ), _mm256_setr_m128( _mm_loadu_ps( (float const*)(ptr) ), _mm_setzero_ps() ), add ) #else #define stbir__simdf8_madd( out, add, mul1, mul2 ) (out) = _mm256_add_ps( add, _mm256_mul_ps( mul1, mul2 ) ) #define stbir__simdf8_madd_mem( out, add, mul, ptr ) (out) = _mm256_add_ps( add, _mm256_mul_ps( mul, _mm256_loadu_ps( (float const*)(ptr) ) ) ) - #define stbir__simdf8_madd_mem4( out, add, mul, ptr ) (out) = _mm256_add_ps( add, _mm256_castps128_ps256( _mm_mul_ps( mul, _mm_loadu_ps( (float const*)(ptr) ) ) ) ) + #define stbir__simdf8_madd_mem4( out, add, mul, ptr ) (out) = _mm256_add_ps( add, _mm256_setr_m128( _mm_mul_ps( mul, _mm_loadu_ps( (float const*)(ptr) ) ), _mm_setzero_ps() ) ) #endif #define stbir__if_simdf8_cast_to_simdf4( val ) _mm256_castps256_ps128( val ) @@ -1627,7 +1677,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) } #elif defined(STBIR_NEON) - + #include #define stbir__simdf float32x4_t @@ -1686,7 +1736,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) #define stbir__simdf_convert_float_to_i32( i, f ) (i) = vreinterpretq_u32_s32( vcvtq_s32_f32(f) ) #define stbir__simdf_convert_float_to_int( f ) vgetq_lane_s32(vcvtq_s32_f32(f), 0) - #define stbir__simdi_to_int( i ) (int)vgetq_lane_u32(i, 0) + #define stbir__simdi_to_int( i ) (int)vgetq_lane_u32(i, 0) #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)vgetq_lane_s32(vcvtq_s32_f32(vmaxq_f32(vminq_f32(f,STBIR__CONSTF(STBIR_max_uint8_as_float)),vdupq_n_f32(0))), 0)) #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)vgetq_lane_s32(vcvtq_s32_f32(vmaxq_f32(vminq_f32(f,STBIR__CONSTF(STBIR_max_uint16_as_float)),vdupq_n_f32(0))), 0)) #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = vcvtq_f32_s32( vreinterpretq_s32_u32(ireg) ) @@ -1737,12 +1787,20 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) ((stbir_uint64)(4*b+0)<<32) | ((stbir_uint64)(4*b+1)<<40) | ((stbir_uint64)(4*b+2)<<48) | ((stbir_uint64)(4*b+3)<<56)), \ vcreate_u8( (4*c+0) | ((4*c+1)<<8) | ((4*c+2)<<16) | ((4*c+3)<<24) | \ ((stbir_uint64)(4*d+0)<<32) | ((stbir_uint64)(4*d+1)<<40) | ((stbir_uint64)(4*d+2)<<48) | ((stbir_uint64)(4*d+3)<<56) ) ) + + static stbir__inline uint8x16x2_t stbir_make16x2(float32x4_t rega,float32x4_t regb) + { + uint8x16x2_t r = { vreinterpretq_u8_f32(rega), vreinterpretq_u8_f32(regb) }; + return r; + } #else #define stbir_make16(a,b,c,d) (uint8x16_t){4*a+0,4*a+1,4*a+2,4*a+3,4*b+0,4*b+1,4*b+2,4*b+3,4*c+0,4*c+1,4*c+2,4*c+3,4*d+0,4*d+1,4*d+2,4*d+3} + #define stbir_make16x2(a,b) (uint8x16x2_t){{vreinterpretq_u8_f32(a),vreinterpretq_u8_f32(b)}} #endif #define stbir__simdf_swiz( reg, one, two, three, four ) vreinterpretq_f32_u8( vqtbl1q_u8( vreinterpretq_u8_f32(reg), stbir_make16(one, two, three, four) ) ) - + #define stbir__simdf_swiz2( rega, regb, one, two, three, four ) vreinterpretq_f32_u8( vqtbl2q_u8( stbir_make16x2(rega,regb), stbir_make16(one, two, three, four) ) ) + #define stbir__simdi_16madd( out, reg0, reg1 ) \ { \ int16x8_t r0 = vreinterpretq_s16_u32(reg0); \ @@ -1942,7 +2000,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) #define stbir__simdf_convert_float_to_i32( i, f ) (i) = wasm_i32x4_trunc_sat_f32x4(f) #define stbir__simdf_convert_float_to_int( f ) wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(f), 0) - #define stbir__simdi_to_int( i ) wasm_i32x4_extract_lane(i, 0) + #define stbir__simdi_to_int( i ) wasm_i32x4_extract_lane(i, 0) #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_max(wasm_f32x4_min(f,STBIR_max_uint8_as_float),wasm_f32x4_const_splat(0))), 0)) #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_max(wasm_f32x4_min(f,STBIR_max_uint16_as_float),wasm_f32x4_const_splat(0))), 0)) #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = wasm_f32x4_convert_i32x4(ireg) @@ -2125,7 +2183,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) #endif -#if defined(STBIR_NEON) && !defined(_M_ARM) +#if defined(STBIR_NEON) && !defined(_M_ARM) && !defined(__arm__) #if defined( _MSC_VER ) && !defined(__clang__) typedef __int16 stbir__FP16; @@ -2142,7 +2200,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) #endif -#if !defined(STBIR_NEON) && !defined(STBIR_FP16C) || defined(STBIR_NEON) && defined(_M_ARM) +#if (!defined(STBIR_NEON) && !defined(STBIR_FP16C)) || (defined(STBIR_NEON) && defined(_M_ARM)) || (defined(STBIR_NEON) && defined(__arm__)) // Fabian's half float routines, see: https://gist.github.com/rygorous/2156668 @@ -2168,7 +2226,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) unsigned int sign_mask = 0x80000000u; stbir__FP16 o = { 0 }; stbir__FP32 f; - unsigned int sign; + unsigned int sign; f.f = val; sign = f.u & sign_mask; @@ -2369,24 +2427,6 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) stbir__simdi_store( output,final ); } -#elif defined(STBIR_WASM) || (defined(STBIR_NEON) && defined(_MSC_VER) && defined(_M_ARM)) // WASM or 32-bit ARM on MSVC/clang - - static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input) - { - for (int i=0; i<8; i++) - { - output[i] = stbir__half_to_float(input[i]); - } - } - - static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input) - { - for (int i=0; i<8; i++) - { - output[i] = stbir__float_to_half(input[i]); - } - } - #elif defined(STBIR_NEON) && defined(_MSC_VER) && defined(_M_ARM64) && !defined(__clang__) // 64-bit ARM on MSVC (not clang) static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input) @@ -2415,7 +2455,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0).n16_u16[0]; } -#elif defined(STBIR_NEON) // 64-bit ARM +#elif defined(STBIR_NEON) && ( defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) ) // 64-bit ARM static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input) { @@ -2441,6 +2481,23 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0); } +#elif defined(STBIR_WASM) || (defined(STBIR_NEON) && (defined(_MSC_VER) || defined(_M_ARM) || defined(__arm__))) // WASM or 32-bit ARM on MSVC/clang + + static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input) + { + for (int i=0; i<8; i++) + { + output[i] = stbir__half_to_float(input[i]); + } + } + static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input) + { + for (int i=0; i<8; i++) + { + output[i] = stbir__float_to_half(input[i]); + } + } + #endif @@ -2462,10 +2519,10 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in) #define stbir__simdf_0123to3012( out, reg ) (out) = stbir__simdf_swiz( reg, 3,0,1,2 ) #define stbir__simdf_0123to0011( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,1,1 ) #define stbir__simdf_0123to1100( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,0,0 ) -#define stbir__simdf_0123to2233( out, reg ) (out) = stbir__simdf_swiz( reg, 2,2,3,3 ) -#define stbir__simdf_0123to1133( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,3,3 ) -#define stbir__simdf_0123to0022( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,2,2 ) -#define stbir__simdf_0123to1032( out, reg ) (out) = stbir__simdf_swiz( reg, 1,0,3,2 ) +#define stbir__simdf_0123to2233( out, reg ) (out) = stbir__simdf_swiz( reg, 2,2,3,3 ) +#define stbir__simdf_0123to1133( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,3,3 ) +#define stbir__simdf_0123to0022( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,2,2 ) +#define stbir__simdf_0123to1032( out, reg ) (out) = stbir__simdf_swiz( reg, 1,0,3,2 ) typedef union stbir__simdi_u32 { @@ -2493,14 +2550,16 @@ static const STBIR__SIMDI_CONST(STBIR_topscale, 0x02000000); // Adding this switch saves about 5K on clang which is Captain Unroll the 3rd. #define STBIR_SIMD_STREAMOUT_PTR( star ) STBIR_STREAMOUT_PTR( star ) #define STBIR_SIMD_NO_UNROLL(ptr) STBIR_NO_UNROLL(ptr) +#define STBIR_SIMD_NO_UNROLL_LOOP_START STBIR_NO_UNROLL_LOOP_START +#define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START_INF_FOR #ifdef STBIR_MEMCPY #undef STBIR_MEMCPY -#define STBIR_MEMCPY stbir_simd_memcpy #endif +#define STBIR_MEMCPY stbir_simd_memcpy // override normal use of memcpy with much simpler copy (faster and smaller with our sized copies) -static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes ) +static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes ) { char STBIR_SIMD_STREAMOUT_PTR (*) d = (char*) dest; char STBIR_SIMD_STREAMOUT_PTR( * ) d_end = ((char*) dest) + bytes; @@ -2513,8 +2572,9 @@ static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes ) { if ( bytes < 16 ) { - if ( bytes ) + if ( bytes ) { + STBIR_SIMD_NO_UNROLL_LOOP_START do { STBIR_SIMD_NO_UNROLL(d); @@ -2529,8 +2589,9 @@ static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes ) // do one unaligned to get us aligned for the stream out below stbir__simdf_load( x, ( d + ofs_to_src ) ); stbir__simdf_store( d, x ); - d = (char*)( ( ( (ptrdiff_t)d ) + 16 ) & ~15 ); + d = (char*)( ( ( (size_t)d ) + 16 ) & ~15 ); + STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR for(;;) { STBIR_SIMD_NO_UNROLL(d); @@ -2561,12 +2622,13 @@ static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes ) stbir__simdfX_store( d + 4*stbir__simdfX_float_count, x1 ); stbir__simdfX_store( d + 8*stbir__simdfX_float_count, x2 ); stbir__simdfX_store( d + 12*stbir__simdfX_float_count, x3 ); - d = (char*)( ( ( (ptrdiff_t)d ) + (16*stbir__simdfX_float_count) ) & ~((16*stbir__simdfX_float_count)-1) ); + d = (char*)( ( ( (size_t)d ) + (16*stbir__simdfX_float_count) ) & ~((16*stbir__simdfX_float_count)-1) ); + STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR for(;;) { STBIR_SIMD_NO_UNROLL(d); - + if ( d > ( d_end - (16*stbir__simdfX_float_count) ) ) { if ( d == d_end ) @@ -2590,7 +2652,7 @@ static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes ) // memcpy that is specically intentionally overlapping (src is smaller then dest, so can be // a normal forward copy, bytes is divisible by 4 and bytes is greater than or equal to // the diff between dest and src) -static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes ) +static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes ) { char STBIR_SIMD_STREAMOUT_PTR (*) sd = (char*) src; char STBIR_SIMD_STREAMOUT_PTR( * ) s_end = ((char*) src) + bytes; @@ -2599,6 +2661,7 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte if ( ofs_to_dest >= 16 ) // is the overlap more than 16 away? { char STBIR_SIMD_STREAMOUT_PTR( * ) s_end16 = ((char*) src) + (bytes&~15); + STBIR_SIMD_NO_UNROLL_LOOP_START do { stbir__simdf x; @@ -2615,7 +2678,7 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte do { STBIR_SIMD_NO_UNROLL(sd); - *(int*)( sd + ofs_to_dest ) = *(int*) sd; + *(int*)( sd + ofs_to_dest ) = *(int*) sd; sd += 4; } while ( sd < s_end ); } @@ -2624,13 +2687,17 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte // when in scalar mode, we let unrolling happen, so this macro just does the __restrict #define STBIR_SIMD_STREAMOUT_PTR( star ) STBIR_STREAMOUT_PTR( star ) -#define STBIR_SIMD_NO_UNROLL(ptr) +#define STBIR_SIMD_NO_UNROLL(ptr) +#define STBIR_SIMD_NO_UNROLL_LOOP_START +#define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR #endif // SSE2 #ifdef STBIR_PROFILE +#ifndef STBIR_PROFILE_FUNC + #if defined(_x86_64) || defined( __x86_64__ ) || defined( _M_X64 ) || defined(__x86_64) || defined(__SSE2__) || defined(STBIR_SSE) || defined( _M_IX86_FP ) || defined(__i386) || defined( __i386__ ) || defined( _M_IX86 ) || defined( _X86_ ) #ifdef _MSC_VER @@ -2640,7 +2707,7 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte #else // non msvc - static stbir__inline stbir_uint64 STBIR_PROFILE_FUNC() + static stbir__inline stbir_uint64 STBIR_PROFILE_FUNC() { stbir_uint32 lo, hi; asm volatile ("rdtsc" : "=a" (lo), "=d" (hi) ); @@ -2649,7 +2716,7 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte #endif // msvc -#elif defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || defined(__ARM_NEON__) +#elif defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || defined(__ARM_NEON__) #if defined( _MSC_VER ) && !defined(__clang__) @@ -2670,8 +2737,9 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte #error Unknown platform for profiling. -#endif //x64 and +#endif // x64, arm +#endif // STBIR_PROFILE_FUNC #define STBIR_ONLY_PROFILE_GET_SPLIT_INFO ,stbir__per_split_info * split_info #define STBIR_ONLY_PROFILE_SET_SPLIT_INFO ,split_info @@ -2680,7 +2748,7 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte #define STBIR_ONLY_PROFILE_BUILD_SET_INFO ,profile_info // super light-weight micro profiler -#define STBIR_PROFILE_START_ll( info, wh ) { stbir_uint64 wh##thiszonetime = STBIR_PROFILE_FUNC(); stbir_uint64 * wh##save_parent_excluded_ptr = info->current_zone_excluded_ptr; stbir_uint64 wh##current_zone_excluded = 0; info->current_zone_excluded_ptr = &wh##current_zone_excluded; +#define STBIR_PROFILE_START_ll( info, wh ) { stbir_uint64 wh##thiszonetime = STBIR_PROFILE_FUNC(); stbir_uint64 * wh##save_parent_excluded_ptr = info->current_zone_excluded_ptr; stbir_uint64 wh##current_zone_excluded = 0; info->current_zone_excluded_ptr = &wh##current_zone_excluded; #define STBIR_PROFILE_END_ll( info, wh ) wh##thiszonetime = STBIR_PROFILE_FUNC() - wh##thiszonetime; info->profile.named.wh += wh##thiszonetime - wh##current_zone_excluded; *wh##save_parent_excluded_ptr += wh##thiszonetime; info->current_zone_excluded_ptr = wh##save_parent_excluded_ptr; } #define STBIR_PROFILE_FIRST_START_ll( info, wh ) { int i; info->current_zone_excluded_ptr = &info->profile.named.total; for(i=0;iprofile.array);i++) info->profile.array[i]=0; } STBIR_PROFILE_START_ll( info, wh ); #define STBIR_PROFILE_CLEAR_EXTRAS_ll( info, num ) { int extra; for(extra=1;extra<(num);extra++) { int i; for(i=0;iprofile.array);i++) (info)[extra].profile.array[i]=0; } } @@ -2710,8 +2778,8 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte #define STBIR_PROFILE_FIRST_START( wh ) #define STBIR_PROFILE_CLEAR_EXTRAS( ) -#define STBIR_PROFILE_BUILD_START( wh ) -#define STBIR_PROFILE_BUILD_END( wh ) +#define STBIR_PROFILE_BUILD_START( wh ) +#define STBIR_PROFILE_BUILD_END( wh ) #define STBIR_PROFILE_BUILD_FIRST_START( wh ) #define STBIR_PROFILE_BUILD_CLEAR( info ) @@ -2736,10 +2804,10 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte #ifndef STBIR_SIMD -// memcpy that is specically intentionally overlapping (src is smaller then dest, so can be +// memcpy that is specifically intentionally overlapping (src is smaller then dest, so can be // a normal forward copy, bytes is divisible by 4 and bytes is greater than or equal to // the diff between dest and src) -static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes ) +static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes ) { char STBIR_SIMD_STREAMOUT_PTR (*) sd = (char*) src; char STBIR_SIMD_STREAMOUT_PTR( * ) s_end = ((char*) src) + bytes; @@ -2748,10 +2816,11 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte if ( ofs_to_dest >= 8 ) // is the overlap more than 8 away? { char STBIR_SIMD_STREAMOUT_PTR( * ) s_end8 = ((char*) src) + (bytes&~7); + STBIR_NO_UNROLL_LOOP_START do { STBIR_NO_UNROLL(sd); - *(stbir_uint64*)( sd + ofs_to_dest ) = *(stbir_uint64*) sd; + *(stbir_uint64*)( sd + ofs_to_dest ) = *(stbir_uint64*) sd; sd += 8; } while ( sd < s_end8 ); @@ -2759,10 +2828,11 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte return; } + STBIR_NO_UNROLL_LOOP_START do { STBIR_NO_UNROLL(sd); - *(int*)( sd + ofs_to_dest ) = *(int*) sd; + *(int*)( sd + ofs_to_dest ) = *(int*) sd; sd += 4; } while ( sd < s_end ); } @@ -2863,13 +2933,6 @@ static float stbir__filter_mitchell(float x, float s, void * user_data) return (0.0f); } -static float stbir__support_zero(float s, void * user_data) -{ - STBIR__UNUSED(s); - STBIR__UNUSED(user_data); - return 0; -} - static float stbir__support_zeropoint5(float s, void * user_data) { STBIR__UNUSED(s); @@ -2884,7 +2947,7 @@ static float stbir__support_one(float s, void * user_data) return 1; } -static float stbir__support_two(float s, void * user_data) +static float stbir__support_two(float s, void * user_data) { STBIR__UNUSED(s); STBIR__UNUSED(user_data); @@ -2903,7 +2966,7 @@ static int stbir__get_filter_pixel_width(stbir__support_callback * support, floa return (int)STBIR_CEILF(support(scale,user_data) * 2.0f / scale); } -// this is how many coefficents per run of the filter (which is different +// this is how many coefficents per run of the filter (which is different // from the filter_pixel_width depending on if we are scattering or gathering) static int stbir__get_coefficient_width(stbir__sampler * samp, int is_gather, void * user_data) { @@ -2924,7 +2987,7 @@ static int stbir__get_coefficient_width(stbir__sampler * samp, int is_gather, vo } } -static int stbir__get_contributors(stbir__sampler * samp, int is_gather) +static int stbir__get_contributors(stbir__sampler * samp, int is_gather) { if (is_gather) return samp->scale_info.output_sub_size; @@ -2954,7 +3017,7 @@ static int stbir__edge_reflect_full( int n, int max ) { if (n < 0) { - if (n > -max) + if (n > -max) return -n; else return max - 1; @@ -3056,7 +3119,7 @@ static void stbir__get_extents( stbir__sampler * samp, stbir__extents * scanline left_margin = -min_n; min_n = 0; } - + right_margin = 0; if ( max_n >= input_full_size ) { @@ -3081,7 +3144,7 @@ static void stbir__get_extents( stbir__sampler * samp, stbir__extents * scanline // don't have to do edge calc for zero clamp if ( edge == STBIR_EDGE_ZERO ) return; - + // convert margin pixels to the pixels within the input (min and max) for( j = -left_margin ; j < 0 ; j++ ) { @@ -3179,20 +3242,20 @@ static void stbir__calculate_in_pixel_range( int * first_pixel, int * last_pixel float out_pixel_influence_lowerbound = out_pixel_center - out_filter_radius; float out_pixel_influence_upperbound = out_pixel_center + out_filter_radius; - float in_pixel_influence_lowerbound = (out_pixel_influence_lowerbound + out_shift) * inv_scale; - float in_pixel_influence_upperbound = (out_pixel_influence_upperbound + out_shift) * inv_scale; + float in_pixel_influence_lowerbound = (out_pixel_influence_lowerbound + out_shift) * inv_scale; + float in_pixel_influence_upperbound = (out_pixel_influence_upperbound + out_shift) * inv_scale; first = (int)(STBIR_FLOORF(in_pixel_influence_lowerbound + 0.5f)); last = (int)(STBIR_FLOORF(in_pixel_influence_upperbound - 0.5f)); if ( edge == STBIR_EDGE_WRAP ) { - if ( first <= -input_size ) - first = -(input_size-1); + if ( first < -input_size ) + first = -input_size; if ( last >= (input_size*2)) last = (input_size*2) - 1; } - + *first_pixel = first; *last_pixel = last; } @@ -3213,10 +3276,10 @@ static void stbir__calculate_coefficients_for_gather_upsample( float out_filter_ int i; int last_non_zero; float out_pixel_center = (float)n + 0.5f; - float in_center_of_out = (out_pixel_center + out_shift) * inv_scale; + float in_center_of_out = (out_pixel_center + out_shift) * inv_scale; int in_first_pixel, in_last_pixel; - + stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, out_pixel_center, out_filter_radius, inv_scale, out_shift, input_size, edge ); last_non_zero = -1; @@ -3229,7 +3292,7 @@ static void stbir__calculate_coefficients_for_gather_upsample( float out_filter_ if ( ( ( coeff < stbir__small_float ) && ( coeff > -stbir__small_float ) ) ) { if ( i == 0 ) // if we're at the front, just eat zero contributors - { + { STBIR_ASSERT ( ( in_last_pixel - in_first_pixel ) != 0 ); // there should be at least one contrib ++in_first_pixel; i--; @@ -3239,10 +3302,10 @@ static void stbir__calculate_coefficients_for_gather_upsample( float out_filter_ } else last_non_zero = i; - + coefficient_group[i] = coeff; } - + in_last_pixel = last_non_zero+in_first_pixel; // kills trailing zeros contributors->n0 = in_first_pixel; contributors->n1 = in_last_pixel; @@ -3354,7 +3417,7 @@ static void stbir__calculate_coefficients_for_gather_downsample( int start, int stbir__contributors * contribs = contributors + out; // is this the first time this output pixel has been seen? Init it. - if ( out > first_out_inited ) + if ( out > first_out_inited ) { STBIR_ASSERT( out == ( first_out_inited + 1 ) ); // ensure we have only advanced one at time first_out_inited = out; @@ -3362,7 +3425,7 @@ static void stbir__calculate_coefficients_for_gather_downsample( int start, int contribs->n1 = in_pixel; coeffs[0] = coeff; } - else + else { // insert on end (always in order) if ( coeffs[0] == 0.0f ) // if the first coefficent is zero, then zap it for this coeffs @@ -3379,10 +3442,16 @@ static void stbir__calculate_coefficients_for_gather_downsample( int start, int } } +#ifdef STBIR_RENORMALIZE_IN_FLOAT +#define STBIR_RENORM_TYPE float +#else +#define STBIR_RENORM_TYPE double +#endif + static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter_extent_info* filter_info, stbir__scale_info * scale_info, int num_contributors, stbir__contributors* contributors, float * coefficient_group, int coefficient_width ) { int input_size = scale_info->input_full_size; - int input_last_n1 = input_size - 1; + int input_last_n1 = input_size - 1; int n, end; int lowest = 0x7fffffff; int highest = -0x7fffffff; @@ -3400,14 +3469,14 @@ static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter for (n = 0; n < end; n++) { int i; - float filter_scale, total_filter = 0; + STBIR_RENORM_TYPE filter_scale, total_filter = 0; int e; // add all contribs e = contribs->n1 - contribs->n0; for( i = 0 ; i <= e ; i++ ) { - total_filter += coeffs[i]; + total_filter += (STBIR_RENORM_TYPE) coeffs[i]; STBIR_ASSERT( ( coeffs[i] >= -2.0f ) && ( coeffs[i] <= 2.0f ) ); // check for wonky weights } @@ -3423,10 +3492,11 @@ static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter // if the total isn't 1.0, rescale everything if ( ( total_filter < (1.0f-stbir__small_float) ) || ( total_filter > (1.0f+stbir__small_float) ) ) { - filter_scale = 1.0f / total_filter; + filter_scale = ((STBIR_RENORM_TYPE)1.0) / total_filter; + // scale them all for (i = 0; i <= e; i++) - coeffs[i] *= filter_scale; + coeffs[i] = (float) ( coeffs[i] * filter_scale ); } } ++contribs; @@ -3483,13 +3553,13 @@ static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter else if ( ( edge == STBIR_EDGE_CLAMP ) || ( edge == STBIR_EDGE_REFLECT ) ) { // for clamp and reflect, calculate the true inbounds position (based on edge type) and just add that to the existing weight - + // right hand side first if ( contribs->n1 > input_last_n1 ) { int start = contribs->n0; int endi = contribs->n1; - contribs->n1 = input_last_n1; + contribs->n1 = input_last_n1; for( i = input_size; i <= endi; i++ ) stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( i, input_size ), coeffs[i-start] ); } @@ -3500,18 +3570,18 @@ static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter int save_n0; float save_n0_coeff; float * c = coeffs - ( contribs->n0 + 1 ); - + // reinsert the coeffs with it reflected or clamped (insert accumulates, if the coeffs exist) - for( i = -1 ; i > contribs->n0 ; i-- ) + for( i = -1 ; i > contribs->n0 ; i-- ) stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( i, input_size ), *c-- ); save_n0 = contribs->n0; save_n0_coeff = c[0]; // save it, since we didn't do the final one (i==n0), because there might be too many coeffs to hold (before we resize)! // now slide all the coeffs down (since we have accumulated them in the positive contribs) and reset the first contrib - contribs->n0 = 0; + contribs->n0 = 0; for(i = 0 ; i <= contribs->n1 ; i++ ) coeffs[i] = coeffs[i-save_n0]; - + // now that we have shrunk down the contribs, we insert the first one safely stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( save_n0, input_size ), save_n0_coeff ); } @@ -3547,7 +3617,9 @@ static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter filter_info->widest = widest; } -static int stbir__pack_coefficients( int num_contributors, stbir__contributors* contributors, float * coefficents, int coefficient_width, int widest, int row_width ) +#undef STBIR_RENORM_TYPE + +static int stbir__pack_coefficients( int num_contributors, stbir__contributors* contributors, float * coefficents, int coefficient_width, int widest, int row0, int row1 ) { #define STBIR_MOVE_1( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint32*)(dest))[0] = ((stbir_uint32*)(src))[0]; } #define STBIR_MOVE_2( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint64*)(dest))[0] = ((stbir_uint64*)(src))[0]; } @@ -3556,6 +3628,10 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* #else #define STBIR_MOVE_4( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint64*)(dest))[0] = ((stbir_uint64*)(src))[0]; ((stbir_uint64*)(dest))[1] = ((stbir_uint64*)(src))[1]; } #endif + + int row_end = row1 + 1; + STBIR__UNUSED( row0 ); // only used in an assert + if ( coefficient_width != widest ) { float * pc = coefficents; @@ -3564,6 +3640,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* switch( widest ) { case 1: + STBIR_NO_UNROLL_LOOP_START do { STBIR_MOVE_1( pc, coeffs ); ++pc; @@ -3571,6 +3648,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* } while ( pc < pc_end ); break; case 2: + STBIR_NO_UNROLL_LOOP_START do { STBIR_MOVE_2( pc, coeffs ); pc += 2; @@ -3578,6 +3656,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* } while ( pc < pc_end ); break; case 3: + STBIR_NO_UNROLL_LOOP_START do { STBIR_MOVE_2( pc, coeffs ); STBIR_MOVE_1( pc+2, coeffs+2 ); @@ -3586,6 +3665,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* } while ( pc < pc_end ); break; case 4: + STBIR_NO_UNROLL_LOOP_START do { STBIR_MOVE_4( pc, coeffs ); pc += 4; @@ -3593,6 +3673,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* } while ( pc < pc_end ); break; case 5: + STBIR_NO_UNROLL_LOOP_START do { STBIR_MOVE_4( pc, coeffs ); STBIR_MOVE_1( pc+4, coeffs+4 ); @@ -3601,6 +3682,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* } while ( pc < pc_end ); break; case 6: + STBIR_NO_UNROLL_LOOP_START do { STBIR_MOVE_4( pc, coeffs ); STBIR_MOVE_2( pc+4, coeffs+4 ); @@ -3609,6 +3691,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* } while ( pc < pc_end ); break; case 7: + STBIR_NO_UNROLL_LOOP_START do { STBIR_MOVE_4( pc, coeffs ); STBIR_MOVE_2( pc+4, coeffs+4 ); @@ -3618,6 +3701,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* } while ( pc < pc_end ); break; case 8: + STBIR_NO_UNROLL_LOOP_START do { STBIR_MOVE_4( pc, coeffs ); STBIR_MOVE_4( pc+4, coeffs+4 ); @@ -3626,6 +3710,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* } while ( pc < pc_end ); break; case 9: + STBIR_NO_UNROLL_LOOP_START do { STBIR_MOVE_4( pc, coeffs ); STBIR_MOVE_4( pc+4, coeffs+4 ); @@ -3635,6 +3720,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* } while ( pc < pc_end ); break; case 10: + STBIR_NO_UNROLL_LOOP_START do { STBIR_MOVE_4( pc, coeffs ); STBIR_MOVE_4( pc+4, coeffs+4 ); @@ -3644,6 +3730,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* } while ( pc < pc_end ); break; case 11: + STBIR_NO_UNROLL_LOOP_START do { STBIR_MOVE_4( pc, coeffs ); STBIR_MOVE_4( pc+4, coeffs+4 ); @@ -3654,6 +3741,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* } while ( pc < pc_end ); break; case 12: + STBIR_NO_UNROLL_LOOP_START do { STBIR_MOVE_4( pc, coeffs ); STBIR_MOVE_4( pc+4, coeffs+4 ); @@ -3663,6 +3751,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* } while ( pc < pc_end ); break; default: + STBIR_NO_UNROLL_LOOP_START do { float * copy_end = pc + widest - 4; float * c = coeffs; @@ -3673,6 +3762,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* c += 4; } while ( pc <= copy_end ); copy_end += 4; + STBIR_NO_UNROLL_LOOP_START while ( pc < copy_end ) { STBIR_MOVE_1( pc, c ); @@ -3688,7 +3778,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* coefficents[ widest * num_contributors ] = 8888.0f; // the minimum we might read for unrolled filters widths is 12. So, we need to - // make sure we never read outside the decode buffer, by possibly moving + // make sure we never read outside the decode buffer, by possibly moving // the sample area back into the scanline, and putting zeros weights first. // we start on the right edge and check until we're well past the possible // clip area (2*widest). @@ -3697,13 +3787,13 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* float * coeffs = coefficents + widest * ( num_contributors - 1 ); // go until no chance of clipping (this is usually less than 8 lops) - while ( ( ( contribs->n0 + widest*2 ) >= row_width ) && ( contribs >= contributors ) ) + while ( ( contribs >= contributors ) && ( ( contribs->n0 + widest*2 ) >= row_end ) ) { // might we clip?? - if ( ( contribs->n0 + widest ) > row_width ) + if ( ( contribs->n0 + widest ) > row_end ) { int stop_range = widest; - + // if range is larger than 12, it will be handled by generic loops that can terminate on the exact length // of this contrib n1, instead of a fixed widest amount - so calculate this if ( widest > 12 ) @@ -3712,22 +3802,22 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* // how far will be read in the n_coeff loop (which depends on the widest count mod4); mod = widest & 3; - stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod; + stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod; // the n_coeff loops do a minimum amount of coeffs, so factor that in! if ( stop_range < ( 8 + mod ) ) stop_range = 8 + mod; } // now see if we still clip with the refined range - if ( ( contribs->n0 + stop_range ) > row_width ) + if ( ( contribs->n0 + stop_range ) > row_end ) { - int new_n0 = row_width - stop_range; + int new_n0 = row_end - stop_range; int num = contribs->n1 - contribs->n0 + 1; int backup = contribs->n0 - new_n0; float * from_co = coeffs + num - 1; float * to_co = from_co + backup; - STBIR_ASSERT( ( new_n0 >= 0 ) && ( new_n0 < contribs->n0 ) ); + STBIR_ASSERT( ( new_n0 >= row0 ) && ( new_n0 < contribs->n0 ) ); // move the coeffs over while( num ) @@ -3746,7 +3836,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors* // how far will be read in the n_coeff loop (which depends on the widest count mod4); mod = widest & 3; - stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod; + stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod; // the n_coeff loops do a minimum amount of coeffs, so factor that in! if ( stop_range < ( 8 + mod ) ) stop_range = 8 + mod; @@ -3774,7 +3864,7 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot int input_full_size = samp->scale_info.input_full_size; int gather_num_contributors = samp->num_contributors; stbir__contributors* gather_contributors = samp->contributors; - float * gather_coeffs = samp->coefficients; + float * gather_coeffs = samp->coefficients; int gather_coefficient_width = samp->coefficient_width; switch ( samp->is_gather ) @@ -3792,16 +3882,16 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot break; case 0: // scatter downsample (only on vertical) - case 2: // gather downsample + case 2: // gather downsample { float in_pixels_radius = support(scale,user_data) * inv_scale; int filter_pixel_margin = samp->filter_pixel_margin; int input_end = input_full_size + filter_pixel_margin; - + // if this is a scatter, we do a downsample gather to get the coeffs, and then pivot after if ( !samp->is_gather ) { - // check if we are using the same gather downsample on the horizontal as this vertical, + // check if we are using the same gather downsample on the horizontal as this vertical, // if so, then we don't have to generate them, we can just pivot from the horizontal. if ( other_axis_for_pivot ) { @@ -3846,30 +3936,37 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot float * scatter_coeffs = samp->coefficients + ( gn0 + filter_pixel_margin ) * scatter_coefficient_width; float * g_coeffs = gather_coeffs; scatter_contributors = samp->contributors + ( gn0 + filter_pixel_margin ); - + for (k = gn0 ; k <= gn1 ; k++ ) { float gc = *g_coeffs++; - if ( ( k > highest_set ) || ( scatter_contributors->n0 > scatter_contributors->n1 ) ) + + // skip zero and denormals - must skip zeros to avoid adding coeffs beyond scatter_coefficient_width + // (which happens when pivoting from horizontal, which might have dummy zeros) + if ( ( ( gc >= stbir__small_float ) || ( gc <= -stbir__small_float ) ) ) { + if ( ( k > highest_set ) || ( scatter_contributors->n0 > scatter_contributors->n1 ) ) { - // if we are skipping over several contributors, we need to clear the skipped ones - stbir__contributors * clear_contributors = samp->contributors + ( highest_set + filter_pixel_margin + 1); - while ( clear_contributors < scatter_contributors ) { - clear_contributors->n0 = 0; - clear_contributors->n1 = -1; - ++clear_contributors; + // if we are skipping over several contributors, we need to clear the skipped ones + stbir__contributors * clear_contributors = samp->contributors + ( highest_set + filter_pixel_margin + 1); + while ( clear_contributors < scatter_contributors ) + { + clear_contributors->n0 = 0; + clear_contributors->n1 = -1; + ++clear_contributors; + } } + scatter_contributors->n0 = n; + scatter_contributors->n1 = n; + scatter_coeffs[0] = gc; + highest_set = k; } - scatter_contributors->n0 = n; - scatter_contributors->n1 = n; - scatter_coeffs[0] = gc; - highest_set = k; - } - else - { - stbir__insert_coeff( scatter_contributors, scatter_coeffs, n, gc ); + else + { + stbir__insert_coeff( scatter_contributors, scatter_coeffs, n, gc ); + } + STBIR_ASSERT( ( scatter_contributors->n1 - scatter_contributors->n0 + 1 ) <= scatter_coefficient_width ); } ++scatter_contributors; scatter_coeffs += scatter_coefficient_width; @@ -3908,11 +4005,11 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot #define stbir__decode_suffix BGRA #define stbir__decode_swizzle -#define stbir__decode_order0 2 +#define stbir__decode_order0 2 #define stbir__decode_order1 1 #define stbir__decode_order2 0 #define stbir__decode_order3 3 -#define stbir__encode_order0 2 +#define stbir__encode_order0 2 #define stbir__encode_order1 1 #define stbir__encode_order2 0 #define stbir__encode_order3 3 @@ -3922,11 +4019,11 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot #define stbir__decode_suffix ARGB #define stbir__decode_swizzle -#define stbir__decode_order0 1 +#define stbir__decode_order0 1 #define stbir__decode_order1 2 #define stbir__decode_order2 3 #define stbir__decode_order3 0 -#define stbir__encode_order0 3 +#define stbir__encode_order0 3 #define stbir__encode_order1 0 #define stbir__encode_order2 1 #define stbir__encode_order3 2 @@ -3936,11 +4033,11 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot #define stbir__decode_suffix ABGR #define stbir__decode_swizzle -#define stbir__decode_order0 3 +#define stbir__decode_order0 3 #define stbir__decode_order1 2 #define stbir__decode_order2 1 #define stbir__decode_order3 0 -#define stbir__encode_order0 3 +#define stbir__encode_order0 3 #define stbir__encode_order1 2 #define stbir__encode_order2 1 #define stbir__encode_order3 0 @@ -3950,12 +4047,12 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot #define stbir__decode_suffix AR #define stbir__decode_swizzle -#define stbir__decode_order0 1 -#define stbir__decode_order1 0 +#define stbir__decode_order0 1 +#define stbir__decode_order1 0 #define stbir__decode_order2 3 #define stbir__decode_order3 2 -#define stbir__encode_order0 1 -#define stbir__encode_order1 0 +#define stbir__encode_order0 1 +#define stbir__encode_order1 0 #define stbir__encode_order2 3 #define stbir__encode_order3 2 #define stbir__coder_min_num 2 @@ -3973,9 +4070,10 @@ static void stbir__fancy_alpha_weight_4ch( float * out_buffer, int width_times_c // fancy alpha is stored internally as R G B A Rpm Gpm Bpm #ifdef STBIR_SIMD - + #ifdef STBIR_SIMD8 decode += 16; + STBIR_NO_UNROLL_LOOP_START while ( decode <= end_decode ) { stbir__simdf8 d0,d1,a0,a1,p0,p1; @@ -3998,8 +4096,9 @@ static void stbir__fancy_alpha_weight_4ch( float * out_buffer, int width_times_c out += 28; } decode -= 16; - #else + #else decode += 8; + STBIR_NO_UNROLL_LOOP_START while ( decode <= end_decode ) { stbir__simdf d0,a0,d1,a1,p0,p1; @@ -4022,12 +4121,14 @@ static void stbir__fancy_alpha_weight_4ch( float * out_buffer, int width_times_c // might be one last odd pixel #ifdef STBIR_SIMD8 + STBIR_NO_UNROLL_LOOP_START while ( decode < end_decode ) #else if ( decode < end_decode ) #endif { stbir__simdf d,a,p; + STBIR_NO_UNROLL(decode); stbir__simdf_load( d, decode ); stbir__simdf_0123to3333( a, d ); stbir__simdf_mult( p, a, d ); @@ -4069,6 +4170,7 @@ static void stbir__fancy_alpha_weight_2ch( float * out_buffer, int width_times_c decode += 8; if ( decode <= end_decode ) { + STBIR_NO_UNROLL_LOOP_START do { #ifdef STBIR_SIMD8 stbir__simdf8 d0,a0,p0; @@ -4077,11 +4179,11 @@ static void stbir__fancy_alpha_weight_2ch( float * out_buffer, int width_times_c stbir__simdf8_0123to11331133( p0, d0 ); stbir__simdf8_0123to00220022( a0, d0 ); stbir__simdf8_mult( p0, p0, a0 ); - + stbir__simdf_store2( out, stbir__if_simdf8_cast_to_simdf4( d0 ) ); stbir__simdf_store( out+2, stbir__if_simdf8_cast_to_simdf4( p0 ) ); stbir__simdf_store2h( out+3, stbir__if_simdf8_cast_to_simdf4( d0 ) ); - + stbir__simdf_store2( out+6, stbir__simdf8_gettop4( d0 ) ); stbir__simdf_store( out+8, stbir__simdf8_gettop4( p0 ) ); stbir__simdf_store2h( out+9, stbir__simdf8_gettop4( d0 ) ); @@ -4112,6 +4214,7 @@ static void stbir__fancy_alpha_weight_2ch( float * out_buffer, int width_times_c decode -= 8; #endif + STBIR_SIMD_NO_UNROLL_LOOP_START while( decode < end_decode ) { float x = decode[0], y = decode[1]; @@ -4132,6 +4235,7 @@ static void stbir__fancy_alpha_unweight_4ch( float * encode_buffer, int width_ti // fancy RGBA is stored internally as R G B A Rpm Gpm Bpm + STBIR_SIMD_NO_UNROLL_LOOP_START do { float alpha = input[3]; #ifdef STBIR_SIMD @@ -4199,6 +4303,7 @@ static void stbir__simple_alpha_weight_4ch( float * decode_buffer, int width_tim #ifdef STBIR_SIMD { decode += 2 * stbir__simdfX_float_count; + STBIR_NO_UNROLL_LOOP_START while ( decode <= end_decode ) { stbir__simdfX d0,a0,d1,a1; @@ -4217,6 +4322,7 @@ static void stbir__simple_alpha_weight_4ch( float * decode_buffer, int width_tim // few last pixels remnants #ifdef STBIR_SIMD8 + STBIR_NO_UNROLL_LOOP_START while ( decode < end_decode ) #else if ( decode < end_decode ) @@ -4252,6 +4358,7 @@ static void stbir__simple_alpha_weight_2ch( float * decode_buffer, int width_tim #ifdef STBIR_SIMD decode += 2 * stbir__simdfX_float_count; + STBIR_NO_UNROLL_LOOP_START while ( decode <= end_decode ) { stbir__simdfX d0,a0,d1,a1; @@ -4269,6 +4376,7 @@ static void stbir__simple_alpha_weight_2ch( float * decode_buffer, int width_tim decode -= 2 * stbir__simdfX_float_count; #endif + STBIR_SIMD_NO_UNROLL_LOOP_START while( decode < end_decode ) { float alpha = decode[1]; @@ -4283,6 +4391,7 @@ static void stbir__simple_alpha_unweight_4ch( float * encode_buffer, int width_t float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer; float const * end_output = encode_buffer + width_times_channels; + STBIR_SIMD_NO_UNROLL_LOOP_START do { float alpha = encode[3]; @@ -4330,9 +4439,77 @@ static void stbir__simple_flip_3ch( float * decode_buffer, int width_times_chann float STBIR_STREAMOUT_PTR(*) decode = decode_buffer; float const * end_decode = decode_buffer + width_times_channels; - decode += 12; +#ifdef STBIR_SIMD + #ifdef stbir__simdf_swiz2 // do we have two argument swizzles? + end_decode -= 12; + STBIR_NO_UNROLL_LOOP_START + while( decode <= end_decode ) + { + // on arm64 8 instructions, no overlapping stores + stbir__simdf a,b,c,na,nb; + STBIR_SIMD_NO_UNROLL(decode); + stbir__simdf_load( a, decode ); + stbir__simdf_load( b, decode+4 ); + stbir__simdf_load( c, decode+8 ); + + na = stbir__simdf_swiz2( a, b, 2, 1, 0, 5 ); + b = stbir__simdf_swiz2( a, b, 4, 3, 6, 7 ); + nb = stbir__simdf_swiz2( b, c, 0, 1, 4, 3 ); + c = stbir__simdf_swiz2( b, c, 2, 7, 6, 5 ); + + stbir__simdf_store( decode, na ); + stbir__simdf_store( decode+4, nb ); + stbir__simdf_store( decode+8, c ); + decode += 12; + } + end_decode += 12; + #else + end_decode -= 24; + STBIR_NO_UNROLL_LOOP_START + while( decode <= end_decode ) + { + // 26 instructions on x64 + stbir__simdf a,b,c,d,e,f,g; + float i21, i23; + STBIR_SIMD_NO_UNROLL(decode); + stbir__simdf_load( a, decode ); + stbir__simdf_load( b, decode+3 ); + stbir__simdf_load( c, decode+6 ); + stbir__simdf_load( d, decode+9 ); + stbir__simdf_load( e, decode+12 ); + stbir__simdf_load( f, decode+15 ); + stbir__simdf_load( g, decode+18 ); + + a = stbir__simdf_swiz( a, 2, 1, 0, 3 ); + b = stbir__simdf_swiz( b, 2, 1, 0, 3 ); + c = stbir__simdf_swiz( c, 2, 1, 0, 3 ); + d = stbir__simdf_swiz( d, 2, 1, 0, 3 ); + e = stbir__simdf_swiz( e, 2, 1, 0, 3 ); + f = stbir__simdf_swiz( f, 2, 1, 0, 3 ); + g = stbir__simdf_swiz( g, 2, 1, 0, 3 ); + + // stores overlap, need to be in order, + stbir__simdf_store( decode, a ); + i21 = decode[21]; + stbir__simdf_store( decode+3, b ); + i23 = decode[23]; + stbir__simdf_store( decode+6, c ); + stbir__simdf_store( decode+9, d ); + stbir__simdf_store( decode+12, e ); + stbir__simdf_store( decode+15, f ); + stbir__simdf_store( decode+18, g ); + decode[21] = i23; + decode[23] = i21; + decode += 24; + } + end_decode += 24; + #endif +#else + end_decode -= 12; + STBIR_NO_UNROLL_LOOP_START while( decode <= end_decode ) { + // 16 instructions float t0,t1,t2,t3; STBIR_NO_UNROLL(decode); t0 = decode[0]; t1 = decode[3]; t2 = decode[6]; t3 = decode[9]; @@ -4340,8 +4517,10 @@ static void stbir__simple_flip_3ch( float * decode_buffer, int width_times_chann decode[2] = t0; decode[5] = t1; decode[8] = t2; decode[11] = t3; decode += 12; } - decode -= 12; + end_decode += 12; +#endif + STBIR_NO_UNROLL_LOOP_START while( decode < end_decode ) { float t = decode[0]; @@ -4362,14 +4541,14 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir_edge edge_horizontal = stbir_info->horizontal.edge; stbir_edge edge_vertical = stbir_info->vertical.edge; int row = stbir__edge_wrap(edge_vertical, n, stbir_info->vertical.scale_info.input_full_size); - const void* input_plane_data = ( (char *) stbir_info->input_data ) + (ptrdiff_t)row * (ptrdiff_t) stbir_info->input_stride_bytes; + const void* input_plane_data = ( (char *) stbir_info->input_data ) + (size_t)row * (size_t) stbir_info->input_stride_bytes; stbir__span const * spans = stbir_info->scanline_extents.spans; float* full_decode_buffer = output_buffer - stbir_info->scanline_extents.conservative.n0 * effective_channels; // if we are on edge_zero, and we get in here with an out of bounds n, then the calculate filters has failed STBIR_ASSERT( !(edge_vertical == STBIR_EDGE_ZERO && (n < 0 || n >= stbir_info->vertical.scale_info.input_full_size)) ); - do + do { float * decode_buffer; void const * input_data; @@ -4377,7 +4556,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float int width_times_channels; int width; - if ( spans->n1 < spans->n0 ) + if ( spans->n1 < spans->n0 ) break; width = spans->n1 + 1 - spans->n0; @@ -4394,7 +4573,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float // call the callback with a temp buffer (that they can choose to use or not). the temp is just right aligned memory in the decode_buffer itself input_data = stbir_info->in_pixels_cb( ( (char*) end_decode ) - ( width * input_sample_in_bytes ), input_plane_data, width, spans->pixel_offset_for_input, row, stbir_info->user_data ); } - + STBIR_PROFILE_START( decode ); // convert the pixels info the float decode_buffer, (we index from end_decode, so that when channelsdecode_pixels( (float*)end_decode - width_times_channels, width_times_channels, input_data ); @@ -4418,7 +4597,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float // this code only runs if we're in edge_wrap, and we're doing the entire scanline int e, start_x[2]; int input_full_size = stbir_info->horizontal.scale_info.input_full_size; - + start_x[0] = -stbir_info->scanline_extents.edge_sizes[0]; // left edge start x start_x[1] = input_full_size; // right edge @@ -4447,7 +4626,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf tot,c; \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf_load1( c, hc ); \ - stbir__simdf_mult1_mem( tot, c, decode ); + stbir__simdf_mult1_mem( tot, c, decode ); #define stbir__2_coeff_only() \ stbir__simdf tot,c,d; \ @@ -4456,7 +4635,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_load2( d, decode ); \ stbir__simdf_mult( tot, c, d ); \ stbir__simdf_0123to1230( c, tot ); \ - stbir__simdf_add1( tot, tot, c ); + stbir__simdf_add1( tot, tot, c ); #define stbir__3_coeff_only() \ stbir__simdf tot,c,t; \ @@ -4466,7 +4645,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_0123to1230( c, tot ); \ stbir__simdf_0123to2301( t, tot ); \ stbir__simdf_add1( tot, tot, c ); \ - stbir__simdf_add1( tot, tot, t ); + stbir__simdf_add1( tot, tot, t ); #define stbir__store_output_tiny() \ stbir__simdf_store1( output, tot ); \ @@ -4483,7 +4662,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float #define stbir__4_coeff_continue_from_4( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf_load( c, hc + (ofs) ); \ - stbir__simdf_madd_mem( tot, tot, c, decode+(ofs) ); + stbir__simdf_madd_mem( tot, tot, c, decode+(ofs) ); #define stbir__1_coeff_remnant( ofs ) \ { stbir__simdf d; \ @@ -4495,7 +4674,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float { stbir__simdf d; \ stbir__simdf_load2z( c, hc+(ofs) ); \ stbir__simdf_load2( d, decode+(ofs) ); \ - stbir__simdf_madd( tot, tot, d, c ); } + stbir__simdf_madd( tot, tot, d, c ); } #define stbir__3_coeff_setup() \ stbir__simdf mask; \ @@ -4520,18 +4699,18 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float #define stbir__1_coeff_only() \ float tot; \ - tot = decode[0]*hc[0]; + tot = decode[0]*hc[0]; #define stbir__2_coeff_only() \ float tot; \ tot = decode[0] * hc[0]; \ - tot += decode[1] * hc[1]; + tot += decode[1] * hc[1]; #define stbir__3_coeff_only() \ float tot; \ tot = decode[0] * hc[0]; \ tot += decode[1] * hc[1]; \ - tot += decode[2] * hc[2]; + tot += decode[2] * hc[2]; #define stbir__store_output_tiny() \ output[0] = tot; \ @@ -4544,16 +4723,16 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float tot0 = decode[0] * hc[0]; \ tot1 = decode[1] * hc[1]; \ tot2 = decode[2] * hc[2]; \ - tot3 = decode[3] * hc[3]; + tot3 = decode[3] * hc[3]; #define stbir__4_coeff_continue_from_4( ofs ) \ tot0 += decode[0+(ofs)] * hc[0+(ofs)]; \ tot1 += decode[1+(ofs)] * hc[1+(ofs)]; \ tot2 += decode[2+(ofs)] * hc[2+(ofs)]; \ - tot3 += decode[3+(ofs)] * hc[3+(ofs)]; + tot3 += decode[3+(ofs)] * hc[3+(ofs)]; #define stbir__1_coeff_remnant( ofs ) \ - tot0 += decode[0+(ofs)] * hc[0+(ofs)]; + tot0 += decode[0+(ofs)] * hc[0+(ofs)]; #define stbir__2_coeff_remnant( ofs ) \ tot0 += decode[0+(ofs)] * hc[0+(ofs)]; \ @@ -4562,7 +4741,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float #define stbir__3_coeff_remnant( ofs ) \ tot0 += decode[0+(ofs)] * hc[0+(ofs)]; \ tot1 += decode[1+(ofs)] * hc[1+(ofs)]; \ - tot2 += decode[2+(ofs)] * hc[2+(ofs)]; + tot2 += decode[2+(ofs)] * hc[2+(ofs)]; #define stbir__store_output() \ output[0] = (tot0+tot2)+(tot1+tot3); \ @@ -4570,7 +4749,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float ++horizontal_contributors; \ output += 1; -#endif +#endif #define STBIR__horizontal_channels 1 #define STB_IMAGE_RESIZE_DO_HORIZONTALS @@ -4588,14 +4767,14 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_load1z( c, hc ); \ stbir__simdf_0123to0011( c, c ); \ stbir__simdf_load2( d, decode ); \ - stbir__simdf_mult( tot, d, c ); + stbir__simdf_mult( tot, d, c ); #define stbir__2_coeff_only() \ stbir__simdf tot,c; \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf_load2( c, hc ); \ stbir__simdf_0123to0011( c, c ); \ - stbir__simdf_mult_mem( tot, c, decode ); + stbir__simdf_mult_mem( tot, c, decode ); #define stbir__3_coeff_only() \ stbir__simdf tot,c,cs,d; \ @@ -4605,7 +4784,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_mult_mem( tot, c, decode ); \ stbir__simdf_0123to2222( c, cs ); \ stbir__simdf_load2z( d, decode+4 ); \ - stbir__simdf_madd( tot, tot, d, c ); + stbir__simdf_madd( tot, tot, d, c ); #define stbir__store_output_tiny() \ stbir__simdf_0123to2301( c, tot ); \ @@ -4628,7 +4807,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf8_load4b( cs, hc + (ofs) ); \ stbir__simdf8_0123to00112233( c, cs ); \ - stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); + stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); #define stbir__1_coeff_remnant( ofs ) \ { stbir__simdf t; \ @@ -4649,13 +4828,13 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf8_load4b( cs, hc + (ofs) ); \ stbir__simdf8_0123to00112233( c, cs ); \ stbir__simdf8_load6z( d, decode+(ofs)*2 ); \ - stbir__simdf8_madd( tot0, tot0, c, d ); } + stbir__simdf8_madd( tot0, tot0, c, d ); } #define stbir__store_output() \ - { stbir__simdf t,c; \ + { stbir__simdf t,d; \ stbir__simdf8_add4halves( t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0 ); \ - stbir__simdf_0123to2301( c, t ); \ - stbir__simdf_add( t, t, c ); \ + stbir__simdf_0123to2301( d, t ); \ + stbir__simdf_add( t, t, d ); \ stbir__simdf_store2( output, t ); \ horizontal_coefficients += coefficient_width; \ ++horizontal_contributors; \ @@ -4670,7 +4849,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_0123to0011( c, cs ); \ stbir__simdf_mult_mem( tot0, c, decode ); \ stbir__simdf_0123to2233( c, cs ); \ - stbir__simdf_mult_mem( tot1, c, decode+4 ); + stbir__simdf_mult_mem( tot1, c, decode+4 ); #define stbir__4_coeff_continue_from_4( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -4678,7 +4857,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_0123to0011( c, cs ); \ stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); \ stbir__simdf_0123to2233( c, cs ); \ - stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*2+4 ); + stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*2+4 ); #define stbir__1_coeff_remnant( ofs ) \ { stbir__simdf d; \ @@ -4690,7 +4869,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float #define stbir__2_coeff_remnant( ofs ) \ stbir__simdf_load2( cs, hc + (ofs) ); \ stbir__simdf_0123to0011( c, cs ); \ - stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); + stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); #define stbir__3_coeff_remnant( ofs ) \ { stbir__simdf d; \ @@ -4699,7 +4878,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); \ stbir__simdf_0123to2222( c, cs ); \ stbir__simdf_load2z( d, decode + (ofs) * 2 + 4 ); \ - stbir__simdf_madd( tot1, tot1, d, c ); } + stbir__simdf_madd( tot1, tot1, d, c ); } #define stbir__store_output() \ stbir__simdf_add( tot0, tot0, tot1 ); \ @@ -4718,7 +4897,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float float tota,totb,c; \ c = hc[0]; \ tota = decode[0]*c; \ - totb = decode[1]*c; + totb = decode[1]*c; #define stbir__2_coeff_only() \ float tota,totb,c; \ @@ -4727,7 +4906,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float totb = decode[1]*c; \ c = hc[1]; \ tota += decode[2]*c; \ - totb += decode[3]*c; + totb += decode[3]*c; // this weird order of add matches the simd #define stbir__3_coeff_only() \ @@ -4740,7 +4919,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float totb += decode[5]*c; \ c = hc[1]; \ tota += decode[2]*c; \ - totb += decode[3]*c; + totb += decode[3]*c; #define stbir__store_output_tiny() \ output[0] = tota; \ @@ -4762,7 +4941,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float totb2 = decode[5]*c; \ c = hc[3]; \ tota3 = decode[6]*c; \ - totb3 = decode[7]*c; + totb3 = decode[7]*c; #define stbir__4_coeff_continue_from_4( ofs ) \ c = hc[0+(ofs)]; \ @@ -4776,12 +4955,12 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float totb2 += decode[5+(ofs)*2]*c; \ c = hc[3+(ofs)]; \ tota3 += decode[6+(ofs)*2]*c; \ - totb3 += decode[7+(ofs)*2]*c; + totb3 += decode[7+(ofs)*2]*c; #define stbir__1_coeff_remnant( ofs ) \ c = hc[0+(ofs)]; \ tota0 += decode[0+(ofs)*2] * c; \ - totb0 += decode[1+(ofs)*2] * c; + totb0 += decode[1+(ofs)*2] * c; #define stbir__2_coeff_remnant( ofs ) \ c = hc[0+(ofs)]; \ @@ -4789,7 +4968,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float totb0 += decode[1+(ofs)*2] * c; \ c = hc[1+(ofs)]; \ tota1 += decode[2+(ofs)*2] * c; \ - totb1 += decode[3+(ofs)*2] * c; + totb1 += decode[3+(ofs)*2] * c; #define stbir__3_coeff_remnant( ofs ) \ c = hc[0+(ofs)]; \ @@ -4800,7 +4979,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float totb1 += decode[3+(ofs)*2] * c; \ c = hc[2+(ofs)]; \ tota2 += decode[4+(ofs)*2] * c; \ - totb2 += decode[5+(ofs)*2] * c; + totb2 += decode[5+(ofs)*2] * c; #define stbir__store_output() \ output[0] = (tota0+tota2)+(tota1+tota3); \ @@ -4809,7 +4988,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float ++horizontal_contributors; \ output += 2; -#endif +#endif #define STBIR__horizontal_channels 2 #define STB_IMAGE_RESIZE_DO_HORIZONTALS @@ -4827,7 +5006,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_load1z( c, hc ); \ stbir__simdf_0123to0001( c, c ); \ stbir__simdf_load( d, decode ); \ - stbir__simdf_mult( tot, d, c ); + stbir__simdf_mult( tot, d, c ); #define stbir__2_coeff_only() \ stbir__simdf tot,c,cs,d; \ @@ -4838,7 +5017,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_mult( tot, d, c ); \ stbir__simdf_0123to1111( c, cs ); \ stbir__simdf_load( d, decode+3 ); \ - stbir__simdf_madd( tot, tot, d, c ); + stbir__simdf_madd( tot, tot, d, c ); #define stbir__3_coeff_only() \ stbir__simdf tot,c,d,cs; \ @@ -4852,7 +5031,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_madd( tot, tot, d, c ); \ stbir__simdf_0123to2222( c, cs ); \ stbir__simdf_load( d, decode+6 ); \ - stbir__simdf_madd( tot, tot, d, c ); + stbir__simdf_madd( tot, tot, d, c ); #define stbir__store_output_tiny() \ stbir__simdf_store2( output, tot ); \ @@ -4872,7 +5051,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf8_0123to00001111( c, cs ); \ stbir__simdf8_mult_mem( tot0, c, decode - 1 ); \ stbir__simdf8_0123to22223333( c, cs ); \ - stbir__simdf8_mult_mem( tot1, c, decode+6 - 1 ); + stbir__simdf8_mult_mem( tot1, c, decode+6 - 1 ); #define stbir__4_coeff_continue_from_4( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -4880,26 +5059,26 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf8_0123to00001111( c, cs ); \ stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); \ stbir__simdf8_0123to22223333( c, cs ); \ - stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*3 + 6 - 1 ); + stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*3 + 6 - 1 ); #define stbir__1_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf_load1rep4( t, hc + (ofs) ); \ - stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*3 - 1 ); + stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*3 - 1 ); #define stbir__2_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf8_load4b( cs, hc + (ofs) - 2 ); \ stbir__simdf8_0123to22223333( c, cs ); \ - stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); - + stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); + #define stbir__3_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf8_load4b( cs, hc + (ofs) ); \ stbir__simdf8_0123to00001111( c, cs ); \ stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); \ stbir__simdf8_0123to2222( t, cs ); \ - stbir__simdf8_madd_mem4( tot1, tot1, t, decode+(ofs)*3 + 6 - 1 ); + stbir__simdf8_madd_mem4( tot1, tot1, t, decode+(ofs)*3 + 6 - 1 ); #define stbir__store_output() \ stbir__simdf8_add( tot0, tot0, tot1 ); \ @@ -4930,7 +5109,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_0123to1122( c, cs ); \ stbir__simdf_mult_mem( tot1, c, decode+4 ); \ stbir__simdf_0123to2333( c, cs ); \ - stbir__simdf_mult_mem( tot2, c, decode+8 ); + stbir__simdf_mult_mem( tot2, c, decode+8 ); #define stbir__4_coeff_continue_from_4( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -4940,13 +5119,13 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_0123to1122( c, cs ); \ stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*3+4 ); \ stbir__simdf_0123to2333( c, cs ); \ - stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*3+8 ); + stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*3+8 ); #define stbir__1_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf_load1z( c, hc + (ofs) ); \ stbir__simdf_0123to0001( c, c ); \ - stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); + stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); #define stbir__2_coeff_remnant( ofs ) \ { stbir__simdf d; \ @@ -4956,7 +5135,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); \ stbir__simdf_0123to1122( c, cs ); \ stbir__simdf_load2z( d, decode+(ofs)*3+4 ); \ - stbir__simdf_madd( tot1, tot1, c, d ); } + stbir__simdf_madd( tot1, tot1, c, d ); } #define stbir__3_coeff_remnant( ofs ) \ { stbir__simdf d; \ @@ -4968,7 +5147,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*3+4 ); \ stbir__simdf_0123to2222( c, cs ); \ stbir__simdf_load1z( d, decode+(ofs)*3+8 ); \ - stbir__simdf_madd( tot2, tot2, c, d ); } + stbir__simdf_madd( tot2, tot2, c, d ); } #define stbir__store_output() \ stbir__simdf_0123ABCDto3ABx( c, tot0, tot1 ); \ @@ -4999,7 +5178,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float c = hc[0]; \ tot0 = decode[0]*c; \ tot1 = decode[1]*c; \ - tot2 = decode[2]*c; + tot2 = decode[2]*c; #define stbir__2_coeff_only() \ float tot0, tot1, tot2, c; \ @@ -5010,7 +5189,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float c = hc[1]; \ tot0 += decode[3]*c; \ tot1 += decode[4]*c; \ - tot2 += decode[5]*c; + tot2 += decode[5]*c; #define stbir__3_coeff_only() \ float tot0, tot1, tot2, c; \ @@ -5025,7 +5204,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float c = hc[2]; \ tot0 += decode[6]*c; \ tot1 += decode[7]*c; \ - tot2 += decode[8]*c; + tot2 += decode[8]*c; #define stbir__store_output_tiny() \ output[0] = tot0; \ @@ -5052,7 +5231,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float c = hc[3]; \ totd0 = decode[9]*c; \ totd1 = decode[10]*c; \ - totd2 = decode[11]*c; + totd2 = decode[11]*c; #define stbir__4_coeff_continue_from_4( ofs ) \ c = hc[0+(ofs)]; \ @@ -5070,7 +5249,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float c = hc[3+(ofs)]; \ totd0 += decode[9+(ofs)*3]*c; \ totd1 += decode[10+(ofs)*3]*c; \ - totd2 += decode[11+(ofs)*3]*c; + totd2 += decode[11+(ofs)*3]*c; #define stbir__1_coeff_remnant( ofs ) \ c = hc[0+(ofs)]; \ @@ -5100,7 +5279,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float c = hc[2+(ofs)]; \ totc0 += decode[6+(ofs)*3]*c; \ totc1 += decode[7+(ofs)*3]*c; \ - totc2 += decode[8+(ofs)*3]*c; + totc2 += decode[8+(ofs)*3]*c; #define stbir__store_output() \ output[0] = (tota0+totc0)+(totb0+totd0); \ @@ -5110,7 +5289,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float ++horizontal_contributors; \ output += 3; -#endif +#endif #define STBIR__horizontal_channels 3 #define STB_IMAGE_RESIZE_DO_HORIZONTALS @@ -5126,7 +5305,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf_load1( c, hc ); \ stbir__simdf_0123to0000( c, c ); \ - stbir__simdf_mult_mem( tot, c, decode ); + stbir__simdf_mult_mem( tot, c, decode ); #define stbir__2_coeff_only() \ stbir__simdf tot,c,cs; \ @@ -5135,7 +5314,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_0123to0000( c, cs ); \ stbir__simdf_mult_mem( tot, c, decode ); \ stbir__simdf_0123to1111( c, cs ); \ - stbir__simdf_madd_mem( tot, tot, c, decode+4 ); + stbir__simdf_madd_mem( tot, tot, c, decode+4 ); #define stbir__3_coeff_only() \ stbir__simdf tot,c,cs; \ @@ -5146,7 +5325,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_0123to1111( c, cs ); \ stbir__simdf_madd_mem( tot, tot, c, decode+4 ); \ stbir__simdf_0123to2222( c, cs ); \ - stbir__simdf_madd_mem( tot, tot, c, decode+8 ); + stbir__simdf_madd_mem( tot, tot, c, decode+8 ); #define stbir__store_output_tiny() \ stbir__simdf_store( output, tot ); \ @@ -5163,7 +5342,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf8_0123to00001111( c, cs ); \ stbir__simdf8_mult_mem( tot0, c, decode ); \ stbir__simdf8_0123to22223333( c, cs ); \ - stbir__simdf8_madd_mem( tot0, tot0, c, decode+8 ); + stbir__simdf8_madd_mem( tot0, tot0, c, decode+8 ); #define stbir__4_coeff_continue_from_4( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -5171,26 +5350,26 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf8_0123to00001111( c, cs ); \ stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \ stbir__simdf8_0123to22223333( c, cs ); \ - stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 ); + stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 ); #define stbir__1_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf_load1rep4( t, hc + (ofs) ); \ - stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4 ); + stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4 ); #define stbir__2_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf8_load4b( cs, hc + (ofs) - 2 ); \ stbir__simdf8_0123to22223333( c, cs ); \ - stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); - + stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); + #define stbir__3_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf8_load4b( cs, hc + (ofs) ); \ stbir__simdf8_0123to00001111( c, cs ); \ stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \ stbir__simdf8_0123to2222( t, cs ); \ - stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4+8 ); + stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4+8 ); #define stbir__store_output() \ stbir__simdf8_add4halves( t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0 ); \ @@ -5199,7 +5378,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float ++horizontal_contributors; \ output += 4; -#else +#else #define stbir__4_coeff_start() \ stbir__simdf tot0,tot1,c,cs; \ @@ -5212,7 +5391,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_0123to2222( c, cs ); \ stbir__simdf_madd_mem( tot0, tot0, c, decode+8 ); \ stbir__simdf_0123to3333( c, cs ); \ - stbir__simdf_madd_mem( tot1, tot1, c, decode+12 ); + stbir__simdf_madd_mem( tot1, tot1, c, decode+12 ); #define stbir__4_coeff_continue_from_4( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -5224,13 +5403,13 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_0123to2222( c, cs ); \ stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 ); \ stbir__simdf_0123to3333( c, cs ); \ - stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+12 ); + stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+12 ); #define stbir__1_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf_load1( c, hc + (ofs) ); \ stbir__simdf_0123to0000( c, c ); \ - stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); + stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); #define stbir__2_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -5238,8 +5417,8 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_0123to0000( c, cs ); \ stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); \ stbir__simdf_0123to1111( c, cs ); \ - stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 ); - + stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 ); + #define stbir__3_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf_load( cs, hc + (ofs) ); \ @@ -5365,7 +5544,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float x0 += decode[0+(ofs)*4] * c; \ x1 += decode[1+(ofs)*4] * c; \ x2 += decode[2+(ofs)*4] * c; \ - x3 += decode[3+(ofs)*4] * c; + x3 += decode[3+(ofs)*4] * c; #define stbir__2_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -5378,8 +5557,8 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float y0 += decode[4+(ofs)*4] * c; \ y1 += decode[5+(ofs)*4] * c; \ y2 += decode[6+(ofs)*4] * c; \ - y3 += decode[7+(ofs)*4] * c; - + y3 += decode[7+(ofs)*4] * c; + #define stbir__3_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ c = hc[0+(ofs)]; \ @@ -5396,7 +5575,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float x0 += decode[8+(ofs)*4] * c; \ x1 += decode[9+(ofs)*4] * c; \ x2 += decode[10+(ofs)*4] * c; \ - x3 += decode[11+(ofs)*4] * c; + x3 += decode[11+(ofs)*4] * c; #define stbir__store_output() \ output[0] = x0 + y0; \ @@ -5407,7 +5586,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float ++horizontal_contributors; \ output += 4; -#endif +#endif #define STBIR__horizontal_channels 4 #define STB_IMAGE_RESIZE_DO_HORIZONTALS @@ -5426,7 +5605,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_load1( c, hc ); \ stbir__simdf_0123to0000( c, c ); \ stbir__simdf_mult_mem( tot0, c, decode ); \ - stbir__simdf_mult_mem( tot1, c, decode+3 ); + stbir__simdf_mult_mem( tot1, c, decode+3 ); #define stbir__2_coeff_only() \ stbir__simdf tot0,tot1,c,cs; \ @@ -5437,7 +5616,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_mult_mem( tot1, c, decode+3 ); \ stbir__simdf_0123to1111( c, cs ); \ stbir__simdf_madd_mem( tot0, tot0, c, decode+7 ); \ - stbir__simdf_madd_mem( tot1, tot1, c,decode+10 ); + stbir__simdf_madd_mem( tot1, tot1, c,decode+10 ); #define stbir__3_coeff_only() \ stbir__simdf tot0,tot1,c,cs; \ @@ -5451,7 +5630,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_madd_mem( tot1, tot1, c, decode+10 ); \ stbir__simdf_0123to2222( c, cs ); \ stbir__simdf_madd_mem( tot0, tot0, c, decode+14 ); \ - stbir__simdf_madd_mem( tot1, tot1, c, decode+17 ); + stbir__simdf_madd_mem( tot1, tot1, c, decode+17 ); #define stbir__store_output_tiny() \ stbir__simdf_store( output+3, tot1 ); \ @@ -5473,7 +5652,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf8_0123to22222222( c, cs ); \ stbir__simdf8_madd_mem( tot0, tot0, c, decode+14 ); \ stbir__simdf8_0123to33333333( c, cs ); \ - stbir__simdf8_madd_mem( tot1, tot1, c, decode+21 ); + stbir__simdf8_madd_mem( tot1, tot1, c, decode+21 ); #define stbir__4_coeff_continue_from_4( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -5485,19 +5664,19 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf8_0123to22222222( c, cs ); \ stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); \ stbir__simdf8_0123to33333333( c, cs ); \ - stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+21 ); + stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+21 ); #define stbir__1_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf8_load1b( c, hc + (ofs) ); \ - stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); + stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); #define stbir__2_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf8_load1b( c, hc + (ofs) ); \ stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 ); \ stbir__simdf8_load1b( c, hc + (ofs)+1 ); \ - stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 ); + stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 ); #define stbir__3_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -5507,7 +5686,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf8_0123to11111111( c, cs ); \ stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 ); \ stbir__simdf8_0123to22222222( c, cs ); \ - stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); + stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); #define stbir__store_output() \ stbir__simdf8_add( tot0, tot0, tot1 ); \ @@ -5540,7 +5719,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_madd_mem( tot1, tot1, c, decode+17 ); \ stbir__simdf_0123to3333( c, cs ); \ stbir__simdf_madd_mem( tot2, tot2, c, decode+21 ); \ - stbir__simdf_madd_mem( tot3, tot3, c, decode+24 ); + stbir__simdf_madd_mem( tot3, tot3, c, decode+24 ); #define stbir__4_coeff_continue_from_4( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -5556,7 +5735,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 ); \ stbir__simdf_0123to3333( c, cs ); \ stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+21 ); \ - stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+24 ); + stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+24 ); #define stbir__1_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -5573,8 +5752,8 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 ); \ stbir__simdf_0123to1111( c, cs ); \ stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 ); \ - stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 ); - + stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 ); + #define stbir__3_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ stbir__simdf_load( cs, hc + (ofs) ); \ @@ -5586,7 +5765,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 ); \ stbir__simdf_0123to2222( c, cs ); \ stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); \ - stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 ); + stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 ); #define stbir__store_output() \ stbir__simdf_add( tot0, tot0, tot2 ); \ @@ -5610,7 +5789,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float tot3 = decode[3]*c; \ tot4 = decode[4]*c; \ tot5 = decode[5]*c; \ - tot6 = decode[6]*c; + tot6 = decode[6]*c; #define stbir__2_coeff_only() \ float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \ @@ -5704,7 +5883,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float y3 += decode[24] * c; \ y4 += decode[25] * c; \ y5 += decode[26] * c; \ - y6 += decode[27] * c; + y6 += decode[27] * c; #define stbir__4_coeff_continue_from_4( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -5739,7 +5918,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float y3 += decode[24+(ofs)*7] * c; \ y4 += decode[25+(ofs)*7] * c; \ y5 += decode[26+(ofs)*7] * c; \ - y6 += decode[27+(ofs)*7] * c; + y6 += decode[27+(ofs)*7] * c; #define stbir__1_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ @@ -5770,7 +5949,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float y4 += decode[11+(ofs)*7] * c; \ y5 += decode[12+(ofs)*7] * c; \ y6 += decode[13+(ofs)*7] * c; \ - + #define stbir__3_coeff_remnant( ofs ) \ STBIR_SIMD_NO_UNROLL(decode); \ c = hc[0+(ofs)]; \ @@ -5810,7 +5989,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float ++horizontal_contributors; \ output += 7; -#endif +#endif #define STBIR__horizontal_channels 7 #define STB_IMAGE_RESIZE_DO_HORIZONTALS @@ -5937,7 +6116,7 @@ static void stbir__encode_scanline( stbir__info const * stbir_info, void *output // if we have an output callback, we first convert the decode buffer in place (and then hand that to the callback) if ( stbir_info->out_pixels_cb ) output_buffer = encode_buffer; - + STBIR_PROFILE_START( encode ); // convert into the output buffer stbir_info->encode_pixels( output_buffer, width_times_channels, encode_buffer ); @@ -5945,7 +6124,7 @@ static void stbir__encode_scanline( stbir__info const * stbir_info, void *output // if we have an output callback, call it to send the data if ( stbir_info->out_pixels_cb ) - stbir_info->out_pixels_cb( output_buffer_data, num_pixels, row, stbir_info->user_data ); + stbir_info->out_pixels_cb( output_buffer, num_pixels, row, stbir_info->user_data ); } @@ -6015,7 +6194,7 @@ static void stbir__resample_vertical_gather(stbir__info const * stbir_info, stbi stbir__resample_horizontal_gather(stbir_info, encode_buffer, decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); } - stbir__encode_scanline( stbir_info, ( (char *) stbir_info->output_data ) + ((ptrdiff_t)n * (ptrdiff_t)stbir_info->output_stride_bytes), + stbir__encode_scanline( stbir_info, ( (char *) stbir_info->output_data ) + ((size_t)n * (size_t)stbir_info->output_stride_bytes), encode_buffer, n STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); } @@ -6030,7 +6209,7 @@ static void stbir__decode_and_resample_for_vertical_gather_loop(stbir__info cons // update new end scanline split_info->ring_buffer_last_scanline = n; - // get ring buffer + // get ring buffer ring_buffer_index = (split_info->ring_buffer_begin_index + (split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline)) % stbir_info->ring_buffer_num_entries; ring_buffer = stbir__get_ring_buffer_entry(stbir_info, split_info, ring_buffer_index); @@ -6056,7 +6235,7 @@ static void stbir__vertical_gather_loop( stbir__info const * stbir_info, stbir__ // initialize the ring buffer for gathering split_info->ring_buffer_begin_index = 0; - split_info->ring_buffer_first_scanline = stbir_info->vertical.extent_info.lowest; + split_info->ring_buffer_first_scanline = vertical_contributors->n0; split_info->ring_buffer_last_scanline = split_info->ring_buffer_first_scanline - 1; // means "empty" for (y = start_output_y; y < end_output_y; y++) @@ -6080,12 +6259,12 @@ static void stbir__vertical_gather_loop( stbir__info const * stbir_info, stbir__ split_info->ring_buffer_first_scanline++; split_info->ring_buffer_begin_index++; } - + if ( stbir_info->vertical_first ) { float * ring_buffer = stbir__get_ring_buffer_scanline( stbir_info, split_info, ++split_info->ring_buffer_last_scanline ); // Decode the nth scanline from the source image into the decode buffer. - stbir__decode_scanline( stbir_info, split_info->ring_buffer_last_scanline, ring_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); + stbir__decode_scanline( stbir_info, split_info->ring_buffer_last_scanline, ring_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); } else { @@ -6108,10 +6287,10 @@ static void stbir__encode_first_scanline_from_scatter(stbir__info const * stbir_ { // evict a scanline out into the output buffer float* ring_buffer_entry = stbir__get_ring_buffer_entry(stbir_info, split_info, split_info->ring_buffer_begin_index ); - + // dump the scanline out - stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (ptrdiff_t)split_info->ring_buffer_first_scanline * (ptrdiff_t)stbir_info->output_stride_bytes ), ring_buffer_entry, split_info->ring_buffer_first_scanline STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); - + stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (size_t)split_info->ring_buffer_first_scanline * (size_t)stbir_info->output_stride_bytes ), ring_buffer_entry, split_info->ring_buffer_first_scanline STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); + // mark it as empty ring_buffer_entry[ 0 ] = STBIR__FLOAT_EMPTY_MARKER; @@ -6129,10 +6308,10 @@ static void stbir__horizontal_resample_and_encode_first_scanline_from_scatter(st // Now resample it into the buffer. stbir__resample_horizontal_gather( stbir_info, split_info->vertical_buffer, ring_buffer_entry STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); - + // dump the scanline out - stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (ptrdiff_t)split_info->ring_buffer_first_scanline * (ptrdiff_t)stbir_info->output_stride_bytes ), split_info->vertical_buffer, split_info->ring_buffer_first_scanline STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); - + stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (size_t)split_info->ring_buffer_first_scanline * (size_t)stbir_info->output_stride_bytes ), split_info->vertical_buffer, split_info->ring_buffer_first_scanline STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); + // mark it as empty ring_buffer_entry[ 0 ] = STBIR__FLOAT_EMPTY_MARKER; @@ -6172,7 +6351,7 @@ static void stbir__resample_vertical_scatter(stbir__info const * stbir_info, stb STBIR_PROFILE_END( vertical ); } -typedef void stbir__handle_scanline_for_scatter_func(stbir__info const * stbir_info, stbir__per_split_info* split_info); +typedef void stbir__handle_scanline_for_scatter_func(stbir__info const * stbir_info, stbir__per_split_info* split_info); static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir__per_split_info* split_info, int split_count ) { @@ -6193,7 +6372,7 @@ static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir_ end_input_y = split_info[split_count-1].end_input_y; // adjust for starting offset start_input_y - y = start_input_y + stbir_info->vertical.filter_pixel_margin; + y = start_input_y + stbir_info->vertical.filter_pixel_margin; vertical_contributors += y ; vertical_coefficients += stbir_info->vertical.coefficient_width * y; @@ -6240,7 +6419,7 @@ static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir_ split_info->start_input_y = y; on_first_input_y = 0; - // clip the region + // clip the region if ( out_first_scanline < start_output_y ) { vc += start_output_y - out_first_scanline; @@ -6253,11 +6432,11 @@ static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir_ // if very first scanline, init the index if (split_info->ring_buffer_begin_index < 0) split_info->ring_buffer_begin_index = out_first_scanline - start_output_y; - + STBIR_ASSERT( split_info->ring_buffer_begin_index <= out_first_scanline ); // Decode the nth scanline from the source image into the decode buffer. - stbir__decode_scanline( stbir_info, y, split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); + stbir__decode_scanline( stbir_info, y, split_info->decode_buffer STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); // When horizontal first, we resample horizontally into the vertical buffer before we scatter it out if ( !stbir_info->vertical_first ) @@ -6269,7 +6448,7 @@ static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir_ if ( ( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) == stbir_info->ring_buffer_num_entries ) && ( out_last_scanline > split_info->ring_buffer_last_scanline ) ) handle_scanline_for_scatter( stbir_info, split_info ); - + // Now the horizontal buffer is ready to write to all ring buffer rows, so do it. stbir__resample_vertical_scatter(stbir_info, split_info, out_first_scanline, out_last_scanline, vc, (float*)scanline_scatter_buffer, (float*)scanline_scatter_buffer_end ); @@ -6305,7 +6484,7 @@ static void stbir__set_sampler(stbir__sampler * samp, stbir_filter filter, stbir if (scale_info->scale >= ( 1.0f - stbir__small_float ) ) { if ( (scale_info->scale <= ( 1.0f + stbir__small_float ) ) && ( STBIR_CEILF(scale_info->pixel_shift) == scale_info->pixel_shift ) ) - filter = STBIR_FILTER_POINT_SAMPLE; + filter = STBIR_FILTER_POINT_SAMPLE; else filter = STBIR_DEFAULT_FILTER_UPSAMPLE; } @@ -6313,7 +6492,7 @@ static void stbir__set_sampler(stbir__sampler * samp, stbir_filter filter, stbir samp->filter_enum = filter; STBIR_ASSERT(samp->filter_enum != 0); - STBIR_ASSERT((unsigned)samp->filter_enum < STBIR_FILTER_OTHER); + STBIR_ASSERT((unsigned)samp->filter_enum < STBIR_FILTER_OTHER); samp->filter_kernel = stbir__builtin_kernels[ filter ]; samp->filter_support = stbir__builtin_supports[ filter ]; @@ -6339,15 +6518,31 @@ static void stbir__set_sampler(stbir__sampler * samp, stbir_filter filter, stbir // pre calculate stuff based on the above samp->coefficient_width = stbir__get_coefficient_width(samp, samp->is_gather, user_data); + // filter_pixel_width is the conservative size in pixels of input that affect an output pixel. + // In rare cases (only with 2 pix to 1 pix with the default filters), it's possible that the + // filter will extend before or after the scanline beyond just one extra entire copy of the + // scanline (we would hit the edge twice). We don't let you do that, so we clamp the total + // width to 3x the total of input pixel (once for the scanline, once for the left side + // overhang, and once for the right side). We only do this for edge mode, since the other + // modes can just re-edge clamp back in again. if ( edge == STBIR_EDGE_WRAP ) - if ( samp->filter_pixel_width > ( scale_info->input_full_size * 2 ) ) // this can only happen when shrinking to a single pixel - samp->filter_pixel_width = scale_info->input_full_size * 2; + if ( samp->filter_pixel_width > ( scale_info->input_full_size * 3 ) ) + samp->filter_pixel_width = scale_info->input_full_size * 3; // This is how much to expand buffers to account for filters seeking outside // the image boundaries. samp->filter_pixel_margin = samp->filter_pixel_width / 2; + + // filter_pixel_margin is the amount that this filter can overhang on just one side of either + // end of the scanline (left or the right). Since we only allow you to overhang 1 scanline's + // worth of pixels, we clamp this one side of overhang to the input scanline size. Again, + // this clamping only happens in rare cases with the default filters (2 pix to 1 pix). + if ( edge == STBIR_EDGE_WRAP ) + if ( samp->filter_pixel_margin > scale_info->input_full_size ) + samp->filter_pixel_margin = scale_info->input_full_size; samp->num_contributors = stbir__get_contributors(samp, samp->is_gather); + samp->contributors_size = samp->num_contributors * sizeof(stbir__contributors); samp->coefficients_size = samp->num_contributors * samp->coefficient_width * sizeof(float) + sizeof(float); // extra sizeof(float) is padding @@ -6397,8 +6592,8 @@ static void stbir__get_conservative_extents( stbir__sampler * samp, stbir__contr range->n0 = in_first_pixel; stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, (float)output_sub_size, 0, inv_scale, out_shift, input_full_size, edge ); range->n1 = in_last_pixel; - - // now go through the margin to the start of area to find bottom + + // now go through the margin to the start of area to find bottom n = range->n0 + 1; input_end = -filter_pixel_margin; while( n >= input_end ) @@ -6413,7 +6608,7 @@ static void stbir__get_conservative_extents( stbir__sampler * samp, stbir__contr --n; } - // now go through the end of the area through the margin to find top + // now go through the end of the area through the margin to find top n = range->n1 - 1; input_end = n + 1 + filter_pixel_margin; while( n <= input_end ) @@ -6462,7 +6657,7 @@ static void stbir__get_split_info( stbir__per_split_info* split_info, int splits cur = 0; for( i = 0 ; i < splits ; i++ ) { - int each; + int each; split_info[i].start_output_y = cur; each = left / ( splits - i ); split_info[i].end_output_y = cur + each; @@ -6478,7 +6673,7 @@ static void stbir__get_split_info( stbir__per_split_info* split_info, int splits static void stbir__free_internal_mem( stbir__info *info ) { #define STBIR__FREE_AND_CLEAR( ptr ) { if ( ptr ) { void * p = (ptr); (ptr) = 0; STBIR_FREE( p, info->user_data); } } - + if ( info ) { #ifndef STBIR__SEPARATE_ALLOCATIONS @@ -6496,16 +6691,16 @@ static void stbir__free_internal_mem( stbir__info *info ) for( j = 0 ; j < info->alloc_ring_buffer_num_entries ; j++ ) { #ifdef STBIR_SIMD8 - if ( info->effective_channels == 3 ) + if ( info->effective_channels == 3 ) --info->split_info[i].ring_buffers[j]; // avx in 3 channel mode needs one float at the start of the buffer - #endif + #endif STBIR__FREE_AND_CLEAR( info->split_info[i].ring_buffers[j] ); } #ifdef STBIR_SIMD8 - if ( info->effective_channels == 3 ) + if ( info->effective_channels == 3 ) --info->split_info[i].decode_buffer; // avx in 3 channel mode needs one float at the start of the buffer - #endif + #endif STBIR__FREE_AND_CLEAR( info->split_info[i].decode_buffer ); STBIR__FREE_AND_CLEAR( info->split_info[i].ring_buffers ); STBIR__FREE_AND_CLEAR( info->split_info[i].vertical_buffer ); @@ -6522,7 +6717,7 @@ static void stbir__free_internal_mem( stbir__info *info ) STBIR__FREE_AND_CLEAR( info ); #endif } - + #undef STBIR__FREE_AND_CLEAR } @@ -6534,20 +6729,20 @@ static int stbir__get_max_split( int splits, int height ) for( i = 0 ; i < splits ; i++ ) { int each = height / ( splits - i ); - if ( each > max ) + if ( each > max ) max = each; height -= each; } return max; } -static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_n_coeffs_funcs[8] = -{ +static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_n_coeffs_funcs[8] = +{ 0, stbir__horizontal_gather_1_channels_with_n_coeffs_funcs, stbir__horizontal_gather_2_channels_with_n_coeffs_funcs, stbir__horizontal_gather_3_channels_with_n_coeffs_funcs, stbir__horizontal_gather_4_channels_with_n_coeffs_funcs, 0,0, stbir__horizontal_gather_7_channels_with_n_coeffs_funcs }; -static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_channels_funcs[8] = -{ +static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_channels_funcs[8] = +{ 0, stbir__horizontal_gather_1_channels_funcs, stbir__horizontal_gather_2_channels_funcs, stbir__horizontal_gather_3_channels_funcs, stbir__horizontal_gather_4_channels_funcs, 0,0, stbir__horizontal_gather_7_channels_funcs }; @@ -6622,28 +6817,28 @@ static STBIR__V_FIRST_INFO STBIR__V_FIRST_INFO_BUFFER = {0}; #endif // Figure out whether to scale along the horizontal or vertical first. -// This only *super* important when you are scaling by a massively -// different amount in the vertical vs the horizontal (for example, if -// you are scaling by 2x in the width, and 0.5x in the height, then you -// want to do the vertical scale first, because it's around 3x faster +// This only *super* important when you are scaling by a massively +// different amount in the vertical vs the horizontal (for example, if +// you are scaling by 2x in the width, and 0.5x in the height, then you +// want to do the vertical scale first, because it's around 3x faster // in that order. // -// In more normal circumstances, this makes a 20-40% differences, so +// In more normal circumstances, this makes a 20-40% differences, so // it's good to get right, but not critical. The normal way that you -// decide which direction goes first is just figuring out which -// direction does more multiplies. But with modern CPUs with their +// decide which direction goes first is just figuring out which +// direction does more multiplies. But with modern CPUs with their // fancy caches and SIMD and high IPC abilities, so there's just a lot -// more that goes into it. +// more that goes into it. // -// My handwavy sort of solution is to have an app that does a whole +// My handwavy sort of solution is to have an app that does a whole // bunch of timing for both vertical and horizontal first modes, // and then another app that can read lots of these timing files // and try to search for the best weights to use. Dotimings.c // is the app that does a bunch of timings, and vf_train.c is the -// app that solves for the best weights (and shows how well it +// app that solves for the best weights (and shows how well it // does currently). -static int stbir__should_do_vertical_first( float weights_table[STBIR_RESIZE_CLASSIFICATIONS][4], int horizontal_filter_pixel_width, float horizontal_scale, int horizontal_output_size, int vertical_filter_pixel_width, float vertical_scale, int vertical_output_size, int is_gather, STBIR__V_FIRST_INFO * info ) +static int stbir__should_do_vertical_first( float weights_table[STBIR_RESIZE_CLASSIFICATIONS][4], int horizontal_filter_pixel_width, float horizontal_scale, int horizontal_output_size, int vertical_filter_pixel_width, float vertical_scale, int vertical_output_size, int is_gather, STBIR__V_FIRST_INFO * info ) { double v_cost, h_cost; float * weights; @@ -6655,15 +6850,15 @@ static int stbir__should_do_vertical_first( float weights_table[STBIR_RESIZE_CLA v_classification = ( vertical_output_size < horizontal_output_size ) ? 6 : 7; else if ( vertical_scale <= 1.0f ) v_classification = ( is_gather ) ? 1 : 0; - else if ( vertical_scale <= 2.0f) + else if ( vertical_scale <= 2.0f) v_classification = 2; - else if ( vertical_scale <= 3.0f) + else if ( vertical_scale <= 3.0f) v_classification = 3; - else if ( vertical_scale <= 4.0f) + else if ( vertical_scale <= 4.0f) v_classification = 5; - else + else v_classification = 6; - + // use the right weights weights = weights_table[ v_classification ]; @@ -6684,10 +6879,10 @@ static int stbir__should_do_vertical_first( float weights_table[STBIR_RESIZE_CLA info->is_gather = is_gather; } - // and this allows us to override everything for testing (see dotiming.c) - if ( ( info ) && ( info->control_v_first ) ) + // and this allows us to override everything for testing (see dotiming.c) + if ( ( info ) && ( info->control_v_first ) ) vertical_first = ( info->control_v_first == 2 ) ? 1 : 0; - + return vertical_first; } @@ -6699,9 +6894,9 @@ static unsigned char stbir__pixel_channels[] = { }; // the internal pixel layout enums are in a different order, so we can easily do range comparisons of types -// the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible +// the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible static stbir_internal_pixel_layout stbir__pixel_layout_convert_public_to_internal[] = { - STBIRI_BGR, STBIRI_1CHANNEL, STBIRI_2CHANNEL, STBIRI_RGB, STBIRI_RGBA, + STBIRI_BGR, STBIRI_1CHANNEL, STBIRI_2CHANNEL, STBIRI_RGB, STBIRI_RGBA, STBIRI_4CHANNEL, STBIRI_BGRA, STBIRI_ARGB, STBIRI_ABGR, STBIRI_RA, STBIRI_AR, STBIRI_RGBA_PM, STBIRI_BGRA_PM, STBIRI_ARGB_PM, STBIRI_ABGR_PM, STBIRI_RA_PM, STBIRI_AR_PM, }; @@ -6712,17 +6907,17 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample stbir__info * info = 0; void * alloced = 0; - int alloced_total = 0; + size_t alloced_total = 0; int vertical_first; int decode_buffer_size, ring_buffer_length_bytes, ring_buffer_size, vertical_buffer_size, alloc_ring_buffer_num_entries; int alpha_weighting_type = 0; // 0=none, 1=simple, 2=fancy - int conservative_split_output_size = stbir__get_max_split( splits, vertical->scale_info.output_sub_size ); - stbir_internal_pixel_layout input_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ input_pixel_layout_public ]; + int conservative_split_output_size = stbir__get_max_split( splits, vertical->scale_info.output_sub_size ); + stbir_internal_pixel_layout input_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ input_pixel_layout_public ]; stbir_internal_pixel_layout output_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ output_pixel_layout_public ]; - int channels = stbir__pixel_channels[ input_pixel_layout ]; + int channels = stbir__pixel_channels[ input_pixel_layout ]; int effective_channels = channels; - + // first figure out what type of alpha weighting to use (if any) if ( ( horizontal->filter_enum != STBIR_FILTER_POINT_SAMPLE ) || ( vertical->filter_enum != STBIR_FILTER_POINT_SAMPLE ) ) // no alpha weighting on point sampling { @@ -6760,11 +6955,11 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample // sometimes read one float off in some of the unrolled loops (with a weight of zero coeff, so it doesn't have an effect) decode_buffer_size = ( conservative->n1 - conservative->n0 + 1 ) * effective_channels * sizeof(float) + sizeof(float); // extra float for padding - + #if defined( STBIR__SEPARATE_ALLOCATIONS ) && defined(STBIR_SIMD8) if ( effective_channels == 3 ) decode_buffer_size += sizeof(float); // avx in 3 channel mode needs one float at the start of the buffer (only with separate allocations) -#endif +#endif ring_buffer_length_bytes = horizontal->scale_info.output_sub_size * effective_channels * sizeof(float) + sizeof(float); // extra float for padding @@ -6803,9 +6998,9 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample #define STBIR__NEXT_PTR( ptr, size, ntype ) advance_mem = (void*) ( ( ((size_t)advance_mem) + 15 ) & ~15 ); if ( alloced ) ptr = (ntype*)advance_mem; advance_mem = ((char*)advance_mem) + (size); #endif - STBIR__NEXT_PTR( info, sizeof( stbir__info ), stbir__info ); + STBIR__NEXT_PTR( info, sizeof( stbir__info ), stbir__info ); - STBIR__NEXT_PTR( info->split_info, sizeof( stbir__per_split_info ) * splits, stbir__per_split_info ); + STBIR__NEXT_PTR( info->split_info, sizeof( stbir__per_split_info ) * splits, stbir__per_split_info ); if ( info ) { @@ -6820,39 +7015,39 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample info->channels = channels; info->effective_channels = effective_channels; - + info->offset_x = new_x; info->offset_y = new_y; info->alloc_ring_buffer_num_entries = alloc_ring_buffer_num_entries; - info->ring_buffer_num_entries = 0; + info->ring_buffer_num_entries = 0; info->ring_buffer_length_bytes = ring_buffer_length_bytes; info->splits = splits; info->vertical_first = vertical_first; - info->input_pixel_layout_internal = input_pixel_layout; + info->input_pixel_layout_internal = input_pixel_layout; info->output_pixel_layout_internal = output_pixel_layout; // setup alpha weight functions info->alpha_weight = 0; info->alpha_unweight = 0; - + // handle alpha weighting functions and overrides if ( alpha_weighting_type == 2 ) { // high quality alpha multiplying on the way in, dividing on the way out - info->alpha_weight = fancy_alpha_weights[ input_pixel_layout - STBIRI_RGBA ]; + info->alpha_weight = fancy_alpha_weights[ input_pixel_layout - STBIRI_RGBA ]; info->alpha_unweight = fancy_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ]; } else if ( alpha_weighting_type == 4 ) { // fast alpha multiplying on the way in, dividing on the way out - info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ]; + info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ]; info->alpha_unweight = simple_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ]; } else if ( alpha_weighting_type == 1 ) { // fast alpha on the way in, leave in premultiplied form on way out - info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ]; + info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ]; } else if ( alpha_weighting_type == 3 ) { @@ -6871,7 +7066,7 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample info->alpha_weight = stbir__simple_flip_3ch; } - } + } // get all the per-split buffers for( i = 0 ; i < splits ; i++ ) @@ -6883,7 +7078,7 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample #ifdef STBIR_SIMD8 if ( ( info ) && ( effective_channels == 3 ) ) ++info->split_info[i].decode_buffer; // avx in 3 channel mode needs one float at the start of the buffer - #endif + #endif STBIR__NEXT_PTR( info->split_info[i].ring_buffers, alloc_ring_buffer_num_entries * sizeof(float*), float* ); { @@ -6894,7 +7089,7 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample #ifdef STBIR_SIMD8 if ( ( info ) && ( effective_channels == 3 ) ) ++info->split_info[i].ring_buffers[j]; // avx in 3 channel mode needs one float at the start of the buffer - #endif + #endif } } #else @@ -6922,10 +7117,10 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample #endif if ( temp_mem_amt >= both ) { - if ( info ) - { - vertical->gather_prescatter_contributors = (stbir__contributors*)info->split_info[0].decode_buffer; - vertical->gather_prescatter_coefficients = (float*) ( ( (char*)info->split_info[0].decode_buffer ) + vertical->gather_prescatter_contributors_size ); + if ( info ) + { + vertical->gather_prescatter_contributors = (stbir__contributors*)info->split_info[0].decode_buffer; + vertical->gather_prescatter_coefficients = (float*) ( ( (char*)info->split_info[0].decode_buffer ) + vertical->gather_prescatter_contributors_size ); } } else @@ -6948,7 +7143,7 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample if ( diff_shift < 0.0f ) diff_shift = -diff_shift; if ( ( diff_scale <= stbir__small_float ) && ( diff_shift <= stbir__small_float ) ) { - if ( horizontal->is_gather == vertical->is_gather ) + if ( horizontal->is_gather == vertical->is_gather ) { copy_horizontal = 1; goto no_vert_alloc; @@ -6975,16 +7170,16 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample // but if the number of coeffs <= 12, use another set of special cases. <=12 coeffs is any enlarging resize, or shrinking resize down to about 1/3 size if ( horizontal->extent_info.widest <= 12 ) info->horizontal_gather_channels = stbir__horizontal_gather_channels_funcs[ effective_channels ][ horizontal->extent_info.widest - 1 ]; - + info->scanline_extents.conservative.n0 = conservative->n0; info->scanline_extents.conservative.n1 = conservative->n1; - + // get exact extents stbir__get_extents( horizontal, &info->scanline_extents ); // pack the horizontal coeffs - horizontal->coefficient_width = stbir__pack_coefficients(horizontal->num_contributors, horizontal->contributors, horizontal->coefficients, horizontal->coefficient_width, horizontal->extent_info.widest, info->scanline_extents.conservative.n1 + 1 ); - + horizontal->coefficient_width = stbir__pack_coefficients(horizontal->num_contributors, horizontal->contributors, horizontal->coefficients, horizontal->coefficient_width, horizontal->extent_info.widest, info->scanline_extents.conservative.n0, info->scanline_extents.conservative.n1 ); + STBIR_MEMCPY( &info->horizontal, horizontal, sizeof( stbir__sampler ) ); STBIR_PROFILE_BUILD_END( horizontal ); @@ -7014,36 +7209,27 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample info->ring_buffer_num_entries = conservative_split_output_size; STBIR_ASSERT( info->ring_buffer_num_entries <= info->alloc_ring_buffer_num_entries ); - // a few of the horizontal gather functions read one dword past the end (but mask it out), so put in a normal value so no snans or denormals accidentally sneak in + // a few of the horizontal gather functions read past the end of the decode (but mask it out), + // so put in normal values so no snans or denormals accidentally sneak in (also, in the ring + // buffer for vertical first) for( i = 0 ; i < splits ; i++ ) { - int width, ofs; - - // find the right most span - if ( info->scanline_extents.spans[0].n1 > info->scanline_extents.spans[1].n1 ) - width = info->scanline_extents.spans[0].n1 - info->scanline_extents.spans[0].n0; - else - width = info->scanline_extents.spans[1].n1 - info->scanline_extents.spans[1].n0; - - // this calc finds the exact end of the decoded scanline for all filter modes. - // usually this is just the width * effective channels. But we have to account - // for the area to the left of the scanline for wrap filtering and alignment, this - // is stored as a negative value in info->scanline_extents.conservative.n0. Next, - // we need to skip the exact size of the right hand size filter area (again for - // wrap mode), this is in info->scanline_extents.edge_sizes[1]). - ofs = ( width + 1 - info->scanline_extents.conservative.n0 + info->scanline_extents.edge_sizes[1] ) * effective_channels; - - // place a known, but numerically valid value in the decode buffer - info->split_info[i].decode_buffer[ ofs ] = 9999.0f; + int t, ofs, start; + + ofs = decode_buffer_size / 4; + start = ofs - 4; + if ( start < 0 ) start = 0; + + for( t = start ; t < ofs; t++ ) + info->split_info[i].decode_buffer[ t ] = 9999.0f; - // if vertical filtering first, place a known, but numerically valid value in the all - // of the ring buffer accumulators if ( vertical_first ) { - int j; + int j; for( j = 0; j < info->ring_buffer_num_entries ; j++ ) { - stbir__get_ring_buffer_entry( info, info->split_info + i, j )[ ofs ] = 9999.0f; + for( t = start ; t < ofs; t++ ) + stbir__get_ring_buffer_entry( info, info->split_info + i, j )[ t ] = 9999.0f; } } } @@ -7055,7 +7241,7 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample // is this the first time through loop? if ( info == 0 ) { - alloced_total = (int) ( 15 + (size_t)advance_mem ); + alloced_total = ( 15 + (size_t)advance_mem ); alloced = STBIR_MALLOC( alloced_total, user_data ); if ( alloced == 0 ) return 0; @@ -7065,7 +7251,7 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample } } -static int stbir__perform_resize( stbir__info const * info, int split_start, int split_count ) +static int stbir__perform_resize( stbir__info const * info, int split_start, int split_count ) { stbir__per_split_info * split_info = info->split_info + split_start; @@ -7085,7 +7271,7 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r { static stbir__decode_pixels_func * decode_simple[STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]= { - /* 1ch-4ch */ stbir__decode_uint8_srgb, stbir__decode_uint8_srgb, 0, stbir__decode_float_linear, stbir__decode_half_float_linear, + /* 1ch-4ch */ stbir__decode_uint8_srgb, stbir__decode_uint8_srgb, 0, stbir__decode_float_linear, stbir__decode_half_float_linear, }; static stbir__decode_pixels_func * decode_alphas[STBIRI_AR-STBIRI_RGBA+1][STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]= @@ -7148,7 +7334,7 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r stbir_datatype input_type, output_type; input_type = resize->input_data_type; - output_type = resize->output_data_type; + output_type = resize->output_data_type; info->input_data = resize->input_pixels; info->input_stride_bytes = resize->input_stride_in_bytes; info->output_stride_bytes = resize->output_stride_in_bytes; @@ -7156,7 +7342,7 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r // if we're completely point sampling, then we can turn off SRGB if ( ( info->horizontal.filter_enum == STBIR_FILTER_POINT_SAMPLE ) && ( info->vertical.filter_enum == STBIR_FILTER_POINT_SAMPLE ) ) { - if ( ( ( input_type == STBIR_TYPE_UINT8_SRGB ) || ( input_type == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) && + if ( ( ( input_type == STBIR_TYPE_UINT8_SRGB ) || ( input_type == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) && ( ( output_type == STBIR_TYPE_UINT8_SRGB ) || ( output_type == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) ) { input_type = STBIR_TYPE_UINT8; @@ -7164,7 +7350,7 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r } } - // recalc the output and input strides + // recalc the output and input strides if ( info->input_stride_bytes == 0 ) info->input_stride_bytes = info->channels * info->horizontal.scale_info.input_full_size * stbir__type_size[input_type]; @@ -7172,7 +7358,7 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r info->output_stride_bytes = info->channels * info->horizontal.scale_info.output_sub_size * stbir__type_size[output_type]; // calc offset - info->output_data = ( (char*) resize->output_pixels ) + ( (ptrdiff_t) info->offset_y * (ptrdiff_t) resize->output_stride_in_bytes ) + ( info->offset_x * info->channels * stbir__type_size[output_type] ); + info->output_data = ( (char*) resize->output_pixels ) + ( (size_t) info->offset_y * (size_t) resize->output_stride_in_bytes ) + ( info->offset_x * info->channels * stbir__type_size[output_type] ); info->in_pixels_cb = resize->input_cb; info->user_data = resize->user_data; @@ -7205,7 +7391,7 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r if ( ( output_type == STBIR_TYPE_UINT8 ) || ( output_type == STBIR_TYPE_UINT16 ) ) { int non_scaled = 0; - + // check if we can run unscaled - 0-255.0/0-65535.0 instead of 0-1.0 (which is a tiny bit faster when doing linear 8->8 or 16->16) if ( ( !info->alpha_weight ) && ( !info->alpha_unweight ) ) // don't short circuit when alpha weighting (get everything to 0-1.0 as usual) if ( ( ( input_type == STBIR_TYPE_UINT8 ) && ( output_type == STBIR_TYPE_UINT8 ) ) || ( ( input_type == STBIR_TYPE_UINT16 ) && ( output_type == STBIR_TYPE_UINT16 ) ) ) @@ -7225,16 +7411,16 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r } info->input_type = input_type; - info->output_type = output_type; + info->output_type = output_type; info->decode_pixels = decode_pixels; - info->encode_pixels = encode_pixels; + info->encode_pixels = encode_pixels; } static void stbir__clip( int * outx, int * outsubw, int outw, double * u0, double * u1 ) { double per, adj; int over; - + // do left/top edge if ( *outx < 0 ) { @@ -7253,7 +7439,7 @@ static void stbir__clip( int * outx, int * outsubw, int outw, double * u0, doubl *u1 += adj; // decrease u1 *outsubw = outw - *outx; } -} +} // converts a double to a rational that has less than one float bit of error (returns 0 if unable to do so) static int stbir__double_to_rational(double f, stbir_uint32 limit, stbir_uint32 *numer, stbir_uint32 *denom, int limit_denom ) // limit_denom (1) or limit numer (0) @@ -7270,7 +7456,7 @@ static int stbir__double_to_rational(double f, stbir_uint32 limit, stbir_uint32 bot = 1 << 25; // keep refining, but usually stops in a few loops - usually 5 for bad cases - for(;;) + for(;;) { stbir_uint64 est, temp; @@ -7303,13 +7489,13 @@ static int stbir__double_to_rational(double f, stbir_uint32 limit, stbir_uint32 bot = temp; // move remainders - temp = est * denom_estimate + denom_last; - denom_last = denom_estimate; + temp = est * denom_estimate + denom_last; + denom_last = denom_estimate; denom_estimate = temp; // move remainders - temp = est * numer_estimate + numer_last; - numer_last = numer_estimate; + temp = est * numer_estimate + numer_last; + numer_last = numer_estimate; numer_estimate = temp; } @@ -7353,11 +7539,11 @@ static int stbir__calculate_region_transform( stbir__scale_info * scale_info, in output_s = ( (double)output_sub_range) / output_range; - // figure out the scaling to use - ratio = output_s / input_s; + // figure out the scaling to use + ratio = output_s / input_s; // save scale before clipping - scale = ( output_range / input_range ) * ratio; + scale = ( output_range / input_range ) * ratio; scale_info->scale = (float)scale; scale_info->inv_scale = (float)( 1.0 / scale ); @@ -7368,11 +7554,11 @@ static int stbir__calculate_region_transform( stbir__scale_info * scale_info, in input_s = input_s1 - input_s0; // after clipping do we have zero input area? - if ( input_s <= stbir__small_float ) + if ( input_s <= stbir__small_float ) return 0; - // calculate and store the starting source offsets in output pixel space - scale_info->pixel_shift = (float) ( input_s0 * ratio * output_range ); + // calculate and store the starting source offsets in output pixel space + scale_info->pixel_shift = (float) ( input_s0 * ratio * output_range ); scale_info->scale_is_rational = stbir__double_to_rational( scale, ( scale <= 1.0 ) ? output_full_range : input_full_range, &scale_info->scale_numerator, &scale_info->scale_denominator, ( scale >= 1.0 ) ); @@ -7389,7 +7575,6 @@ static void stbir__init_and_set_layout( STBIR_RESIZE * resize, stbir_pixel_layou resize->output_cb = 0; resize->user_data = resize; resize->samplers = 0; - resize->needs_rebuild = 1; resize->called_alloc = 0; resize->horizontal_filter = STBIR_FILTER_DEFAULT; resize->horizontal_filter_kernel = 0; resize->horizontal_filter_support = 0; @@ -7403,9 +7588,10 @@ static void stbir__init_and_set_layout( STBIR_RESIZE * resize, stbir_pixel_layou resize->output_data_type = data_type; resize->input_pixel_layout_public = pixel_layout; resize->output_pixel_layout_public = pixel_layout; + resize->needs_rebuild = 1; } -STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize, +STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize, const void *input_pixels, int input_w, int input_h, int input_stride_in_bytes, // stride can be zero void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, // stride can be zero stbir_pixel_layout pixel_layout, stbir_datatype data_type ) @@ -7428,17 +7614,27 @@ STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_t { resize->input_data_type = input_type; resize->output_data_type = output_type; + if ( ( resize->samplers ) && ( !resize->needs_rebuild ) ) + stbir__update_info_from_resize( resize->samplers, resize ); } STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb ) // no callbacks by default { resize->input_cb = input_cb; resize->output_cb = output_cb; + + if ( ( resize->samplers ) && ( !resize->needs_rebuild ) ) + { + resize->samplers->in_pixels_cb = input_cb; + resize->samplers->out_pixels_cb = output_cb; + } } STBIRDEF void stbir_set_user_data( STBIR_RESIZE * resize, void * user_data ) // pass back STBIR_RESIZE* by default { resize->user_data = user_data; + if ( ( resize->samplers ) && ( !resize->needs_rebuild ) ) + resize->samplers->user_data = user_data; } STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_pixels, int input_stride_in_bytes, void * output_pixels, int output_stride_in_bytes ) @@ -7447,6 +7643,8 @@ STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_p resize->input_stride_in_bytes = input_stride_in_bytes; resize->output_pixels = output_pixels; resize->output_stride_in_bytes = output_stride_in_bytes; + if ( ( resize->samplers ) && ( !resize->needs_rebuild ) ) + stbir__update_info_from_resize( resize->samplers, resize ); } @@ -7549,7 +7747,7 @@ STBIRDEF int stbir_set_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, return 1; } -static int stbir__perform_build( STBIR_RESIZE * resize, int splits ) +static int stbir__perform_build( STBIR_RESIZE * resize, int splits ) { stbir__contributors conservative = { 0, 0 }; stbir__sampler horizontal, vertical; @@ -7563,13 +7761,13 @@ static int stbir__perform_build( STBIR_RESIZE * resize, int splits ) // have we already built the samplers? if ( resize->samplers ) return 0; - + #define STBIR_RETURN_ERROR_AND_ASSERT( exp ) STBIR_ASSERT( !(exp) ); if (exp) return 0; STBIR_RETURN_ERROR_AND_ASSERT( (unsigned)resize->horizontal_filter >= STBIR_FILTER_OTHER) STBIR_RETURN_ERROR_AND_ASSERT( (unsigned)resize->vertical_filter >= STBIR_FILTER_OTHER) #undef STBIR_RETURN_ERROR_AND_ASSERT - if ( splits <= 0 ) + if ( splits <= 0 ) return 0; STBIR_PROFILE_BUILD_FIRST_START( build ); @@ -7593,9 +7791,9 @@ static int stbir__perform_build( STBIR_RESIZE * resize, int splits ) stbir__get_conservative_extents( &horizontal, &conservative, resize->user_data ); stbir__set_sampler(&vertical, resize->vertical_filter, resize->horizontal_filter_kernel, resize->vertical_filter_support, resize->vertical_edge, &vertical.scale_info, 0, resize->user_data ); - if ( ( vertical.scale_info.output_sub_size / splits ) < 4 ) // each split should be a minimum of 4 scanlines (handwavey choice) + if ( ( vertical.scale_info.output_sub_size / splits ) < STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS ) // each split should be a minimum of 4 scanlines (handwavey choice) { - splits = vertical.scale_info.output_sub_size / 4; + splits = vertical.scale_info.output_sub_size / STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS; if ( splits == 0 ) splits = 1; } @@ -7603,7 +7801,7 @@ static int stbir__perform_build( STBIR_RESIZE * resize, int splits ) out_info = stbir__alloc_internal_mem_and_build_samplers( &horizontal, &vertical, &conservative, resize->input_pixel_layout_public, resize->output_pixel_layout_public, splits, new_output_subx, new_output_suby, resize->fast_alpha, resize->user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO ); STBIR_PROFILE_BUILD_END( alloc ); STBIR_PROFILE_BUILD_END( build ); - + if ( out_info ) { resize->splits = splits; @@ -7612,6 +7810,10 @@ static int stbir__perform_build( STBIR_RESIZE * resize, int splits ) #ifdef STBIR_PROFILE STBIR_MEMCPY( &out_info->profile, &profile_infod.profile, sizeof( out_info->profile ) ); #endif + + // update anything that can be changed without recalcing samplers + stbir__update_info_from_resize( out_info, resize ); + return splits; } @@ -7640,7 +7842,7 @@ STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int splits } STBIR_PROFILE_BUILD_CLEAR( resize->samplers ); - + return 1; } @@ -7652,7 +7854,7 @@ STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize ) STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize ) { int result; - + if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) ) { int alloc_state = resize->called_alloc; // remember allocated state @@ -7665,10 +7867,10 @@ STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize ) if ( !stbir_build_samplers( resize ) ) return 0; - + resize->called_alloc = alloc_state; - // if build_samplers succeeded (above), but there are no samplers set, then + // if build_samplers succeeded (above), but there are no samplers set, then // the area to stretch into was zero pixels, so don't do anything and return // success if ( resize->samplers == 0 ) @@ -7680,10 +7882,6 @@ STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize ) STBIR_PROFILE_BUILD_CLEAR( resize->samplers ); } - - // update anything that can be changed without recalcing samplers - stbir__update_info_from_resize( resize->samplers, resize ); - // do resize result = stbir__perform_resize( resize->samplers, 0, resize->splits ); @@ -7692,7 +7890,7 @@ STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize ) { stbir_free_samplers( resize ); resize->samplers = 0; - } + } return result; } @@ -7707,14 +7905,11 @@ STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start // you **must** build samplers first when using split resize if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) ) - return 0; - + return 0; + if ( ( split_start >= resize->splits ) || ( split_start < 0 ) || ( ( split_start + split_count ) > resize->splits ) || ( split_count <= 0 ) ) return 0; - - // update anything that can be changed without recalcing samplers - stbir__update_info_from_resize( resize->samplers, resize ); - + // do resize return stbir__perform_resize( resize->samplers, split_start, split_count ); } @@ -7735,7 +7930,7 @@ static int stbir__check_output_stuff( void ** ret_ptr, int * ret_pitch, void * o if ( output_stride_in_bytes < pitch ) return 0; - size = output_stride_in_bytes * output_h; + size = (size_t)output_stride_in_bytes * (size_t)output_h; if ( size == 0 ) return 0; @@ -7752,7 +7947,7 @@ static int stbir__check_output_stuff( void ** ret_ptr, int * ret_pitch, void * o *ret_pitch = pitch; } - return 1; + return 1; } @@ -7767,9 +7962,9 @@ STBIRDEF unsigned char * stbir_resize_uint8_linear( const unsigned char *input_p if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( unsigned char ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) ) return 0; - stbir_resize_init( &resize, - input_pixels, input_w, input_h, input_stride_in_bytes, - (optr) ? optr : output_pixels, output_w, output_h, opitch, + stbir_resize_init( &resize, + input_pixels, input_w, input_h, input_stride_in_bytes, + (optr) ? optr : output_pixels, output_w, output_h, opitch, pixel_layout, STBIR_TYPE_UINT8 ); if ( !stbir_resize_extended( &resize ) ) @@ -7793,9 +7988,9 @@ STBIRDEF unsigned char * stbir_resize_uint8_srgb( const unsigned char *input_pix if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( unsigned char ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) ) return 0; - stbir_resize_init( &resize, - input_pixels, input_w, input_h, input_stride_in_bytes, - (optr) ? optr : output_pixels, output_w, output_h, opitch, + stbir_resize_init( &resize, + input_pixels, input_w, input_h, input_stride_in_bytes, + (optr) ? optr : output_pixels, output_w, output_h, opitch, pixel_layout, STBIR_TYPE_UINT8_SRGB ); if ( !stbir_resize_extended( &resize ) ) @@ -7820,9 +8015,9 @@ STBIRDEF float * stbir_resize_float_linear( const float *input_pixels , int inpu if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( float ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) ) return 0; - stbir_resize_init( &resize, - input_pixels, input_w, input_h, input_stride_in_bytes, - (optr) ? optr : output_pixels, output_w, output_h, opitch, + stbir_resize_init( &resize, + input_pixels, input_w, input_h, input_stride_in_bytes, + (optr) ? optr : output_pixels, output_w, output_h, opitch, pixel_layout, STBIR_TYPE_FLOAT ); if ( !stbir_resize_extended( &resize ) ) @@ -7838,7 +8033,7 @@ STBIRDEF float * stbir_resize_float_linear( const float *input_pixels , int inpu STBIRDEF void * stbir_resize( const void *input_pixels , int input_w , int input_h, int input_stride_in_bytes, void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, - stbir_pixel_layout pixel_layout, stbir_datatype data_type, + stbir_pixel_layout pixel_layout, stbir_datatype data_type, stbir_edge edge, stbir_filter filter ) { STBIR_RESIZE resize; @@ -7848,9 +8043,9 @@ STBIRDEF void * stbir_resize( const void *input_pixels , int input_w , int input if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, stbir__type_size[data_type], output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) ) return 0; - stbir_resize_init( &resize, - input_pixels, input_w, input_h, input_stride_in_bytes, - (optr) ? optr : output_pixels, output_w, output_h, output_stride_in_bytes, + stbir_resize_init( &resize, + input_pixels, input_w, input_h, input_stride_in_bytes, + (optr) ? optr : output_pixels, output_w, output_h, output_stride_in_bytes, pixel_layout, data_type ); resize.horizontal_edge = edge; @@ -7958,7 +8153,7 @@ STBIRDEF void stbir_resize_extended_profile_info( STBIR_PROFILE_INFO * info, STB #else // STB_IMAGE_RESIZE_HORIZONTALS&STB_IMAGE_RESIZE_DO_VERTICALS // we reinclude the header file to define all the horizontal functions -// specializing each function for the number of coeffs is 20-40% faster *OVERALL* +// specializing each function for the number of coeffs is 20-40% faster *OVERALL* // by including the header file again this way, we can still debug the functions @@ -7991,16 +8186,16 @@ STBIRDEF void stbir_resize_extended_profile_info( STBIR_PROFILE_INFO * info, STB #define stbir__encode_order2 2 #define stbir__encode_order3 3 #define stbir__decode_simdf8_flip(reg) -#define stbir__decode_simdf4_flip(reg) +#define stbir__decode_simdf4_flip(reg) #define stbir__encode_simdf8_unflip(reg) -#define stbir__encode_simdf4_unflip(reg) +#define stbir__encode_simdf4_unflip(reg) #endif #ifdef STBIR_SIMD8 #define stbir__encode_simdfX_unflip stbir__encode_simdf8_unflip #else #define stbir__encode_simdfX_unflip stbir__encode_simdf4_unflip -#endif +#endif static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * decodep, int width_times_channels, void const * inputp ) { @@ -8013,6 +8208,7 @@ static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * deco if ( width_times_channels >= 16 ) { decode_end -= 16; + STBIR_NO_UNROLL_LOOP_START_INF_FOR for(;;) { #ifdef STBIR_SIMD8 @@ -8054,7 +8250,7 @@ static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * deco #endif decode += 16; input += 16; - if ( decode <= decode_end ) + if ( decode <= decode_end ) continue; if ( decode == ( decode_end + 16 ) ) break; @@ -8068,6 +8264,7 @@ static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * deco // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four decode += 4; + STBIR_SIMD_NO_UNROLL_LOOP_START while( decode <= decode_end ) { STBIR_SIMD_NO_UNROLL(decode); @@ -8083,6 +8280,7 @@ static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * deco // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( decode < decode_end ) { STBIR_NO_UNROLL(decode); @@ -8109,6 +8307,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outpu { float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2; end_output -= stbir__simdfX_float_count*2; + STBIR_NO_UNROLL_LOOP_START_INF_FOR for(;;) { stbir__simdfX e0, e1; @@ -8119,15 +8318,15 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outpu stbir__encode_simdfX_unflip( e0 ); stbir__encode_simdfX_unflip( e1 ); #ifdef STBIR_SIMD8 - stbir__simdf8_pack_to_16bytes( i, e0, e1 ); + stbir__simdf8_pack_to_16bytes( i, e0, e1 ); stbir__simdi_store( output, i ); #else - stbir__simdf_pack_to_8bytes( i, e0, e1 ); + stbir__simdf_pack_to_8bytes( i, e0, e1 ); stbir__simdi_store2( output, i ); #endif encode += stbir__simdfX_float_count*2; output += stbir__simdfX_float_count*2; - if ( output <= end_output ) + if ( output <= end_output ) continue; if ( output == ( end_output + stbir__simdfX_float_count*2 ) ) break; @@ -8140,6 +8339,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outpu // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four output += 4; + STBIR_NO_UNROLL_LOOP_START while( output <= end_output ) { stbir__simdf e0; @@ -8158,9 +8358,10 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outpu // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( output < end_output ) { - stbir__simdf e0; + stbir__simdf e0; STBIR_NO_UNROLL(encode); stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order0 ); output[0] = stbir__simdf_convert_float_to_uint8( e0 ); #if stbir__coder_min_num >= 2 @@ -8173,7 +8374,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outpu encode += stbir__coder_min_num; } #endif - + #else // try to do blocks of 4 when you can @@ -8194,6 +8395,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outpu // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( output < end_output ) { float f; @@ -8223,6 +8425,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_linear)( float * decodep, int if ( width_times_channels >= 16 ) { decode_end -= 16; + STBIR_NO_UNROLL_LOOP_START_INF_FOR for(;;) { #ifdef STBIR_SIMD8 @@ -8258,7 +8461,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_linear)( float * decodep, int #endif decode += 16; input += 16; - if ( decode <= decode_end ) + if ( decode <= decode_end ) continue; if ( decode == ( decode_end + 16 ) ) break; @@ -8272,6 +8475,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_linear)( float * decodep, int // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four decode += 4; + STBIR_SIMD_NO_UNROLL_LOOP_START while( decode <= decode_end ) { STBIR_SIMD_NO_UNROLL(decode); @@ -8287,6 +8491,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_linear)( float * decodep, int // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( decode < decode_end ) { STBIR_NO_UNROLL(decode); @@ -8313,6 +8518,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear )( void * outputp, int { float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2; end_output -= stbir__simdfX_float_count*2; + STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR for(;;) { stbir__simdfX e0, e1; @@ -8323,15 +8529,15 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear )( void * outputp, int stbir__encode_simdfX_unflip( e0 ); stbir__encode_simdfX_unflip( e1 ); #ifdef STBIR_SIMD8 - stbir__simdf8_pack_to_16bytes( i, e0, e1 ); + stbir__simdf8_pack_to_16bytes( i, e0, e1 ); stbir__simdi_store( output, i ); #else - stbir__simdf_pack_to_8bytes( i, e0, e1 ); + stbir__simdf_pack_to_8bytes( i, e0, e1 ); stbir__simdi_store2( output, i ); #endif encode += stbir__simdfX_float_count*2; output += stbir__simdfX_float_count*2; - if ( output <= end_output ) + if ( output <= end_output ) continue; if ( output == ( end_output + stbir__simdfX_float_count*2 ) ) break; @@ -8344,6 +8550,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear )( void * outputp, int // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four output += 4; + STBIR_NO_UNROLL_LOOP_START while( output <= end_output ) { stbir__simdf e0; @@ -8382,6 +8589,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear )( void * outputp, int // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( output < end_output ) { float f; @@ -8422,6 +8630,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_srgb)( float * decodep, int wi // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( decode < decode_end ) { STBIR_NO_UNROLL(decode); @@ -8441,7 +8650,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_srgb)( float * decodep, int wi #define stbir__min_max_shift20( i, f ) \ stbir__simdf_max( f, f, stbir_simdf_casti(STBIR__CONSTI( STBIR_almost_zero )) ); \ stbir__simdf_min( f, f, stbir_simdf_casti(STBIR__CONSTI( STBIR_almost_one )) ); \ - stbir__simdi_32shr( i, stbir_simdi_castf( f ), 20 ); + stbir__simdi_32shr( i, stbir_simdi_castf( f ), 20 ); #define stbir__scale_and_convert( i, f ) \ stbir__simdf_madd( f, STBIR__CONSTF( STBIR_simd_point5 ), STBIR__CONSTF( STBIR_max_uint8_as_float ), f ); \ @@ -8468,7 +8677,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_srgb)( float * decodep, int wi temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \ v0 = temp0.m128i_i128; \ v1 = temp1.m128i_i128; \ -} +} #define stbir__simdi_table_lookup3( v0,v1,v2, table ) \ { \ @@ -8499,7 +8708,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_srgb)( float * decodep, int wi v1 = temp1.m128i_i128; \ v2 = temp2.m128i_i128; \ v3 = temp3.m128i_i128; \ -} +} static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int width_times_channels, float const * encode ) { @@ -8507,16 +8716,16 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int w unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels; #ifdef STBIR_SIMD - stbir_uint32 const * to_srgb = fp32_to_srgb8_tab4 - (127-13)*8; if ( width_times_channels >= 16 ) { float const * end_encode_m16 = encode + width_times_channels - 16; end_output -= 16; + STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR for(;;) { stbir__simdf f0, f1, f2, f3; - stbir__simdi i0, i1, i2, i3; + stbir__simdi i0, i1, i2, i3; STBIR_SIMD_NO_UNROLL(encode); stbir__simdf_load4_transposed( f0, f1, f2, f3, encode ); @@ -8525,9 +8734,9 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int w stbir__min_max_shift20( i1, f1 ); stbir__min_max_shift20( i2, f2 ); stbir__min_max_shift20( i3, f3 ); - - stbir__simdi_table_lookup4( i0, i1, i2, i3, to_srgb ); - + + stbir__simdi_table_lookup4( i0, i1, i2, i3, ( fp32_to_srgb8_tab4 - (127-13)*8 ) ); + stbir__linear_to_srgb_finish( i0, f0 ); stbir__linear_to_srgb_finish( i1, f1 ); stbir__linear_to_srgb_finish( i2, f2 ); @@ -8537,7 +8746,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int w encode += 16; output += 16; - if ( output <= end_output ) + if ( output <= end_output ) continue; if ( output == ( end_output + 16 ) ) break; @@ -8551,6 +8760,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int w // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four output += 4; + STBIR_SIMD_NO_UNROLL_LOOP_START while ( output <= end_output ) { STBIR_SIMD_NO_UNROLL(encode); @@ -8568,7 +8778,8 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int w // do the remnants #if stbir__coder_min_num < 4 - while( output < end_output ) + STBIR_NO_UNROLL_LOOP_START + while( output < end_output ) { STBIR_NO_UNROLL(encode); output[0] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order0] ); @@ -8608,12 +8819,12 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb4_linearalpha )( void * o unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels; #ifdef STBIR_SIMD - stbir_uint32 const * to_srgb = fp32_to_srgb8_tab4 - (127-13)*8; if ( width_times_channels >= 16 ) { float const * end_encode_m16 = encode + width_times_channels - 16; end_output -= 16; + STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR for(;;) { stbir__simdf f0, f1, f2, f3; @@ -8625,10 +8836,10 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb4_linearalpha )( void * o stbir__min_max_shift20( i0, f0 ); stbir__min_max_shift20( i1, f1 ); stbir__min_max_shift20( i2, f2 ); - stbir__scale_and_convert( i3, f3 ); - - stbir__simdi_table_lookup3( i0, i1, i2, to_srgb ); - + stbir__scale_and_convert( i3, f3 ); + + stbir__simdi_table_lookup3( i0, i1, i2, ( fp32_to_srgb8_tab4 - (127-13)*8 ) ); + stbir__linear_to_srgb_finish( i0, f0 ); stbir__linear_to_srgb_finish( i1, f1 ); stbir__linear_to_srgb_finish( i2, f2 ); @@ -8638,7 +8849,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb4_linearalpha )( void * o output += 16; encode += 16; - if ( output <= end_output ) + if ( output <= end_output ) continue; if ( output == ( end_output + 16 ) ) break; @@ -8649,9 +8860,10 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb4_linearalpha )( void * o } #endif + STBIR_SIMD_NO_UNROLL_LOOP_START do { float f; - STBIR_SIMD_NO_UNROLL(encode); + STBIR_SIMD_NO_UNROLL(encode); output[stbir__decode_order0] = stbir__linear_to_srgb_uchar( encode[0] ); output[stbir__decode_order1] = stbir__linear_to_srgb_uchar( encode[1] ); @@ -8686,7 +8898,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_srgb2_linearalpha)( float * de decode += 4; } decode -= 4; - if( decode < decode_end ) + if( decode < decode_end ) { decode[0] = stbir__srgb_uchar_to_linear_float[ stbir__decode_order0 ]; decode[1] = ( (float) input[stbir__decode_order1] ) * stbir__max_uint8_as_float_inverted; @@ -8699,16 +8911,16 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb2_linearalpha )( void * o unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels; #ifdef STBIR_SIMD - stbir_uint32 const * to_srgb = fp32_to_srgb8_tab4 - (127-13)*8; if ( width_times_channels >= 16 ) { float const * end_encode_m16 = encode + width_times_channels - 16; end_output -= 16; + STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR for(;;) { stbir__simdf f0, f1, f2, f3; - stbir__simdi i0, i1, i2, i3; + stbir__simdi i0, i1, i2, i3; STBIR_SIMD_NO_UNROLL(encode); stbir__simdf_load4_transposed( f0, f1, f2, f3, encode ); @@ -8717,9 +8929,9 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb2_linearalpha )( void * o stbir__scale_and_convert( i1, f1 ); stbir__min_max_shift20( i2, f2 ); stbir__scale_and_convert( i3, f3 ); - - stbir__simdi_table_lookup2( i0, i2, to_srgb ); - + + stbir__simdi_table_lookup2( i0, i2, ( fp32_to_srgb8_tab4 - (127-13)*8 ) ); + stbir__linear_to_srgb_finish( i0, f0 ); stbir__linear_to_srgb_finish( i2, f2 ); @@ -8727,7 +8939,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb2_linearalpha )( void * o output += 16; encode += 16; - if ( output <= end_output ) + if ( output <= end_output ) continue; if ( output == ( end_output + 16 ) ) break; @@ -8738,6 +8950,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb2_linearalpha )( void * o } #endif + STBIR_SIMD_NO_UNROLL_LOOP_START do { float f; STBIR_SIMD_NO_UNROLL(encode); @@ -8766,6 +8979,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)( float * decod if ( width_times_channels >= 8 ) { decode_end -= 8; + STBIR_NO_UNROLL_LOOP_START_INF_FOR for(;;) { #ifdef STBIR_SIMD8 @@ -8793,9 +9007,9 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)( float * decod stbir__simdf_store( decode + 0, of0 ); stbir__simdf_store( decode + 4, of1 ); #endif - decode += 8; + decode += 8; input += 8; - if ( decode <= decode_end ) + if ( decode <= decode_end ) continue; if ( decode == ( decode_end + 8 ) ) break; @@ -8809,6 +9023,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)( float * decod // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four decode += 4; + STBIR_SIMD_NO_UNROLL_LOOP_START while( decode <= decode_end ) { STBIR_SIMD_NO_UNROLL(decode); @@ -8824,6 +9039,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)( float * decod // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( decode < decode_end ) { STBIR_NO_UNROLL(decode); @@ -8852,6 +9068,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * output { float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2; end_output -= stbir__simdfX_float_count*2; + STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR for(;;) { stbir__simdfX e0, e1; @@ -8865,7 +9082,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * output stbir__simdiX_store( output, i ); encode += stbir__simdfX_float_count*2; output += stbir__simdfX_float_count*2; - if ( output <= end_output ) + if ( output <= end_output ) continue; if ( output == ( end_output + stbir__simdfX_float_count*2 ) ) break; @@ -8879,6 +9096,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * output // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four output += 4; + STBIR_NO_UNROLL_LOOP_START while( output <= end_output ) { stbir__simdf e; @@ -8897,6 +9115,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * output // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( output < end_output ) { stbir__simdf e; @@ -8912,12 +9131,13 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * output encode += stbir__coder_min_num; } #endif - + #else // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four output += 4; + STBIR_SIMD_NO_UNROLL_LOOP_START while( output <= end_output ) { float f; @@ -8934,6 +9154,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * output // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( output < end_output ) { float f; @@ -8963,6 +9184,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear)( float * decodep, int if ( width_times_channels >= 8 ) { decode_end -= 8; + STBIR_NO_UNROLL_LOOP_START_INF_FOR for(;;) { #ifdef STBIR_SIMD8 @@ -8989,7 +9211,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear)( float * decodep, int #endif decode += 8; input += 8; - if ( decode <= decode_end ) + if ( decode <= decode_end ) continue; if ( decode == ( decode_end + 8 ) ) break; @@ -9003,6 +9225,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear)( float * decodep, int // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four decode += 4; + STBIR_SIMD_NO_UNROLL_LOOP_START while( decode <= decode_end ) { STBIR_SIMD_NO_UNROLL(decode); @@ -9018,6 +9241,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear)( float * decodep, int // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( decode < decode_end ) { STBIR_NO_UNROLL(decode); @@ -9045,6 +9269,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int { float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2; end_output -= stbir__simdfX_float_count*2; + STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR for(;;) { stbir__simdfX e0, e1; @@ -9058,7 +9283,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int stbir__simdiX_store( output, i ); encode += stbir__simdfX_float_count*2; output += stbir__simdfX_float_count*2; - if ( output <= end_output ) + if ( output <= end_output ) continue; if ( output == ( end_output + stbir__simdfX_float_count*2 ) ) break; @@ -9072,6 +9297,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four output += 4; + STBIR_NO_UNROLL_LOOP_START while( output <= end_output ) { stbir__simdf e; @@ -9093,6 +9319,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four output += 4; + STBIR_SIMD_NO_UNROLL_LOOP_START while( output <= end_output ) { float f; @@ -9111,6 +9338,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( output < end_output ) { float f; @@ -9139,6 +9367,7 @@ static void STBIR__CODER_NAME(stbir__decode_half_float_linear)( float * decodep, { stbir__FP16 const * end_input_m8 = input + width_times_channels - 8; decode_end -= 8; + STBIR_NO_UNROLL_LOOP_START_INF_FOR for(;;) { STBIR_NO_UNROLL(decode); @@ -9166,7 +9395,7 @@ static void STBIR__CODER_NAME(stbir__decode_half_float_linear)( float * decodep, #endif decode += 8; input += 8; - if ( decode <= decode_end ) + if ( decode <= decode_end ) continue; if ( decode == ( decode_end + 8 ) ) break; @@ -9180,6 +9409,7 @@ static void STBIR__CODER_NAME(stbir__decode_half_float_linear)( float * decodep, // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four decode += 4; + STBIR_SIMD_NO_UNROLL_LOOP_START while( decode <= decode_end ) { STBIR_SIMD_NO_UNROLL(decode); @@ -9195,6 +9425,7 @@ static void STBIR__CODER_NAME(stbir__decode_half_float_linear)( float * decodep, // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( decode < decode_end ) { STBIR_NO_UNROLL(decode); @@ -9221,6 +9452,7 @@ static void STBIR__CODER_NAME( stbir__encode_half_float_linear )( void * outputp { float const * end_encode_m8 = encode + width_times_channels - 8; end_output -= 8; + STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR for(;;) { STBIR_SIMD_NO_UNROLL(encode); @@ -9247,7 +9479,7 @@ static void STBIR__CODER_NAME( stbir__encode_half_float_linear )( void * outputp #endif encode += 8; output += 8; - if ( output <= end_output ) + if ( output <= end_output ) continue; if ( output == ( end_output + 8 ) ) break; @@ -9261,6 +9493,7 @@ static void STBIR__CODER_NAME( stbir__encode_half_float_linear )( void * outputp // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four output += 4; + STBIR_SIMD_NO_UNROLL_LOOP_START while( output <= end_output ) { STBIR_SIMD_NO_UNROLL(output); @@ -9276,6 +9509,7 @@ static void STBIR__CODER_NAME( stbir__encode_half_float_linear )( void * outputp // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( output < end_output ) { STBIR_NO_UNROLL(output); @@ -9304,6 +9538,7 @@ static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int { float const * end_input_m16 = input + width_times_channels - 16; decode_end -= 16; + STBIR_NO_UNROLL_LOOP_START_INF_FOR for(;;) { STBIR_NO_UNROLL(decode); @@ -9338,7 +9573,7 @@ static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int #endif decode += 16; input += 16; - if ( decode <= decode_end ) + if ( decode <= decode_end ) continue; if ( decode == ( decode_end + 16 ) ) break; @@ -9352,6 +9587,7 @@ static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four decode += 4; + STBIR_SIMD_NO_UNROLL_LOOP_START while( decode <= decode_end ) { STBIR_SIMD_NO_UNROLL(decode); @@ -9367,6 +9603,7 @@ static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( decode < decode_end ) { STBIR_NO_UNROLL(decode); @@ -9383,10 +9620,10 @@ static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int #endif #else - + if ( (void*)decodep != inputp ) STBIR_MEMCPY( decodep, inputp, width_times_channels * sizeof( float ) ); - + #endif } @@ -9426,6 +9663,7 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int { float const * end_encode_m8 = encode + width_times_channels - ( stbir__simdfX_float_count * 2 ); end_output -= ( stbir__simdfX_float_count * 2 ); + STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR for(;;) { stbir__simdfX e0, e1; @@ -9435,18 +9673,18 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int #ifdef STBIR_FLOAT_HIGH_CLAMP stbir__simdfX_min( e0, e0, high_clamp ); stbir__simdfX_min( e1, e1, high_clamp ); -#endif +#endif #ifdef STBIR_FLOAT_LOW_CLAMP stbir__simdfX_max( e0, e0, low_clamp ); stbir__simdfX_max( e1, e1, low_clamp ); -#endif +#endif stbir__encode_simdfX_unflip( e0 ); stbir__encode_simdfX_unflip( e1 ); stbir__simdfX_store( output, e0 ); stbir__simdfX_store( output+stbir__simdfX_float_count, e1 ); encode += stbir__simdfX_float_count * 2; output += stbir__simdfX_float_count * 2; - if ( output < end_output ) + if ( output < end_output ) continue; if ( output == ( end_output + ( stbir__simdfX_float_count * 2 ) ) ) break; @@ -9459,6 +9697,7 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four output += 4; + STBIR_NO_UNROLL_LOOP_START while( output <= end_output ) { stbir__simdf e0; @@ -9466,10 +9705,10 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int stbir__simdf_load( e0, encode ); #ifdef STBIR_FLOAT_HIGH_CLAMP stbir__simdf_min( e0, e0, high_clamp ); -#endif +#endif #ifdef STBIR_FLOAT_LOW_CLAMP stbir__simdf_max( e0, e0, low_clamp ); -#endif +#endif stbir__encode_simdf4_unflip( e0 ); stbir__simdf_store( output-4, e0 ); output += 4; @@ -9483,6 +9722,7 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int // try to do blocks of 4 when you can #if stbir__coder_min_num != 3 // doesn't divide cleanly by four output += 4; + STBIR_SIMD_NO_UNROLL_LOOP_START while( output <= end_output ) { float e; @@ -9502,6 +9742,7 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int // do the remnants #if stbir__coder_min_num < 4 + STBIR_NO_UNROLL_LOOP_START while( output < end_output ) { float e; @@ -9517,18 +9758,18 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int encode += stbir__coder_min_num; } #endif - + #endif } -#undef stbir__decode_suffix +#undef stbir__decode_suffix #undef stbir__decode_simdf8_flip #undef stbir__decode_simdf4_flip -#undef stbir__decode_order0 +#undef stbir__decode_order0 #undef stbir__decode_order1 #undef stbir__decode_order2 #undef stbir__decode_order3 -#undef stbir__encode_order0 +#undef stbir__encode_order0 #undef stbir__encode_order1 #undef stbir__encode_order2 #undef stbir__encode_order3 @@ -9612,7 +9853,8 @@ static void STBIR_chans( stbir__vertical_scatter_with_,_coeffs)( float ** output stbIF5(stbir__simdfX c5 = stbir__simdf_frepX( c5s ); ) stbIF6(stbir__simdfX c6 = stbir__simdf_frepX( c6s ); ) stbIF7(stbir__simdfX c7 = stbir__simdf_frepX( c7s ); ) - while ( ( (char*)input_end - (char*) input ) >= (16*stbir__simdfX_float_count) ) + STBIR_SIMD_NO_UNROLL_LOOP_START + while ( ( (char*)input_end - (char*) input ) >= (16*stbir__simdfX_float_count) ) { stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3; STBIR_SIMD_NO_UNROLL(output0); @@ -9621,52 +9863,53 @@ static void STBIR_chans( stbir__vertical_scatter_with_,_coeffs)( float ** output #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE stbIF0( stbir__simdfX_load( o0, output0 ); stbir__simdfX_load( o1, output0+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output0+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output0+(3*stbir__simdfX_float_count) ); - stbir__simdfX_madd( o0, o0, r0, c0 ); stbir__simdfX_madd( o1, o1, r1, c0 ); stbir__simdfX_madd( o2, o2, r2, c0 ); stbir__simdfX_madd( o3, o3, r3, c0 ); + stbir__simdfX_madd( o0, o0, r0, c0 ); stbir__simdfX_madd( o1, o1, r1, c0 ); stbir__simdfX_madd( o2, o2, r2, c0 ); stbir__simdfX_madd( o3, o3, r3, c0 ); stbir__simdfX_store( output0, o0 ); stbir__simdfX_store( output0+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output0+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output0+(3*stbir__simdfX_float_count), o3 ); ) stbIF1( stbir__simdfX_load( o0, output1 ); stbir__simdfX_load( o1, output1+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output1+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output1+(3*stbir__simdfX_float_count) ); - stbir__simdfX_madd( o0, o0, r0, c1 ); stbir__simdfX_madd( o1, o1, r1, c1 ); stbir__simdfX_madd( o2, o2, r2, c1 ); stbir__simdfX_madd( o3, o3, r3, c1 ); + stbir__simdfX_madd( o0, o0, r0, c1 ); stbir__simdfX_madd( o1, o1, r1, c1 ); stbir__simdfX_madd( o2, o2, r2, c1 ); stbir__simdfX_madd( o3, o3, r3, c1 ); stbir__simdfX_store( output1, o0 ); stbir__simdfX_store( output1+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output1+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output1+(3*stbir__simdfX_float_count), o3 ); ) stbIF2( stbir__simdfX_load( o0, output2 ); stbir__simdfX_load( o1, output2+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output2+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output2+(3*stbir__simdfX_float_count) ); - stbir__simdfX_madd( o0, o0, r0, c2 ); stbir__simdfX_madd( o1, o1, r1, c2 ); stbir__simdfX_madd( o2, o2, r2, c2 ); stbir__simdfX_madd( o3, o3, r3, c2 ); + stbir__simdfX_madd( o0, o0, r0, c2 ); stbir__simdfX_madd( o1, o1, r1, c2 ); stbir__simdfX_madd( o2, o2, r2, c2 ); stbir__simdfX_madd( o3, o3, r3, c2 ); stbir__simdfX_store( output2, o0 ); stbir__simdfX_store( output2+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output2+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output2+(3*stbir__simdfX_float_count), o3 ); ) stbIF3( stbir__simdfX_load( o0, output3 ); stbir__simdfX_load( o1, output3+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output3+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output3+(3*stbir__simdfX_float_count) ); - stbir__simdfX_madd( o0, o0, r0, c3 ); stbir__simdfX_madd( o1, o1, r1, c3 ); stbir__simdfX_madd( o2, o2, r2, c3 ); stbir__simdfX_madd( o3, o3, r3, c3 ); + stbir__simdfX_madd( o0, o0, r0, c3 ); stbir__simdfX_madd( o1, o1, r1, c3 ); stbir__simdfX_madd( o2, o2, r2, c3 ); stbir__simdfX_madd( o3, o3, r3, c3 ); stbir__simdfX_store( output3, o0 ); stbir__simdfX_store( output3+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output3+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output3+(3*stbir__simdfX_float_count), o3 ); ) stbIF4( stbir__simdfX_load( o0, output4 ); stbir__simdfX_load( o1, output4+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output4+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output4+(3*stbir__simdfX_float_count) ); - stbir__simdfX_madd( o0, o0, r0, c4 ); stbir__simdfX_madd( o1, o1, r1, c4 ); stbir__simdfX_madd( o2, o2, r2, c4 ); stbir__simdfX_madd( o3, o3, r3, c4 ); + stbir__simdfX_madd( o0, o0, r0, c4 ); stbir__simdfX_madd( o1, o1, r1, c4 ); stbir__simdfX_madd( o2, o2, r2, c4 ); stbir__simdfX_madd( o3, o3, r3, c4 ); stbir__simdfX_store( output4, o0 ); stbir__simdfX_store( output4+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output4+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output4+(3*stbir__simdfX_float_count), o3 ); ) stbIF5( stbir__simdfX_load( o0, output5 ); stbir__simdfX_load( o1, output5+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output5+(2*stbir__simdfX_float_count)); stbir__simdfX_load( o3, output5+(3*stbir__simdfX_float_count) ); - stbir__simdfX_madd( o0, o0, r0, c5 ); stbir__simdfX_madd( o1, o1, r1, c5 ); stbir__simdfX_madd( o2, o2, r2, c5 ); stbir__simdfX_madd( o3, o3, r3, c5 ); + stbir__simdfX_madd( o0, o0, r0, c5 ); stbir__simdfX_madd( o1, o1, r1, c5 ); stbir__simdfX_madd( o2, o2, r2, c5 ); stbir__simdfX_madd( o3, o3, r3, c5 ); stbir__simdfX_store( output5, o0 ); stbir__simdfX_store( output5+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output5+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output5+(3*stbir__simdfX_float_count), o3 ); ) stbIF6( stbir__simdfX_load( o0, output6 ); stbir__simdfX_load( o1, output6+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output6+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output6+(3*stbir__simdfX_float_count) ); - stbir__simdfX_madd( o0, o0, r0, c6 ); stbir__simdfX_madd( o1, o1, r1, c6 ); stbir__simdfX_madd( o2, o2, r2, c6 ); stbir__simdfX_madd( o3, o3, r3, c6 ); + stbir__simdfX_madd( o0, o0, r0, c6 ); stbir__simdfX_madd( o1, o1, r1, c6 ); stbir__simdfX_madd( o2, o2, r2, c6 ); stbir__simdfX_madd( o3, o3, r3, c6 ); stbir__simdfX_store( output6, o0 ); stbir__simdfX_store( output6+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output6+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output6+(3*stbir__simdfX_float_count), o3 ); ) stbIF7( stbir__simdfX_load( o0, output7 ); stbir__simdfX_load( o1, output7+stbir__simdfX_float_count ); stbir__simdfX_load( o2, output7+(2*stbir__simdfX_float_count) ); stbir__simdfX_load( o3, output7+(3*stbir__simdfX_float_count) ); - stbir__simdfX_madd( o0, o0, r0, c7 ); stbir__simdfX_madd( o1, o1, r1, c7 ); stbir__simdfX_madd( o2, o2, r2, c7 ); stbir__simdfX_madd( o3, o3, r3, c7 ); + stbir__simdfX_madd( o0, o0, r0, c7 ); stbir__simdfX_madd( o1, o1, r1, c7 ); stbir__simdfX_madd( o2, o2, r2, c7 ); stbir__simdfX_madd( o3, o3, r3, c7 ); stbir__simdfX_store( output7, o0 ); stbir__simdfX_store( output7+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output7+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output7+(3*stbir__simdfX_float_count), o3 ); ) #else - stbIF0( stbir__simdfX_mult( o0, r0, c0 ); stbir__simdfX_mult( o1, r1, c0 ); stbir__simdfX_mult( o2, r2, c0 ); stbir__simdfX_mult( o3, r3, c0 ); + stbIF0( stbir__simdfX_mult( o0, r0, c0 ); stbir__simdfX_mult( o1, r1, c0 ); stbir__simdfX_mult( o2, r2, c0 ); stbir__simdfX_mult( o3, r3, c0 ); stbir__simdfX_store( output0, o0 ); stbir__simdfX_store( output0+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output0+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output0+(3*stbir__simdfX_float_count), o3 ); ) - stbIF1( stbir__simdfX_mult( o0, r0, c1 ); stbir__simdfX_mult( o1, r1, c1 ); stbir__simdfX_mult( o2, r2, c1 ); stbir__simdfX_mult( o3, r3, c1 ); + stbIF1( stbir__simdfX_mult( o0, r0, c1 ); stbir__simdfX_mult( o1, r1, c1 ); stbir__simdfX_mult( o2, r2, c1 ); stbir__simdfX_mult( o3, r3, c1 ); stbir__simdfX_store( output1, o0 ); stbir__simdfX_store( output1+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output1+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output1+(3*stbir__simdfX_float_count), o3 ); ) - stbIF2( stbir__simdfX_mult( o0, r0, c2 ); stbir__simdfX_mult( o1, r1, c2 ); stbir__simdfX_mult( o2, r2, c2 ); stbir__simdfX_mult( o3, r3, c2 ); + stbIF2( stbir__simdfX_mult( o0, r0, c2 ); stbir__simdfX_mult( o1, r1, c2 ); stbir__simdfX_mult( o2, r2, c2 ); stbir__simdfX_mult( o3, r3, c2 ); stbir__simdfX_store( output2, o0 ); stbir__simdfX_store( output2+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output2+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output2+(3*stbir__simdfX_float_count), o3 ); ) - stbIF3( stbir__simdfX_mult( o0, r0, c3 ); stbir__simdfX_mult( o1, r1, c3 ); stbir__simdfX_mult( o2, r2, c3 ); stbir__simdfX_mult( o3, r3, c3 ); + stbIF3( stbir__simdfX_mult( o0, r0, c3 ); stbir__simdfX_mult( o1, r1, c3 ); stbir__simdfX_mult( o2, r2, c3 ); stbir__simdfX_mult( o3, r3, c3 ); stbir__simdfX_store( output3, o0 ); stbir__simdfX_store( output3+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output3+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output3+(3*stbir__simdfX_float_count), o3 ); ) - stbIF4( stbir__simdfX_mult( o0, r0, c4 ); stbir__simdfX_mult( o1, r1, c4 ); stbir__simdfX_mult( o2, r2, c4 ); stbir__simdfX_mult( o3, r3, c4 ); + stbIF4( stbir__simdfX_mult( o0, r0, c4 ); stbir__simdfX_mult( o1, r1, c4 ); stbir__simdfX_mult( o2, r2, c4 ); stbir__simdfX_mult( o3, r3, c4 ); stbir__simdfX_store( output4, o0 ); stbir__simdfX_store( output4+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output4+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output4+(3*stbir__simdfX_float_count), o3 ); ) - stbIF5( stbir__simdfX_mult( o0, r0, c5 ); stbir__simdfX_mult( o1, r1, c5 ); stbir__simdfX_mult( o2, r2, c5 ); stbir__simdfX_mult( o3, r3, c5 ); + stbIF5( stbir__simdfX_mult( o0, r0, c5 ); stbir__simdfX_mult( o1, r1, c5 ); stbir__simdfX_mult( o2, r2, c5 ); stbir__simdfX_mult( o3, r3, c5 ); stbir__simdfX_store( output5, o0 ); stbir__simdfX_store( output5+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output5+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output5+(3*stbir__simdfX_float_count), o3 ); ) - stbIF6( stbir__simdfX_mult( o0, r0, c6 ); stbir__simdfX_mult( o1, r1, c6 ); stbir__simdfX_mult( o2, r2, c6 ); stbir__simdfX_mult( o3, r3, c6 ); + stbIF6( stbir__simdfX_mult( o0, r0, c6 ); stbir__simdfX_mult( o1, r1, c6 ); stbir__simdfX_mult( o2, r2, c6 ); stbir__simdfX_mult( o3, r3, c6 ); stbir__simdfX_store( output6, o0 ); stbir__simdfX_store( output6+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output6+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output6+(3*stbir__simdfX_float_count), o3 ); ) - stbIF7( stbir__simdfX_mult( o0, r0, c7 ); stbir__simdfX_mult( o1, r1, c7 ); stbir__simdfX_mult( o2, r2, c7 ); stbir__simdfX_mult( o3, r3, c7 ); + stbIF7( stbir__simdfX_mult( o0, r0, c7 ); stbir__simdfX_mult( o1, r1, c7 ); stbir__simdfX_mult( o2, r2, c7 ); stbir__simdfX_mult( o3, r3, c7 ); stbir__simdfX_store( output7, o0 ); stbir__simdfX_store( output7+stbir__simdfX_float_count, o1 ); stbir__simdfX_store( output7+(2*stbir__simdfX_float_count), o2 ); stbir__simdfX_store( output7+(3*stbir__simdfX_float_count), o3 ); ) #endif input += (4*stbir__simdfX_float_count); stbIF0( output0 += (4*stbir__simdfX_float_count); ) stbIF1( output1 += (4*stbir__simdfX_float_count); ) stbIF2( output2 += (4*stbir__simdfX_float_count); ) stbIF3( output3 += (4*stbir__simdfX_float_count); ) stbIF4( output4 += (4*stbir__simdfX_float_count); ) stbIF5( output5 += (4*stbir__simdfX_float_count); ) stbIF6( output6 += (4*stbir__simdfX_float_count); ) stbIF7( output7 += (4*stbir__simdfX_float_count); ) } - while ( ( (char*)input_end - (char*) input ) >= 16 ) + STBIR_SIMD_NO_UNROLL_LOOP_START + while ( ( (char*)input_end - (char*) input ) >= 16 ) { stbir__simdf o0, r0; STBIR_SIMD_NO_UNROLL(output0); @@ -9692,13 +9935,14 @@ static void STBIR_chans( stbir__vertical_scatter_with_,_coeffs)( float ** output stbIF6( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) ); stbir__simdf_store( output6, o0 ); ) stbIF7( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) ); stbir__simdf_store( output7, o0 ); ) #endif - + input += 4; stbIF0( output0 += 4; ) stbIF1( output1 += 4; ) stbIF2( output2 += 4; ) stbIF3( output3 += 4; ) stbIF4( output4 += 4; ) stbIF5( output5 += 4; ) stbIF6( output6 += 4; ) stbIF7( output7 += 4; ) } } #else - while ( ( (char*)input_end - (char*) input ) >= 16 ) + STBIR_NO_UNROLL_LOOP_START + while ( ( (char*)input_end - (char*) input ) >= 16 ) { float r0, r1, r2, r3; STBIR_NO_UNROLL(input); @@ -9729,7 +9973,8 @@ static void STBIR_chans( stbir__vertical_scatter_with_,_coeffs)( float ** output stbIF0( output0 += 4; ) stbIF1( output1 += 4; ) stbIF2( output2 += 4; ) stbIF3( output3 += 4; ) stbIF4( output4 += 4; ) stbIF5( output5 += 4; ) stbIF6( output6 += 4; ) stbIF7( output7 += 4; ) } #endif - while ( input < input_end ) + STBIR_NO_UNROLL_LOOP_START + while ( input < input_end ) { float r = input[0]; STBIR_NO_UNROLL(output0); @@ -9779,7 +10024,7 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp, STBIR_MEMCPY( output, input0, (char*)input0_end - (char*)input0 ); return; } -#endif +#endif #ifdef STBIR_SIMD { @@ -9791,14 +10036,15 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp, stbIF5(stbir__simdfX c5 = stbir__simdf_frepX( c5s ); ) stbIF6(stbir__simdfX c6 = stbir__simdf_frepX( c6s ); ) stbIF7(stbir__simdfX c7 = stbir__simdf_frepX( c7s ); ) - - while ( ( (char*)input0_end - (char*) input0 ) >= (16*stbir__simdfX_float_count) ) + + STBIR_SIMD_NO_UNROLL_LOOP_START + while ( ( (char*)input0_end - (char*) input0 ) >= (16*stbir__simdfX_float_count) ) { stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3; STBIR_SIMD_NO_UNROLL(output); // prefetch four loop iterations ahead (doesn't affect much for small resizes, but helps with big ones) - stbIF0( stbir__prefetch( input0 + (16*stbir__simdfX_float_count) ); ) + stbIF0( stbir__prefetch( input0 + (16*stbir__simdfX_float_count) ); ) stbIF1( stbir__prefetch( input1 + (16*stbir__simdfX_float_count) ); ) stbIF2( stbir__prefetch( input2 + (16*stbir__simdfX_float_count) ); ) stbIF3( stbir__prefetch( input3 + (16*stbir__simdfX_float_count) ); ) @@ -9836,7 +10082,8 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp, stbIF0( input0 += (4*stbir__simdfX_float_count); ) stbIF1( input1 += (4*stbir__simdfX_float_count); ) stbIF2( input2 += (4*stbir__simdfX_float_count); ) stbIF3( input3 += (4*stbir__simdfX_float_count); ) stbIF4( input4 += (4*stbir__simdfX_float_count); ) stbIF5( input5 += (4*stbir__simdfX_float_count); ) stbIF6( input6 += (4*stbir__simdfX_float_count); ) stbIF7( input7 += (4*stbir__simdfX_float_count); ) } - while ( ( (char*)input0_end - (char*) input0 ) >= 16 ) + STBIR_SIMD_NO_UNROLL_LOOP_START + while ( ( (char*)input0_end - (char*) input0 ) >= 16 ) { stbir__simdf o0, r0; STBIR_SIMD_NO_UNROLL(output); @@ -9860,7 +10107,8 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp, } } #else - while ( ( (char*)input0_end - (char*) input0 ) >= 16 ) + STBIR_NO_UNROLL_LOOP_START + while ( ( (char*)input0_end - (char*) input0 ) >= 16 ) { float o0, o1, o2, o3; STBIR_NO_UNROLL(output); @@ -9881,7 +10129,8 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp, stbIF0( input0 += 4; ) stbIF1( input1 += 4; ) stbIF2( input2 += 4; ) stbIF3( input3 += 4; ) stbIF4( input4 += 4; ) stbIF5( input5 += 4; ) stbIF6( input6 += 4; ) stbIF7( input7 += 4; ) } #endif - while ( input0 < input0_end ) + STBIR_NO_UNROLL_LOOP_START + while ( input0 < input0_end ) { float o0; STBIR_NO_UNROLL(output); @@ -9897,7 +10146,7 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp, stbIF5( o0 += input5[0] * c5s; ) stbIF6( o0 += input6[0] * c6s; ) stbIF7( o0 += input7[0] * c7s; ) - output[0] = o0; + output[0] = o0; ++output; stbIF0( ++input0; ) stbIF1( ++input1; ) stbIF2( ++input2; ) stbIF3( ++input3; ) stbIF4( ++input4; ) stbIF5( ++input5; ) stbIF6( ++input6; ) stbIF7( ++input7; ) } @@ -9928,25 +10177,25 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp, #ifndef stbir__2_coeff_only #define stbir__2_coeff_only() \ stbir__1_coeff_only(); \ - stbir__1_coeff_remnant(1); + stbir__1_coeff_remnant(1); #endif #ifndef stbir__2_coeff_remnant #define stbir__2_coeff_remnant( ofs ) \ stbir__1_coeff_remnant(ofs); \ - stbir__1_coeff_remnant((ofs)+1); + stbir__1_coeff_remnant((ofs)+1); #endif - + #ifndef stbir__3_coeff_only #define stbir__3_coeff_only() \ stbir__2_coeff_only(); \ - stbir__1_coeff_remnant(2); + stbir__1_coeff_remnant(2); #endif - + #ifndef stbir__3_coeff_remnant #define stbir__3_coeff_remnant( ofs ) \ stbir__2_coeff_remnant(ofs); \ - stbir__1_coeff_remnant((ofs)+2); + stbir__1_coeff_remnant((ofs)+2); #endif #ifndef stbir__3_coeff_setup @@ -9956,13 +10205,13 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp, #ifndef stbir__4_coeff_start #define stbir__4_coeff_start() \ stbir__2_coeff_only(); \ - stbir__2_coeff_remnant(2); + stbir__2_coeff_remnant(2); #endif - + #ifndef stbir__4_coeff_continue_from_4 #define stbir__4_coeff_continue_from_4( ofs ) \ stbir__2_coeff_remnant(ofs); \ - stbir__2_coeff_remnant((ofs)+2); + stbir__2_coeff_remnant((ofs)+2); #endif #ifndef stbir__store_output_tiny @@ -9973,8 +10222,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_1_coeff)( floa { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; float const * hc = horizontal_coefficients; stbir__1_coeff_only(); stbir__store_output_tiny(); @@ -9985,8 +10235,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_2_coeffs)( flo { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; float const * hc = horizontal_coefficients; stbir__2_coeff_only(); stbir__store_output_tiny(); @@ -9997,8 +10248,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_3_coeffs)( flo { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; float const * hc = horizontal_coefficients; stbir__3_coeff_only(); stbir__store_output_tiny(); @@ -10009,8 +10261,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_4_coeffs)( flo { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; float const * hc = horizontal_coefficients; stbir__4_coeff_start(); stbir__store_output(); @@ -10021,8 +10274,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_5_coeffs)( flo { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; float const * hc = horizontal_coefficients; stbir__4_coeff_start(); stbir__1_coeff_remnant(4); @@ -10034,8 +10288,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_6_coeffs)( flo { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; float const * hc = horizontal_coefficients; stbir__4_coeff_start(); stbir__2_coeff_remnant(4); @@ -10048,10 +10303,11 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_7_coeffs)( flo float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; stbir__3_coeff_setup(); + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; float const * hc = horizontal_coefficients; - + stbir__4_coeff_start(); stbir__3_coeff_remnant(4); stbir__store_output(); @@ -10062,8 +10318,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_8_coeffs)( flo { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; float const * hc = horizontal_coefficients; stbir__4_coeff_start(); stbir__4_coeff_continue_from_4(4); @@ -10075,8 +10332,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_9_coeffs)( flo { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; float const * hc = horizontal_coefficients; stbir__4_coeff_start(); stbir__4_coeff_continue_from_4(4); @@ -10089,8 +10347,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_10_coeffs)( fl { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; float const * hc = horizontal_coefficients; stbir__4_coeff_start(); stbir__4_coeff_continue_from_4(4); @@ -10104,8 +10363,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_11_coeffs)( fl float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; stbir__3_coeff_setup(); + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; float const * hc = horizontal_coefficients; stbir__4_coeff_start(); stbir__4_coeff_continue_from_4(4); @@ -10118,8 +10378,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_12_coeffs)( fl { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; float const * hc = horizontal_coefficients; stbir__4_coeff_start(); stbir__4_coeff_continue_from_4(4); @@ -10132,12 +10393,14 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod0 { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; - int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 4 + 3 ) >> 2; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 4 + 3 ) >> 2; float const * hc = horizontal_coefficients; stbir__4_coeff_start(); + STBIR_SIMD_NO_UNROLL_LOOP_START do { hc += 4; decode += STBIR__horizontal_channels * 4; @@ -10152,19 +10415,21 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod1 { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; - int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 5 + 3 ) >> 2; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 5 + 3 ) >> 2; float const * hc = horizontal_coefficients; stbir__4_coeff_start(); + STBIR_SIMD_NO_UNROLL_LOOP_START do { hc += 4; decode += STBIR__horizontal_channels * 4; stbir__4_coeff_continue_from_4( 0 ); --n; } while ( n > 0 ); - stbir__1_coeff_remnant( 4 ); + stbir__1_coeff_remnant( 4 ); stbir__store_output(); } while ( output < output_end ); } @@ -10173,19 +10438,21 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod2 { float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; - int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 6 + 3 ) >> 2; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 6 + 3 ) >> 2; float const * hc = horizontal_coefficients; stbir__4_coeff_start(); + STBIR_SIMD_NO_UNROLL_LOOP_START do { hc += 4; decode += STBIR__horizontal_channels * 4; stbir__4_coeff_continue_from_4( 0 ); --n; } while ( n > 0 ); - stbir__2_coeff_remnant( 4 ); + stbir__2_coeff_remnant( 4 ); stbir__store_output(); } while ( output < output_end ); @@ -10196,19 +10463,21 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod3 float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels; float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer; stbir__3_coeff_setup(); + STBIR_SIMD_NO_UNROLL_LOOP_START do { - float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; - int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 7 + 3 ) >> 2; + float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; + int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 7 + 3 ) >> 2; float const * hc = horizontal_coefficients; stbir__4_coeff_start(); + STBIR_SIMD_NO_UNROLL_LOOP_START do { hc += 4; decode += STBIR__horizontal_channels * 4; stbir__4_coeff_continue_from_4( 0 ); --n; } while ( n > 0 ); - stbir__3_coeff_remnant( 4 ); + stbir__3_coeff_remnant( 4 ); stbir__store_output(); } while ( output < output_end ); @@ -10216,26 +10485,26 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod3 static stbir__horizontal_gather_channels_func * STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_funcs)[4]= { - STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod0), - STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod1), - STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod2), - STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod3), + STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod0), + STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod1), + STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod2), + STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod3), }; static stbir__horizontal_gather_channels_func * STBIR_chans(stbir__horizontal_gather_,_channels_funcs)[12]= { - STBIR_chans(stbir__horizontal_gather_,_channels_with_1_coeff), - STBIR_chans(stbir__horizontal_gather_,_channels_with_2_coeffs), + STBIR_chans(stbir__horizontal_gather_,_channels_with_1_coeff), + STBIR_chans(stbir__horizontal_gather_,_channels_with_2_coeffs), STBIR_chans(stbir__horizontal_gather_,_channels_with_3_coeffs), - STBIR_chans(stbir__horizontal_gather_,_channels_with_4_coeffs), - STBIR_chans(stbir__horizontal_gather_,_channels_with_5_coeffs), - STBIR_chans(stbir__horizontal_gather_,_channels_with_6_coeffs), + STBIR_chans(stbir__horizontal_gather_,_channels_with_4_coeffs), + STBIR_chans(stbir__horizontal_gather_,_channels_with_5_coeffs), + STBIR_chans(stbir__horizontal_gather_,_channels_with_6_coeffs), STBIR_chans(stbir__horizontal_gather_,_channels_with_7_coeffs), - STBIR_chans(stbir__horizontal_gather_,_channels_with_8_coeffs), - STBIR_chans(stbir__horizontal_gather_,_channels_with_9_coeffs), - STBIR_chans(stbir__horizontal_gather_,_channels_with_10_coeffs), - STBIR_chans(stbir__horizontal_gather_,_channels_with_11_coeffs), - STBIR_chans(stbir__horizontal_gather_,_channels_with_12_coeffs), + STBIR_chans(stbir__horizontal_gather_,_channels_with_8_coeffs), + STBIR_chans(stbir__horizontal_gather_,_channels_with_9_coeffs), + STBIR_chans(stbir__horizontal_gather_,_channels_with_10_coeffs), + STBIR_chans(stbir__horizontal_gather_,_channels_with_11_coeffs), + STBIR_chans(stbir__horizontal_gather_,_channels_with_12_coeffs), }; #undef STBIR__horizontal_channels @@ -10266,38 +10535,38 @@ This software is available under 2 licenses -- choose whichever you prefer. ------------------------------------------------------------------------------ ALTERNATIVE A - MIT License Copyright (c) 2017 Sean Barrett -Permission is hereby granted, free of charge, to any person obtaining a copy of -this software and associated documentation files (the "Software"), to deal in -the Software without restriction, including without limitation the rights to -use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies -of the Software, and to permit persons to whom the Software is furnished to do +Permission is hereby granted, free of charge, to any person obtaining a copy of +this software and associated documentation files (the "Software"), to deal in +the Software without restriction, including without limitation the rights to +use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies +of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: -The above copyright notice and this permission notice shall be included in all +The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER -LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, -OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER +LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, +OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ------------------------------------------------------------------------------ ALTERNATIVE B - Public Domain (www.unlicense.org) This is free and unencumbered software released into the public domain. -Anyone is free to copy, modify, publish, use, compile, sell, or distribute this -software, either in source code form or as a compiled binary, for any purpose, +Anyone is free to copy, modify, publish, use, compile, sell, or distribute this +software, either in source code form or as a compiled binary, for any purpose, commercial or non-commercial, and by any means. -In jurisdictions that recognize copyright laws, the author or authors of this -software dedicate any and all copyright interest in the software to the public -domain. We make this dedication for the benefit of the public at large and to -the detriment of our heirs and successors. We intend this dedication to be an -overt act of relinquishment in perpetuity of all present and future rights to +In jurisdictions that recognize copyright laws, the author or authors of this +software dedicate any and all copyright interest in the software to the public +domain. We make this dedication for the benefit of the public at large and to +the detriment of our heirs and successors. We intend this dedication to be an +overt act of relinquishment in perpetuity of all present and future rights to this software under copyright law. -THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR -IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, -FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE -AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN -ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION +THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR +IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, +FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE +AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN +ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. ------------------------------------------------------------------------------ */