diff --git a/src/external/stb_image_resize2.h b/src/external/stb_image_resize2.h
index e0c42824..ae8d730b 100644
--- a/src/external/stb_image_resize2.h
+++ b/src/external/stb_image_resize2.h
@@ -1,9 +1,9 @@
-/* stb_image_resize2 - v2.01 - public domain image resizing
-   
-   by Jeff Roberts (v2) and Jorge L Rodriguez 
+/* stb_image_resize2 - v2.10 - public domain image resizing
+
+   by Jeff Roberts (v2) and Jorge L Rodriguez
    http://github.com/nothings/stb
 
-   Can be threaded with the extended API. SSE2, AVX, Neon and WASM SIMD support. Only 
+   Can be threaded with the extended API. SSE2, AVX, Neon and WASM SIMD support. Only
    scaling and translation is supported, no rotations or shears.
 
    COMPILING & LINKING
@@ -67,60 +67,60 @@
    ADDITIONAL DOCUMENTATION
 
       MEMORY ALLOCATION
-         By default, we use malloc and free for memory allocation.  To override the 
+         By default, we use malloc and free for memory allocation.  To override the
          memory allocation, before the implementation #include, add a:
 
             #define STBIR_MALLOC(size,user_data) ...
             #define STBIR_FREE(ptr,user_data)   ...
 
-         Each resize makes exactly one call to malloc/free (unless you use the 
+         Each resize makes exactly one call to malloc/free (unless you use the
          extended API where you can do one allocation for many resizes). Under
          address sanitizer, we do separate allocations to find overread/writes.
 
       PERFORMANCE
          This library was written with an emphasis on performance. When testing
-         stb_image_resize with RGBA, the fastest mode is STBIR_4CHANNEL with 
+         stb_image_resize with RGBA, the fastest mode is STBIR_4CHANNEL with
          STBIR_TYPE_UINT8 pixels and CLAMPed edges (which is what many other resize
-         libs do by default). Also, make sure SIMD is turned on of course (default 
+         libs do by default). Also, make sure SIMD is turned on of course (default
          for 64-bit targets). Avoid WRAP edge mode if you want the fastest speed.
 
          This library also comes with profiling built-in. If you define STBIR_PROFILE,
-         you can use the advanced API and get low-level profiling information by 
+         you can use the advanced API and get low-level profiling information by
          calling stbir_resize_extended_profile_info() or stbir_resize_split_profile_info()
          after a resize.
 
       SIMD
-         Most of the routines have optimized SSE2, AVX, NEON and WASM versions. 
+         Most of the routines have optimized SSE2, AVX, NEON and WASM versions.
 
-         On Microsoft compilers, we automatically turn on SIMD for 64-bit x64 and 
-         ARM; for 32-bit x86 and ARM, you select SIMD mode by defining STBIR_SSE2 or 
+         On Microsoft compilers, we automatically turn on SIMD for 64-bit x64 and
+         ARM; for 32-bit x86 and ARM, you select SIMD mode by defining STBIR_SSE2 or
          STBIR_NEON. For AVX and AVX2, we auto-select it by detecting the /arch:AVX
-         or /arch:AVX2 switches. You can also always manually turn SSE2, AVX or AVX2 
+         or /arch:AVX2 switches. You can also always manually turn SSE2, AVX or AVX2
          support on by defining STBIR_SSE2, STBIR_AVX or STBIR_AVX2.
 
          On Linux, SSE2 and Neon is on by default for 64-bit x64 or ARM64. For 32-bit,
          we select x86 SIMD mode by whether you have -msse2, -mavx or -mavx2 enabled
          on the command line. For 32-bit ARM, you must pass -mfpu=neon-vfpv4 for both
-         clang and GCC, but GCC also requires an additional -mfp16-format=ieee to 
+         clang and GCC, but GCC also requires an additional -mfp16-format=ieee to
          automatically enable NEON.
 
          On x86 platforms, you can also define STBIR_FP16C to turn on FP16C instructions
          for converting back and forth to half-floats. This is autoselected when we
-         are using AVX2. Clang and GCC also require the -mf16c switch. ARM always uses 
-         the built-in half float hardware NEON instructions. 
+         are using AVX2. Clang and GCC also require the -mf16c switch. ARM always uses
+         the built-in half float hardware NEON instructions.
 
-         You can also tell us to use multiply-add instructions with STBIR_USE_FMA. 
+         You can also tell us to use multiply-add instructions with STBIR_USE_FMA.
          Because x86 doesn't always have fma, we turn it off by default to maintain
          determinism across all platforms. If you don't care about non-FMA determinism
-         and are willing to restrict yourself to more recent x86 CPUs (around the AVX 
+         and are willing to restrict yourself to more recent x86 CPUs (around the AVX
          timeframe), then fma will give you around a 15% speedup.
 
          You can force off SIMD in all cases by defining STBIR_NO_SIMD. You can turn
          off AVX or AVX2 specifically with STBIR_NO_AVX or STBIR_NO_AVX2. AVX is 10%
          to 40% faster, and AVX2 is generally another 12%.
-        
+
       ALPHA CHANNEL
-         Most of the resizing functions provide the ability to control how the alpha 
+         Most of the resizing functions provide the ability to control how the alpha
          channel of an image is processed.
 
          When alpha represents transparency, it is important that when combining
@@ -167,33 +167,33 @@
 
          stb_image_resize expects case #1 by default, applying alpha weighting to
          images, expecting the input images to be unpremultiplied. This is what the
-         COLOR+ALPHA buffer types tell the resizer to do. 
+         COLOR+ALPHA buffer types tell the resizer to do.
 
-         When you use the pixel layouts STBIR_RGBA, STBIR_BGRA, STBIR_ARGB, 
-         STBIR_ABGR, STBIR_RX, or STBIR_XR you are telling us that the pixels are 
-         non-premultiplied. In these cases, the resizer will alpha weight the colors 
-         (effectively creating the premultiplied image), do the filtering, and then 
+         When you use the pixel layouts STBIR_RGBA, STBIR_BGRA, STBIR_ARGB,
+         STBIR_ABGR, STBIR_RA, or STBIR_AR you are telling us that the pixels are
+         non-premultiplied. In these cases, the resizer will alpha weight the colors
+         (effectively creating the premultiplied image), do the filtering, and then
          convert back to non-premult on exit.
 
          When you use the pixel layouts STBIR_RGBA_PM, STBIR_RGBA_PM, STBIR_RGBA_PM,
-         STBIR_RGBA_PM, STBIR_RX_PM or STBIR_XR_PM, you are telling that the pixels 
-         ARE premultiplied. In this case, the resizer doesn't have to do the 
-         premultipling - it can filter directly on the input. This about twice as 
-         fast as the non-premultiplied case, so it's the right option if your data is 
+         STBIR_ABGR_PM, STBIR_RA_PM or STBIR_AR_PM, you are telling us that the pixels
+         ARE premultiplied. In this case, the resizer doesn't have to do the
+         premultiplying - it can filter directly on the input. This is about twice as
+         fast as the non-premultiplied case, so it's the right option if your data is
          already setup correctly.
 
-         When you use the pixel layout STBIR_4CHANNEL or STBIR_2CHANNEL, you are 
-         telling us that there is no channel that represents transparency; it may be 
-         RGB and some unrelated fourth channel that has been stored in the alpha 
-         channel, but it is actually not alpha. No special processing will be 
-         performed. 
+         When you use the pixel layout STBIR_4CHANNEL or STBIR_2CHANNEL, you are
+         telling us that there is no channel that represents transparency; it may be
+         RGB and some unrelated fourth channel that has been stored in the alpha
+         channel, but it is actually not alpha. No special processing will be
+         performed.
 
-         The difference between the generic 4 or 2 channel layouts, and the 
+         The difference between the generic 4 or 2 channel layouts, and the
          specialized _PM versions is with the _PM versions you are telling us that
          the data *is* alpha, just don't premultiply it. That's important when
          using SRGB pixel formats, we need to know where the alpha is, because
          it is converted linearly (rather than with the SRGB converters).
-   
+
          Because alpha weighting produces the same effect as premultiplying, you
          even have the option with non-premultiplied inputs to let the resizer
          produce a premultiplied output. Because the intially computed alpha-weighted
@@ -201,10 +201,10 @@
          than the normal path which un-premultiplies the output image as a final step.
 
          Finally, when converting both in and out of non-premulitplied space (for
-         example, when using STBIR_RGBA), we go to somewhat heroic measures to 
-         ensure that areas with zero alpha value pixels get something reasonable 
-         in the RGB values. If you don't care about the RGB values of zero alpha 
-         pixels, you can call the stbir_set_non_pm_alpha_speed_over_quality() 
+         example, when using STBIR_RGBA), we go to somewhat heroic measures to
+         ensure that areas with zero alpha value pixels get something reasonable
+         in the RGB values. If you don't care about the RGB values of zero alpha
+         pixels, you can call the stbir_set_non_pm_alpha_speed_over_quality()
          function - this runs a premultiplied resize about 25% faster. That said,
          when you really care about speed, using premultiplied pixels for both in
          and out (STBIR_RGBA_PM, etc) much faster than both of these premultiplied
@@ -218,38 +218,38 @@
          layouts with the same number of channels.
 
       DETERMINISM
-         We commit to being deterministic (from x64 to ARM to scalar to SIMD, etc). 
-         This requires compiling with fast-math off (using at least /fp:precise). 
+         We commit to being deterministic (from x64 to ARM to scalar to SIMD, etc).
+         This requires compiling with fast-math off (using at least /fp:precise).
          Also, you must turn off fp-contracting (which turns mult+adds into fmas)!
-         We attempt to do this with pragmas, but with Clang, you usually want to add 
+         We attempt to do this with pragmas, but with Clang, you usually want to add
          -ffp-contract=off to the command line as well.
 
-         For 32-bit x86, you must use SSE and SSE2 codegen for determinism. That is, 
-         if the scalar x87 unit gets used at all, we immediately lose determinism. 
+         For 32-bit x86, you must use SSE and SSE2 codegen for determinism. That is,
+         if the scalar x87 unit gets used at all, we immediately lose determinism.
          On Microsoft Visual Studio 2008 and earlier, from what we can tell there is
-         no way to be deterministic in 32-bit x86 (some x87 always leaks in, even 
-         with fp:strict). On 32-bit x86 GCC, determinism requires both -msse2 and 
+         no way to be deterministic in 32-bit x86 (some x87 always leaks in, even
+         with fp:strict). On 32-bit x86 GCC, determinism requires both -msse2 and
          -fpmath=sse.
 
          Note that we will not be deterministic with float data containing NaNs -
-         the NaNs will propagate differently on different SIMD and platforms. 
+         the NaNs will propagate differently on different SIMD and platforms.
 
-         If you turn on STBIR_USE_FMA, then we will be deterministic with other 
-         fma targets, but we will differ from non-fma targets (this is unavoidable, 
-         because a fma isn't simply an add with a mult - it also introduces a 
-         rounding difference compared to non-fma instruction sequences. 
+         If you turn on STBIR_USE_FMA, then we will be deterministic with other
+         fma targets, but we will differ from non-fma targets (this is unavoidable,
+         because a fma isn't simply an add with a mult - it also introduces a
+         rounding difference compared to non-fma instruction sequences).
 
       FLOAT PIXEL FORMAT RANGE
-         Any range of values can be used for the non-alpha float data that you pass 
-         in (0 to 1, -1 to 1, whatever). However, if you are inputting float values 
-         but *outputting* bytes or shorts, you must use a range of 0 to 1 so that we 
-         scale back properly. The alpha channel must also be 0 to 1 for any format 
-         that does premultiplication prior to resizing. 
+         Any range of values can be used for the non-alpha float data that you pass
+         in (0 to 1, -1 to 1, whatever). However, if you are inputting float values
+         but *outputting* bytes or shorts, you must use a range of 0 to 1 so that we
+         scale back properly. The alpha channel must also be 0 to 1 for any format
+         that does premultiplication prior to resizing.
 
-         Note also that with float output, using filters with negative lobes, the 
-         output filtered values might go slightly out of range. You can define 
-         STBIR_FLOAT_LOW_CLAMP and/or STBIR_FLOAT_HIGH_CLAMP to specify the range 
-         to clamp to on output, if that's important. 
+         Note also that with float output, using filters with negative lobes, the
+         output filtered values might go slightly out of range. You can define
+         STBIR_FLOAT_LOW_CLAMP and/or STBIR_FLOAT_HIGH_CLAMP to specify the range
+         to clamp to on output, if that's important.
 
       MAX/MIN SCALE FACTORS
          The input pixel resolutions are in integers, and we do the internal pointer
@@ -263,13 +263,13 @@
          buffers).
 
       FLIPPED IMAGES
-         Stride is just the delta from one scanline to the next. This means you can 
-         use a negative stride to handle inverted images (point to the final 
+         Stride is just the delta from one scanline to the next. This means you can
+         use a negative stride to handle inverted images (point to the final
          scanline and use a negative stride). You can invert the input or output,
          using negative strides.
 
       DEFAULT FILTERS
-         For functions which don't provide explicit control over what filters to 
+         For functions which don't provide explicit control over what filters to
          use, you can change the compile-time defaults with:
 
             #define STBIR_DEFAULT_FILTER_UPSAMPLE     STBIR_FILTER_something
@@ -278,18 +278,18 @@
          See stbir_filter in the header-file section for the list of filters.
 
       NEW FILTERS
-         A number of 1D filter kernels are supplied. For a list of supported 
-         filters, see the stbir_filter enum. You can install your own filters by 
+         A number of 1D filter kernels are supplied. For a list of supported
+         filters, see the stbir_filter enum. You can install your own filters by
          using the stbir_set_filter_callbacks function.
 
       PROGRESS
-         For interactive use with slow resize operations, you can use the the 
-         scanline callbacks in the extended API. It would have to be a *very* large 
+         For interactive use with slow resize operations, you can use the
+         scanline callbacks in the extended API. It would have to be a *very* large
          image resample to need progress though - we're very fast.
 
       CEIL and FLOOR
-         In scalar mode, the only functions we use from math.h are ceilf and floorf, 
-         but if you have your own versions, you can define the STBIR_CEILF(v) and 
+         In scalar mode, the only functions we use from math.h are ceilf and floorf,
+         but if you have your own versions, you can define the STBIR_CEILF(v) and
          STBIR_FLOORF(v) macros and we'll use them instead. In SIMD, we just use
          our own versions.
 
@@ -304,7 +304,7 @@
          * For SIMD encode and decode scanline routines, do any pre-aligning
            for bad input/output buffer alignments and pitch?
          * For very wide scanlines, we should we do vertical strips to stay within
-           L2 cache. Maybe do chunks of 1K pixels at a time. There would be 
+           L2 cache. Maybe do chunks of 1K pixels at a time. There would be
            some pixel reconversion, but probably dwarfed by things falling out
            of cache. Probably also something possible with alternating between
            scattering and gathering at high resize scales?
@@ -316,21 +316,38 @@
            the pivot cost and the extra memory touches). Need to buffer the whole
            image so have to balance memory use.
          * Most of our code is internally function pointers, should we compile
-           all the SIMD stuff always and dynamically dispatch? 
+           all the SIMD stuff always and dynamically dispatch?
 
    CONTRIBUTORS
       Jeff Roberts: 2.0 implementation, optimizations, SIMD
-      Martins Mozeiko: NEON simd, WASM simd, clang and GCC whisperer.
+      Martins Mozeiko: NEON simd, WASM simd, clang and GCC whisperer
       Fabian Giesen: half float and srgb converters
       Sean Barrett: API design, optimizations
       Jorge L Rodriguez: Original 1.0 implementation
-      Aras Pranckevicius: bugfixes for 1.0
+      Aras Pranckevicius: bugfixes
       Nathan Reed: warning fixes for 1.0
 
    REVISIONS
-      2.00 (2022-02-20) mostly new source: new api, optimizations, simd, vertical-first, etc 
-                       (2x-5x faster without simd, 4x-12x faster with simd)
-                       (in some cases, 20x to 40x faster - resizing to very small for example)
+      2.10 (2024-07-27) fix the GCC and mingw defines for loop unroll control,
+                          fix MSVC 32-bit arm half float routines.
+      2.09 (2024-06-19) fix the defines for 32-bit ARM GCC builds (was selecting
+                          hardware half floats).
+      2.08 (2024-06-10) fix for RGB->BGR three channel flips and add SIMD (thanks
+                          to Ryan Salsbury), fix for sub-rect resizes, use the
+                          pragmas to control unrolling when they are available.
+      2.07 (2024-05-24) fix for slow final split during threaded conversions of very
+                          wide scanlines when downsampling (caused by extra input
+                          converting), fix for wide scanline resamples with many
+                          splits (int overflow), fix GCC warning.
+      2.06 (2024-02-10) fix for identical width/height 3x or more down-scaling
+                          undersampling a single row on rare resize ratios (about 1%).
+      2.05 (2024-02-07) fix for 2 pixel to 1 pixel resizes with wrap (thanks Aras),
+                          fix for output callback (thanks Julien Koenen).
+      2.04 (2023-11-17) fix for rare AVX bug, shadowed symbol (thanks Nikola Smiljanic).
+      2.03 (2023-11-01) ASAN and TSAN warnings fixed, minor tweaks.
+      2.00 (2023-10-10) mostly new source: new api, optimizations, simd, vertical-first, etc
+                          2x-5x faster without simd, 4x-12x faster with simd,
+                          in some cases, 20x to 40x faster esp resizing large to very small.
       0.96 (2019-03-04) fixed warnings
       0.95 (2017-07-23) fixed warnings
       0.94 (2017-03-18) fixed warnings
@@ -368,7 +385,7 @@ typedef uint64_t stbir_uint64;
 #define STBIR_SSE
 #endif
 #endif
-#endif 
+#endif
 
 #if defined(_x86_64) || defined( __x86_64__ ) || defined( _M_X64 ) || defined(__x86_64) || defined(_M_AMD64) || defined(__SSE2__) || defined(STBIR_SSE) || defined(STBIR_SSE2)
   #ifndef STBIR_SSE2
@@ -383,7 +400,7 @@ typedef uint64_t stbir_uint64;
   #endif
   #if defined(__AVX2__) || defined(STBIR_AVX2)
     #ifndef STBIR_NO_AVX2
-      #ifndef STBIR_AVX2  
+      #ifndef STBIR_AVX2
         #define STBIR_AVX2
       #endif
       #if defined( _MSC_VER ) && !defined(__clang__)
@@ -400,15 +417,15 @@ typedef uint64_t stbir_uint64;
   #endif
 #endif
 
-#if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || defined(_M_ARM) || (__ARM_NEON_FP & 4) != 0 &&  __ARM_FP16_FORMAT_IEEE != 0
+#if defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || ((__ARM_NEON_FP & 4) != 0) || defined(__ARM_NEON__)
 #ifndef STBIR_NEON
 #define STBIR_NEON
 #endif
 #endif
 
-#if defined(_M_ARM)
+#if defined(_M_ARM) || defined(__arm__)
 #ifdef STBIR_USE_FMA
-#undef STBIR_USE_FMA // no FMA for 32-bit arm on MSVC 
+#undef STBIR_USE_FMA // no FMA for 32-bit arm
 #endif
 #endif
 
@@ -435,7 +452,7 @@ typedef uint64_t stbir_uint64;
 //
 // Easy-to-use API:
 //
-//     * stride is the offset between successive rows of image data 
+//     * stride is the offset between successive rows of image data
 //        in memory, in bytes. specify 0 for packed continuously in memory
 //     * colorspace is linear or sRGB as specified by function name
 //     * Uses the default filters
@@ -448,27 +465,35 @@ typedef uint64_t stbir_uint64;
 //   order of channels
 //   whether color is premultiplied by alpha
 // for back compatibility, you can cast the old channel count to an stbir_pixel_layout
-typedef enum 
+typedef enum
 {
-  STBIR_BGR      = 0,               // 3-chan, with order specified (for channel flipping)
-  STBIR_1CHANNEL = 1,              
+  STBIR_1CHANNEL = 1,
   STBIR_2CHANNEL = 2,
-  STBIR_RGB      = 3,               // 3-chan, with order specified (for channel flipping) 
-  STBIR_RGBA     = 4,               // alpha formats, alpha is NOT premultiplied into color channels
-
+  STBIR_RGB      = 3,               // 3-chan, with order specified (for channel flipping)
+  STBIR_BGR      = 0,               // 3-chan, with order specified (for channel flipping)
   STBIR_4CHANNEL = 5,
+
+  STBIR_RGBA = 4,                   // alpha formats, where alpha is NOT premultiplied into color channels
   STBIR_BGRA = 6,
   STBIR_ARGB = 7,
   STBIR_ABGR = 8,
   STBIR_RA   = 9,
   STBIR_AR   = 10,
 
-  STBIR_RGBA_PM = 11,               // alpha formats, alpha is premultiplied into color channels
+  STBIR_RGBA_PM = 11,               // alpha formats, where alpha is premultiplied into color channels
   STBIR_BGRA_PM = 12,
   STBIR_ARGB_PM = 13,
   STBIR_ABGR_PM = 14,
   STBIR_RA_PM   = 15,
   STBIR_AR_PM   = 16,
+
+  STBIR_RGBA_NO_AW = 11,            // alpha formats, where NO alpha weighting is applied at all!
+  STBIR_BGRA_NO_AW = 12,            //   These are just synonyms for the _PM flags (which also do
+  STBIR_ARGB_NO_AW = 13,            //   no alpha weighting). These names just make it clearer
+  STBIR_ABGR_NO_AW = 14,            //   for some folks.
+  STBIR_RA_NO_AW   = 15,
+  STBIR_AR_NO_AW   = 16,
+
 } stbir_pixel_layout;
 
 //===============================================================
@@ -549,8 +574,8 @@ STBIRDEF void *  stbir_resize( const void *input_pixels , int input_w , int inpu
 //     * Separate input and output data types
 //     * Can specify regions with subpixel correctness
 //     * Can specify alpha flags
-//     * Can specify a memory callback 
-//     * Can specify a callback data type for pixel input and output 
+//     * Can specify a memory callback
+//     * Can specify a callback data type for pixel input and output
 //     * Can be threaded for a single resize
 //     * Can be used to resize many frames without recalculating the sampler info
 //
@@ -577,7 +602,7 @@ typedef float stbir__kernel_callback( float x, float scale, void * user_data );
 typedef float stbir__support_callback( float scale, void * user_data );
 
 // internal structure with precomputed scaling
-typedef struct stbir__info stbir__info; 
+typedef struct stbir__info stbir__info;
 
 typedef struct STBIR_RESIZE  // use the stbir_resize_init and stbir_override functions to set these values for future compatibility
 {
@@ -604,7 +629,7 @@ typedef struct STBIR_RESIZE  // use the stbir_resize_init and stbir_override fun
   stbir_edge horizontal_edge, vertical_edge;
   stbir__kernel_callback * horizontal_filter_kernel; stbir__support_callback * horizontal_filter_support;
   stbir__kernel_callback * vertical_filter_kernel; stbir__support_callback * vertical_filter_support;
-  stbir__info * samplers;      
+  stbir__info * samplers;
 } STBIR_RESIZE;
 
 // extended complexity api
@@ -620,7 +645,7 @@ STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize,
 // You can update these parameters any time after resize_init and there is no cost
 //--------------------------------
 
-STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type );         
+STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_type, stbir_datatype output_type );
 STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb );   // no callbacks by default
 STBIRDEF void stbir_set_user_data( STBIR_RESIZE * resize, void * user_data );                                               // pass back STBIR_RESIZE* by default
 STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_pixels, int input_stride_in_bytes, void * output_pixels, int output_stride_in_bytes );
@@ -636,7 +661,7 @@ STBIRDEF int stbir_set_pixel_layouts( STBIR_RESIZE * resize, stbir_pixel_layout
 STBIRDEF int stbir_set_edgemodes( STBIR_RESIZE * resize, stbir_edge horizontal_edge, stbir_edge vertical_edge );       // CLAMP by default
 
 STBIRDEF int stbir_set_filters( STBIR_RESIZE * resize, stbir_filter horizontal_filter, stbir_filter vertical_filter ); // STBIR_DEFAULT_FILTER_UPSAMPLE/DOWNSAMPLE by default
-STBIRDEF int stbir_set_filter_callbacks( STBIR_RESIZE * resize, stbir__kernel_callback * horizontal_filter, stbir__support_callback * horizontal_support, stbir__kernel_callback * vertical_filter, stbir__support_callback * vertical_support ); 
+STBIRDEF int stbir_set_filter_callbacks( STBIR_RESIZE * resize, stbir__kernel_callback * horizontal_filter, stbir__support_callback * horizontal_support, stbir__kernel_callback * vertical_filter, stbir__support_callback * vertical_support );
 
 STBIRDEF int stbir_set_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby, int subw, int subh );        // sets both sub-regions (full regions by default)
 STBIRDEF int stbir_set_input_subrect( STBIR_RESIZE * resize, double s0, double t0, double s1, double t1 );    // sets input sub-region (full region by default)
@@ -658,7 +683,7 @@ STBIRDEF int stbir_set_non_pm_alpha_speed_over_quality( STBIR_RESIZE * resize, i
 //--------------------------------
 
 // This builds the samplers and does one allocation
-STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize ); 
+STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize );
 
 // You MUST call this, if you call stbir_build_samplers or stbir_build_samplers_with_splits
 STBIRDEF void stbir_free_samplers( STBIR_RESIZE * resize );
@@ -681,7 +706,7 @@ STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize );
 //   It returns the number of splits (threads) that you can call it with.
 ///  It might be less if the image resize can't be split up that many ways.
 
-STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int try_splits );             
+STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int try_splits );
 
 // This function does a split of the resizing (you call this fuction for each
 // split, on multiple threads). A split is a piece of the output resize pixel space.
@@ -691,10 +716,10 @@ STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int try_sp
 // Usually, you will always call stbir_resize_split with split_start as the thread_index
 //   and "1" for the split_count.
 // But, if you have a weird situation where you MIGHT want 8 threads, but sometimes
-//   only 4 threads, you can use 0,2,4,6 for the split_start's and use "2" for the 
+//   only 4 threads, you can use 0,2,4,6 for the split_start's and use "2" for the
 //   split_count each time to turn in into a 4 thread resize. (This is unusual).
 
-STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start, int split_count );         
+STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start, int split_count );
 //===============================================================
 
 
@@ -705,10 +730,10 @@ STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start
 //   The input callback is super flexible - it calls you with the input address
 //   (based on the stride and base pointer), it gives you an optional_output
 //   pointer that you can fill, or you can just return your own pointer into
-//   your own data. 
+//   your own data.
 //
-//   You can also do conversion from non-supported data types if necessary - in 
-//   this case, you ignore the input_ptr and just use the x and y parameters to 
+//   You can also do conversion from non-supported data types if necessary - in
+//   this case, you ignore the input_ptr and just use the x and y parameters to
 //   calculate your own input_ptr based on the size of each non-supported pixel.
 //   (Something like the third example below.)
 //
@@ -722,14 +747,14 @@ STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start
 //           return input_ptr;  // use buffer from call
 //        }
 //
-//     Next example, copying: (copy from some other buffer or stream):  
+//     Next example, copying: (copy from some other buffer or stream):
 //        void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context )
 //        {
 //           CopyOrStreamData( optional_output, other_data_src, num_pixels * pixel_width_in_bytes );
 //           return optional_output;  // return the optional buffer that we filled
 //        }
 //
-//     Third example, input another buffer without copying: (zero-copy from other buffer):  
+//     Third example, input another buffer without copying: (zero-copy from other buffer):
 //        void const * my_callback( void * optional_output, void const * input_ptr, int num_pixels, int x, int y, void * context )
 //        {
 //           void * pixels = ( (char*) other_image_base ) + ( y * other_image_stride ) + ( x * other_pixel_width_in_bytes );
@@ -758,7 +783,7 @@ STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start
 
 #ifdef STBIR_PROFILE
 
-typedef struct STBIR_PROFILE_INFO 
+typedef struct STBIR_PROFILE_INFO
 {
   stbir_uint64 total_clocks;
 
@@ -766,7 +791,7 @@ typedef struct STBIR_PROFILE_INFO
   //    there are "resize_count" number of zones
   stbir_uint64 clocks[ 8 ];
   char const ** descriptions;
-  
+
   // count of clocks and descriptions
   stbir_uint32 count;
 } STBIR_PROFILE_INFO;
@@ -865,15 +890,15 @@ STBIRDEF void stbir_resize_split_profile_info( STBIR_PROFILE_INFO * out_info, ST
 #endif
 
 // the internal pixel layout enums are in a different order, so we can easily do range comparisons of types
-//   the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible 
-typedef enum 
+//   the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible
+typedef enum
 {
   STBIRI_1CHANNEL = 0,
   STBIRI_2CHANNEL = 1,
   STBIRI_RGB      = 2,
   STBIRI_BGR      = 3,
   STBIRI_4CHANNEL = 4,
-  
+
   STBIRI_RGBA = 5,
   STBIRI_BGRA = 6,
   STBIRI_ARGB = 7,
@@ -979,7 +1004,7 @@ typedef struct
   stbir__span spans[2]; // can be two spans, if doing input subrect with clamp mode WRAP
 } stbir__extents;
 
-typedef struct 
+typedef struct
 {
 #ifdef STBIR_PROFILE
   union
@@ -1010,7 +1035,7 @@ typedef struct
 
 typedef void stbir__decode_pixels_func( float * decode, int width_times_channels, void const * input );
 typedef void stbir__alpha_weight_func( float * decode_buffer, int width_times_channels );
-typedef void stbir__horizontal_gather_channels_func( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer, 
+typedef void stbir__horizontal_gather_channels_func( float * output_buffer, unsigned int output_sub_size, float const * decode_buffer,
   stbir__contributors const * horizontal_contributors, float const * horizontal_coefficients, int coefficient_width );
 typedef void stbir__alpha_unweight_func(float * encode_buffer, int width_times_channels );
 typedef void stbir__encode_pixels_func( void * output, int width_times_channels, float const * encode );
@@ -1053,10 +1078,10 @@ struct stbir__info
   stbir__horizontal_gather_channels_func * horizontal_gather_channels;
   stbir__alpha_unweight_func * alpha_unweight;
   stbir__encode_pixels_func * encode_pixels;
-  
-  int alloced_total;
+
+  int alloc_ring_buffer_num_entries;    // Number of entries in the ring buffer that will be allocated
   int splits; // count of splits
-  
+
   stbir_internal_pixel_layout input_pixel_layout_internal;
   stbir_internal_pixel_layout output_pixel_layout_internal;
 
@@ -1065,7 +1090,7 @@ struct stbir__info
   int vertical_first;
   int channels;
   int effective_channels; // same as channels, except on RGBA/ARGB (7), or XA/AX (3)
-  int alloc_ring_buffer_num_entries;    // Number of entries in the ring buffer that will be allocated
+  size_t alloced_total;
 };
 
 
@@ -1076,10 +1101,11 @@ struct stbir__info
 #define stbir__small_float ((float)1 / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20) / (1 << 20))
 
 // min/max friendly
-#define STBIR_CLAMP(x, xmin, xmax) do { \
+#define STBIR_CLAMP(x, xmin, xmax) for(;;) { \
   if ( (x) < (xmin) ) (x) = (xmin);     \
   if ( (x) > (xmax) ) (x) = (xmax);     \
-} while (0)
+  break;                                \
+}
 
 static stbir__inline int stbir__min(int a, int b)
 {
@@ -1141,7 +1167,7 @@ static const stbir_uint32 fp32_to_srgb8_tab4[104] = {
   0x44c20798, 0x488e071e, 0x4c1c06b6, 0x4f76065d, 0x52a50610, 0x55ac05cc, 0x5892058f, 0x5b590559,
   0x5e0c0a23, 0x631c0980, 0x67db08f6, 0x6c55087f, 0x70940818, 0x74a007bd, 0x787d076c, 0x7c330723,
 };
- 
+
 static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
 {
   static const stbir__FP32 almostone = { 0x3f7fffff }; // 1-eps
@@ -1172,19 +1198,44 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
 #define STBIR_FORCE_GATHER_FILTER_SCANLINES_AMOUNT 32 // when downsampling and <= 32 scanlines of buffering, use gather. gather used down to 1/8th scaling for 25% win.
 #endif
 
-// restrict pointers for the output pointers
+#ifndef STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS
+#define STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS 4 // when threading, what is the minimum number of scanlines for a split?
+#endif
+
+// restrict pointers for the output pointers, other loop and unroll control
 #if defined( _MSC_VER ) && !defined(__clang__)
   #define STBIR_STREAMOUT_PTR( star ) star __restrict
   #define STBIR_NO_UNROLL( ptr ) __assume(ptr) // this oddly keeps msvc from unrolling a loop
-#elif defined(  __clang__ )
-  #define STBIR_STREAMOUT_PTR( star ) star __restrict__
-  #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr))
-#elif defined(  __GNUC__ )
+  #if _MSC_VER >= 1900
+    #define STBIR_NO_UNROLL_LOOP_START __pragma(loop( no_vector ))
+  #else
+    #define STBIR_NO_UNROLL_LOOP_START
+  #endif
+#elif defined( __clang__ )
+  #define STBIR_STREAMOUT_PTR( star ) star __restrict__
+  #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr))
+  #if ( __clang_major__ >= 4 ) || ( ( __clang_major__ >= 3 ) && ( __clang_minor__ >= 5 ) )
+    #define STBIR_NO_UNROLL_LOOP_START _Pragma("clang loop unroll(disable)") _Pragma("clang loop vectorize(disable)")
+  #else
+    #define STBIR_NO_UNROLL_LOOP_START
+  #endif
+#elif defined( __GNUC__ )
   #define STBIR_STREAMOUT_PTR( star ) star __restrict__
   #define STBIR_NO_UNROLL( ptr ) __asm__ (""::"r"(ptr))
+  #if __GNUC__ >= 14
+    #define STBIR_NO_UNROLL_LOOP_START _Pragma("GCC unroll 0") _Pragma("GCC novector")
+  #else
+    #define STBIR_NO_UNROLL_LOOP_START
+  #endif
+  #define STBIR_NO_UNROLL_LOOP_START_INF_FOR
 #else
   #define STBIR_STREAMOUT_PTR( star ) star
   #define STBIR_NO_UNROLL( ptr )
+  #define STBIR_NO_UNROLL_LOOP_START
+#endif
+
+#ifndef STBIR_NO_UNROLL_LOOP_START_INF_FOR
+#define STBIR_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START
 #endif
 
 #ifdef STBIR_NO_SIMD // force simd off for whatever reason
@@ -1223,7 +1274,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
 
 #ifdef STBIR_SSE2
   #include <emmintrin.h>
-  
+
   #define stbir__simdf __m128
   #define stbir__simdi __m128i
 
@@ -1254,7 +1305,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
   #define stbir__simdi_store2( ptr, reg ) _mm_storel_epi64( (__m128i*)(ptr), (reg) )
 
   #define stbir__prefetch( ptr ) _mm_prefetch((char*)(ptr), _MM_HINT_T0 )
- 
+
   #define stbir__simdi_expand_u8_to_u32(out0,out1,out2,out3,ireg) \
   { \
     stbir__simdi zero = _mm_setzero_si128(); \
@@ -1285,7 +1336,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
   #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)_mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(f,STBIR__CONSTF(STBIR_max_uint8_as_float)),_mm_setzero_ps()))))
   #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)_mm_cvtsi128_si32(_mm_cvttps_epi32(_mm_max_ps(_mm_min_ps(f,STBIR__CONSTF(STBIR_max_uint16_as_float)),_mm_setzero_ps()))))
 
-  #define stbir__simdi_to_int( i ) _mm_cvtsi128_si32(i) 
+  #define stbir__simdi_to_int( i ) _mm_cvtsi128_si32(i)
   #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = _mm_cvtepi32_ps( ireg )
   #define stbir__simdf_add( out, reg0, reg1 ) (out) = _mm_add_ps( reg0, reg1 )
   #define stbir__simdf_mult( out, reg0, reg1 ) (out) = _mm_mul_ps( reg0, reg1 )
@@ -1440,10 +1491,10 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
 
     #define stbir__simdi8_convert_i32_to_float(out, ireg) (out) = _mm256_cvtepi32_ps( ireg )
     #define stbir__simdf8_convert_float_to_i32( i, f ) (i) = _mm256_cvttps_epi32(f)
-  
+
     #define stbir__simdf8_bot4s( out, a, b ) (out) = _mm256_permute2f128_ps(a,b, (0<<0)+(2<<4) )
     #define stbir__simdf8_top4s( out, a, b ) (out) = _mm256_permute2f128_ps(a,b, (1<<0)+(3<<4) )
-    
+
     #define stbir__simdf8_gettop4( reg ) _mm256_extractf128_ps(reg,1)
 
     #ifdef STBIR_AVX2
@@ -1471,8 +1522,8 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
       out = _mm256_castsi256_si128( _mm256_permute4x64_epi64( _mm256_packus_epi16( t, t ), (0<<0)+(2<<2)+(1<<4)+(3<<6) ) ); \
     }
 
-    #define stbir__simdi8_expand_u16_to_u32(out,ireg) out = _mm256_unpacklo_epi16( _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),(0<<0)+(2<<2)+(1<<4)+(3<<6)), _mm256_setzero_si256() ); 
-  
+    #define stbir__simdi8_expand_u16_to_u32(out,ireg) out = _mm256_unpacklo_epi16( _mm256_permute4x64_epi64(_mm256_castsi128_si256(ireg),(0<<0)+(2<<2)+(1<<4)+(3<<6)), _mm256_setzero_si256() );
+
     #define stbir__simdf8_pack_to_16words(out,aa,bb) \
       { \
         stbir__simdf8 af,bf; \
@@ -1496,7 +1547,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
       a = _mm_unpackhi_epi8( ireg, zero ); \
       out1 = _mm256_setr_m128i( _mm_unpacklo_epi16( a, zero ), _mm_unpackhi_epi16( a, zero ) ); \
     }
-  
+
     #define stbir__simdf8_pack_to_16bytes(out,aa,bb) \
     { \
       stbir__simdi t; \
@@ -1514,7 +1565,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
       t = _mm_packus_epi16( t, t ); \
       out = _mm_castps_si128( _mm_shuffle_ps( _mm_castsi128_ps(out), _mm_castsi128_ps(t), (0<<0)+(1<<2)+(0<<4)+(1<<6) ) ); \
     }
-  
+
     #define stbir__simdi8_expand_u16_to_u32(out,ireg) \
     { \
       stbir__simdi a,b,zero = _mm_setzero_si128(); \
@@ -1549,7 +1600,6 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
 
     #define stbir__simdf8_0123to2222( out, in ) (out) = stbir__simdf_swiz(_mm256_castps256_ps128(in), 2,2,2,2 )
 
-    #define stbir__simdf8_load2( out, ptr ) (out) = _mm256_castsi256_ps(_mm256_castsi128_si256( _mm_loadl_epi64( (__m128i*)(ptr)) )) // top values can be random (not denormal or nan for perf)
     #define stbir__simdf8_load4b( out, ptr ) (out) = _mm256_broadcast_ps( (__m128 const *)(ptr) )
 
     static __m256i stbir_00112233 = { STBIR__CONST_4d_32i( 0, 0, 1, 1 ), STBIR__CONST_4d_32i( 2, 2, 3, 3 ) };
@@ -1582,11 +1632,11 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
     #ifdef STBIR_USE_FMA           // not on by default to maintain bit identical simd to non-simd
     #define stbir__simdf8_madd( out, add, mul1, mul2 ) (out) = _mm256_fmadd_ps( mul1, mul2, add )
     #define stbir__simdf8_madd_mem( out, add, mul, ptr ) (out) = _mm256_fmadd_ps( mul, _mm256_loadu_ps( (float const*)(ptr) ), add )
-    #define stbir__simdf8_madd_mem4( out, add, mul, ptr ) (out) = _mm256_fmadd_ps( _mm256_castps128_ps256( mul ), _mm256_castps128_ps256( _mm_loadu_ps( (float const*)(ptr) ) ), add )
+    #define stbir__simdf8_madd_mem4( out, add, mul, ptr ) (out) = _mm256_fmadd_ps( _mm256_setr_m128( mul, _mm_setzero_ps() ), _mm256_setr_m128( _mm_loadu_ps( (float const*)(ptr) ), _mm_setzero_ps() ), add )
     #else
     #define stbir__simdf8_madd( out, add, mul1, mul2 ) (out) = _mm256_add_ps( add, _mm256_mul_ps( mul1, mul2 ) )
     #define stbir__simdf8_madd_mem( out, add, mul, ptr ) (out) = _mm256_add_ps( add, _mm256_mul_ps( mul, _mm256_loadu_ps( (float const*)(ptr) ) ) )
-    #define stbir__simdf8_madd_mem4( out, add, mul, ptr ) (out) = _mm256_add_ps( add, _mm256_castps128_ps256( _mm_mul_ps( mul, _mm_loadu_ps( (float const*)(ptr) ) ) ) )
+    #define stbir__simdf8_madd_mem4( out, add, mul, ptr ) (out) = _mm256_add_ps( add, _mm256_setr_m128( _mm_mul_ps( mul, _mm_loadu_ps( (float const*)(ptr) ) ), _mm_setzero_ps() ) )
     #endif
     #define stbir__if_simdf8_cast_to_simdf4( val ) _mm256_castps256_ps128( val )
 
@@ -1627,7 +1677,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
   }
 
 #elif defined(STBIR_NEON)
-  
+
   #include <arm_neon.h>
 
   #define stbir__simdf float32x4_t
@@ -1686,7 +1736,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
 
   #define stbir__simdf_convert_float_to_i32( i, f ) (i) = vreinterpretq_u32_s32( vcvtq_s32_f32(f) )
   #define stbir__simdf_convert_float_to_int( f ) vgetq_lane_s32(vcvtq_s32_f32(f), 0)
-  #define stbir__simdi_to_int( i ) (int)vgetq_lane_u32(i, 0) 
+  #define stbir__simdi_to_int( i ) (int)vgetq_lane_u32(i, 0)
   #define stbir__simdf_convert_float_to_uint8( f ) ((unsigned char)vgetq_lane_s32(vcvtq_s32_f32(vmaxq_f32(vminq_f32(f,STBIR__CONSTF(STBIR_max_uint8_as_float)),vdupq_n_f32(0))), 0))
   #define stbir__simdf_convert_float_to_short( f ) ((unsigned short)vgetq_lane_s32(vcvtq_s32_f32(vmaxq_f32(vminq_f32(f,STBIR__CONSTF(STBIR_max_uint16_as_float)),vdupq_n_f32(0))), 0))
   #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = vcvtq_f32_s32( vreinterpretq_s32_u32(ireg) )
@@ -1737,12 +1787,20 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
           ((stbir_uint64)(4*b+0)<<32) | ((stbir_uint64)(4*b+1)<<40) | ((stbir_uint64)(4*b+2)<<48) | ((stbir_uint64)(4*b+3)<<56)), \
         vcreate_u8( (4*c+0) | ((4*c+1)<<8) | ((4*c+2)<<16) | ((4*c+3)<<24) | \
           ((stbir_uint64)(4*d+0)<<32) | ((stbir_uint64)(4*d+1)<<40) | ((stbir_uint64)(4*d+2)<<48) | ((stbir_uint64)(4*d+3)<<56) ) )
+
+      static stbir__inline uint8x16x2_t stbir_make16x2(float32x4_t rega,float32x4_t regb)
+      {
+        uint8x16x2_t r = { vreinterpretq_u8_f32(rega), vreinterpretq_u8_f32(regb) };
+        return r;
+      }
     #else
       #define stbir_make16(a,b,c,d) (uint8x16_t){4*a+0,4*a+1,4*a+2,4*a+3,4*b+0,4*b+1,4*b+2,4*b+3,4*c+0,4*c+1,4*c+2,4*c+3,4*d+0,4*d+1,4*d+2,4*d+3}
+      #define stbir_make16x2(a,b) (uint8x16x2_t){{vreinterpretq_u8_f32(a),vreinterpretq_u8_f32(b)}}
     #endif
 
     #define stbir__simdf_swiz( reg, one, two, three, four ) vreinterpretq_f32_u8( vqtbl1q_u8( vreinterpretq_u8_f32(reg), stbir_make16(one, two, three, four) ) )
-  
+    #define stbir__simdf_swiz2( rega, regb, one, two, three, four ) vreinterpretq_f32_u8( vqtbl2q_u8( stbir_make16x2(rega,regb), stbir_make16(one, two, three, four) ) )
+
     #define stbir__simdi_16madd( out, reg0, reg1 ) \
     { \
       int16x8_t r0 = vreinterpretq_s16_u32(reg0); \
@@ -1942,7 +2000,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
 
   #define stbir__simdf_convert_float_to_i32( i, f )    (i) = wasm_i32x4_trunc_sat_f32x4(f)
   #define stbir__simdf_convert_float_to_int( f )       wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(f), 0)
-  #define stbir__simdi_to_int( i )                     wasm_i32x4_extract_lane(i, 0) 
+  #define stbir__simdi_to_int( i )                     wasm_i32x4_extract_lane(i, 0)
   #define stbir__simdf_convert_float_to_uint8( f )     ((unsigned char)wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_max(wasm_f32x4_min(f,STBIR_max_uint8_as_float),wasm_f32x4_const_splat(0))), 0))
   #define stbir__simdf_convert_float_to_short( f )     ((unsigned short)wasm_i32x4_extract_lane(wasm_i32x4_trunc_sat_f32x4(wasm_f32x4_max(wasm_f32x4_min(f,STBIR_max_uint16_as_float),wasm_f32x4_const_splat(0))), 0))
   #define stbir__simdi_convert_i32_to_float(out, ireg) (out) = wasm_f32x4_convert_i32x4(ireg)
@@ -2125,7 +2183,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
 #endif
 
 
-#if defined(STBIR_NEON) && !defined(_M_ARM)
+#if defined(STBIR_NEON) && !defined(_M_ARM) && !defined(__arm__)
 
   #if defined( _MSC_VER ) && !defined(__clang__)
   typedef __int16 stbir__FP16;
@@ -2142,7 +2200,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
 
 #endif
 
-#if !defined(STBIR_NEON) && !defined(STBIR_FP16C) || defined(STBIR_NEON) && defined(_M_ARM)
+#if (!defined(STBIR_NEON) && !defined(STBIR_FP16C)) || (defined(STBIR_NEON) && defined(_M_ARM)) || (defined(STBIR_NEON) && defined(__arm__))
 
   // Fabian's half float routines, see: https://gist.github.com/rygorous/2156668
 
@@ -2168,7 +2226,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
     unsigned int sign_mask = 0x80000000u;
     stbir__FP16 o = { 0 };
     stbir__FP32 f;
-    unsigned int sign; 
+    unsigned int sign;
 
     f.f = val;
     sign = f.u & sign_mask;
@@ -2369,24 +2427,6 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
     stbir__simdi_store( output,final );
   }
 
-#elif defined(STBIR_WASM) || (defined(STBIR_NEON) && defined(_MSC_VER) && defined(_M_ARM)) // WASM or 32-bit ARM on MSVC/clang
-
-  static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
-  {
-    for (int i=0; i<8; i++)
-    {
-      output[i] = stbir__half_to_float(input[i]);
-    }
-  }
-
-  static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
-  {
-    for (int i=0; i<8; i++)
-    {
-      output[i] = stbir__float_to_half(input[i]);
-    }
-  }
-
 #elif defined(STBIR_NEON) && defined(_MSC_VER) && defined(_M_ARM64) && !defined(__clang__) // 64-bit ARM on MSVC (not clang)
 
   static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
@@ -2415,7 +2455,7 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
     return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0).n16_u16[0];
   }
 
-#elif defined(STBIR_NEON) // 64-bit ARM
+#elif defined(STBIR_NEON) && ( defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) ) // 64-bit ARM
 
   static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
   {
@@ -2441,6 +2481,23 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
     return vget_lane_f16(vcvt_f16_f32(vdupq_n_f32(f)), 0);
   }
 
+#elif defined(STBIR_WASM) || (defined(STBIR_NEON) && (defined(_MSC_VER) || defined(_M_ARM) || defined(__arm__))) // WASM or 32-bit ARM on MSVC/clang
+
+  static stbir__inline void stbir__half_to_float_SIMD(float * output, stbir__FP16 const * input)
+  {
+    for (int i=0; i<8; i++)
+    {
+      output[i] = stbir__half_to_float(input[i]);
+    }
+  }
+  static stbir__inline void stbir__float_to_half_SIMD(stbir__FP16 * output, float const * input)
+  {
+    for (int i=0; i<8; i++)
+    {
+      output[i] = stbir__float_to_half(input[i]);
+    }
+  }
+
 #endif
 
 
@@ -2462,10 +2519,10 @@ static stbir__inline stbir_uint8 stbir__linear_to_srgb_uchar(float in)
 #define stbir__simdf_0123to3012( out, reg ) (out) = stbir__simdf_swiz( reg, 3,0,1,2 )
 #define stbir__simdf_0123to0011( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,1,1 )
 #define stbir__simdf_0123to1100( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,0,0 )
-#define stbir__simdf_0123to2233( out, reg ) (out) = stbir__simdf_swiz( reg, 2,2,3,3 ) 
-#define stbir__simdf_0123to1133( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,3,3 ) 
-#define stbir__simdf_0123to0022( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,2,2 ) 
-#define stbir__simdf_0123to1032( out, reg ) (out) = stbir__simdf_swiz( reg, 1,0,3,2 ) 
+#define stbir__simdf_0123to2233( out, reg ) (out) = stbir__simdf_swiz( reg, 2,2,3,3 )
+#define stbir__simdf_0123to1133( out, reg ) (out) = stbir__simdf_swiz( reg, 1,1,3,3 )
+#define stbir__simdf_0123to0022( out, reg ) (out) = stbir__simdf_swiz( reg, 0,0,2,2 )
+#define stbir__simdf_0123to1032( out, reg ) (out) = stbir__simdf_swiz( reg, 1,0,3,2 )
 
 typedef union stbir__simdi_u32
 {
@@ -2493,14 +2550,16 @@ static const STBIR__SIMDI_CONST(STBIR_topscale,      0x02000000);
 //   Adding this switch saves about 5K on clang which is Captain Unroll the 3rd.
 #define STBIR_SIMD_STREAMOUT_PTR( star )  STBIR_STREAMOUT_PTR( star )
 #define STBIR_SIMD_NO_UNROLL(ptr) STBIR_NO_UNROLL(ptr)
+#define STBIR_SIMD_NO_UNROLL_LOOP_START STBIR_NO_UNROLL_LOOP_START
+#define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR STBIR_NO_UNROLL_LOOP_START_INF_FOR
 
 #ifdef STBIR_MEMCPY
 #undef STBIR_MEMCPY
-#define STBIR_MEMCPY stbir_simd_memcpy
 #endif
+#define STBIR_MEMCPY stbir_simd_memcpy
 
 // override normal use of memcpy with much simpler copy (faster and smaller with our sized copies)
-static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes ) 
+static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes )
 {
   char STBIR_SIMD_STREAMOUT_PTR (*) d = (char*) dest;
   char STBIR_SIMD_STREAMOUT_PTR( * ) d_end = ((char*) dest) + bytes;
@@ -2513,8 +2572,9 @@ static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes )
   {
     if ( bytes < 16 )
     {
-      if ( bytes ) 
+      if ( bytes )
       {
+        STBIR_SIMD_NO_UNROLL_LOOP_START
         do
         {
           STBIR_SIMD_NO_UNROLL(d);
@@ -2529,8 +2589,9 @@ static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes )
       // do one unaligned to get us aligned for the stream out below
       stbir__simdf_load( x, ( d + ofs_to_src ) );
       stbir__simdf_store( d, x );
-      d = (char*)( ( ( (ptrdiff_t)d ) + 16 ) & ~15 );
+      d = (char*)( ( ( (size_t)d ) + 16 ) & ~15 );
 
+      STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
       for(;;)
       {
         STBIR_SIMD_NO_UNROLL(d);
@@ -2561,12 +2622,13 @@ static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes )
     stbir__simdfX_store( d +  4*stbir__simdfX_float_count, x1 );
     stbir__simdfX_store( d +  8*stbir__simdfX_float_count, x2 );
     stbir__simdfX_store( d + 12*stbir__simdfX_float_count, x3 );
-    d = (char*)( ( ( (ptrdiff_t)d ) + (16*stbir__simdfX_float_count) ) & ~((16*stbir__simdfX_float_count)-1) );
+    d = (char*)( ( ( (size_t)d ) + (16*stbir__simdfX_float_count) ) & ~((16*stbir__simdfX_float_count)-1) );
 
+    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       STBIR_SIMD_NO_UNROLL(d);
-  
+
       if ( d > ( d_end - (16*stbir__simdfX_float_count) ) )
       {
         if ( d == d_end )
@@ -2590,7 +2652,7 @@ static void stbir_simd_memcpy( void * dest, void const * src, size_t bytes )
 // memcpy that is specically intentionally overlapping (src is smaller then dest, so can be
 //   a normal forward copy, bytes is divisible by 4 and bytes is greater than or equal to
 //   the diff between dest and src)
-static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes ) 
+static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes )
 {
   char STBIR_SIMD_STREAMOUT_PTR (*) sd = (char*) src;
   char STBIR_SIMD_STREAMOUT_PTR( * ) s_end = ((char*) src) + bytes;
@@ -2599,6 +2661,7 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte
   if ( ofs_to_dest >= 16 ) // is the overlap more than 16 away?
   {
     char STBIR_SIMD_STREAMOUT_PTR( * ) s_end16 = ((char*) src) + (bytes&~15);
+    STBIR_SIMD_NO_UNROLL_LOOP_START
     do
     {
       stbir__simdf x;
@@ -2615,7 +2678,7 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte
   do
   {
     STBIR_SIMD_NO_UNROLL(sd);
-    *(int*)( sd + ofs_to_dest ) = *(int*) sd; 
+    *(int*)( sd + ofs_to_dest ) = *(int*) sd;
     sd += 4;
   } while ( sd < s_end );
 }
@@ -2624,13 +2687,17 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte
 
 // when in scalar mode, we let unrolling happen, so this macro just does the __restrict
 #define STBIR_SIMD_STREAMOUT_PTR( star ) STBIR_STREAMOUT_PTR( star )
-#define STBIR_SIMD_NO_UNROLL(ptr) 
+#define STBIR_SIMD_NO_UNROLL(ptr)
+#define STBIR_SIMD_NO_UNROLL_LOOP_START
+#define STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
 
 #endif // SSE2
 
 
 #ifdef STBIR_PROFILE
 
+#ifndef STBIR_PROFILE_FUNC
+
 #if defined(_x86_64) || defined( __x86_64__ ) || defined( _M_X64 ) || defined(__x86_64) || defined(__SSE2__) || defined(STBIR_SSE) || defined( _M_IX86_FP ) || defined(__i386) || defined( __i386__ ) || defined( _M_IX86 ) || defined( _X86_ )
 
 #ifdef _MSC_VER
@@ -2640,7 +2707,7 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte
 
 #else // non msvc
 
-  static stbir__inline stbir_uint64 STBIR_PROFILE_FUNC() 
+  static stbir__inline stbir_uint64 STBIR_PROFILE_FUNC()
   {
     stbir_uint32 lo, hi;
     asm volatile ("rdtsc" : "=a" (lo), "=d" (hi) );
@@ -2649,7 +2716,7 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte
 
 #endif  // msvc
 
-#elif defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || defined(__ARM_NEON__) 
+#elif defined( _M_ARM64 ) || defined( __aarch64__ ) || defined( __arm64__ ) || defined(__ARM_NEON__)
 
 #if defined( _MSC_VER ) && !defined(__clang__)
 
@@ -2670,8 +2737,9 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte
 
 #error Unknown platform for profiling.
 
-#endif  //x64 and   
+#endif  // x64, arm
 
+#endif // STBIR_PROFILE_FUNC
 
 #define STBIR_ONLY_PROFILE_GET_SPLIT_INFO ,stbir__per_split_info * split_info
 #define STBIR_ONLY_PROFILE_SET_SPLIT_INFO ,split_info
@@ -2680,7 +2748,7 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte
 #define STBIR_ONLY_PROFILE_BUILD_SET_INFO ,profile_info
 
 // super light-weight micro profiler
-#define STBIR_PROFILE_START_ll( info, wh ) { stbir_uint64 wh##thiszonetime = STBIR_PROFILE_FUNC(); stbir_uint64 * wh##save_parent_excluded_ptr = info->current_zone_excluded_ptr; stbir_uint64 wh##current_zone_excluded = 0; info->current_zone_excluded_ptr = &wh##current_zone_excluded; 
+#define STBIR_PROFILE_START_ll( info, wh ) { stbir_uint64 wh##thiszonetime = STBIR_PROFILE_FUNC(); stbir_uint64 * wh##save_parent_excluded_ptr = info->current_zone_excluded_ptr; stbir_uint64 wh##current_zone_excluded = 0; info->current_zone_excluded_ptr = &wh##current_zone_excluded;
 #define STBIR_PROFILE_END_ll( info, wh ) wh##thiszonetime = STBIR_PROFILE_FUNC() - wh##thiszonetime; info->profile.named.wh += wh##thiszonetime - wh##current_zone_excluded; *wh##save_parent_excluded_ptr += wh##thiszonetime; info->current_zone_excluded_ptr = wh##save_parent_excluded_ptr; }
 #define STBIR_PROFILE_FIRST_START_ll( info, wh ) { int i; info->current_zone_excluded_ptr = &info->profile.named.total; for(i=0;i<STBIR__ARRAY_SIZE(info->profile.array);i++) info->profile.array[i]=0; } STBIR_PROFILE_START_ll( info, wh );
 #define STBIR_PROFILE_CLEAR_EXTRAS_ll( info, num ) { int extra; for(extra=1;extra<(num);extra++) { int i; for(i=0;i<STBIR__ARRAY_SIZE((info)->profile.array);i++) (info)[extra].profile.array[i]=0; } }
@@ -2710,8 +2778,8 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte
 #define STBIR_PROFILE_FIRST_START( wh )
 #define STBIR_PROFILE_CLEAR_EXTRAS( )
 
-#define STBIR_PROFILE_BUILD_START( wh ) 
-#define STBIR_PROFILE_BUILD_END( wh ) 
+#define STBIR_PROFILE_BUILD_START( wh )
+#define STBIR_PROFILE_BUILD_END( wh )
 #define STBIR_PROFILE_BUILD_FIRST_START( wh )
 #define STBIR_PROFILE_BUILD_CLEAR( info )
 
@@ -2736,10 +2804,10 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte
 
 #ifndef STBIR_SIMD
 
-// memcpy that is specically intentionally overlapping (src is smaller then dest, so can be
+// memcpy that is specifically intentionally overlapping (src is smaller than dest, so can be
 //   a normal forward copy, bytes is divisible by 4 and bytes is greater than or equal to
 //   the diff between dest and src)
-static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes ) 
+static void stbir_overlapping_memcpy( void * dest, void const * src, size_t bytes )
 {
   char STBIR_SIMD_STREAMOUT_PTR (*) sd = (char*) src;
   char STBIR_SIMD_STREAMOUT_PTR( * ) s_end = ((char*) src) + bytes;
@@ -2748,10 +2816,11 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte
   if ( ofs_to_dest >= 8 ) // is the overlap more than 8 away?
   {
     char STBIR_SIMD_STREAMOUT_PTR( * ) s_end8 = ((char*) src) + (bytes&~7);
+    STBIR_NO_UNROLL_LOOP_START
     do
     {
       STBIR_NO_UNROLL(sd);
-      *(stbir_uint64*)( sd + ofs_to_dest ) = *(stbir_uint64*) sd; 
+      *(stbir_uint64*)( sd + ofs_to_dest ) = *(stbir_uint64*) sd;
       sd += 8;
     } while ( sd < s_end8 );
 
@@ -2759,10 +2828,11 @@ static void stbir_overlapping_memcpy( void * dest, void const * src, size_t byte
       return;
   }
 
+  STBIR_NO_UNROLL_LOOP_START
   do
   {
     STBIR_NO_UNROLL(sd);
-    *(int*)( sd + ofs_to_dest ) = *(int*) sd; 
+    *(int*)( sd + ofs_to_dest ) = *(int*) sd;
     sd += 4;
   } while ( sd < s_end );
 }
@@ -2863,13 +2933,6 @@ static float stbir__filter_mitchell(float x, float s, void * user_data)
   return (0.0f);
 }
 
-static float stbir__support_zero(float s, void * user_data)
-{
-  STBIR__UNUSED(s);
-  STBIR__UNUSED(user_data);
-  return 0;
-}
-
 static float stbir__support_zeropoint5(float s, void * user_data)
 {
   STBIR__UNUSED(s);
@@ -2884,7 +2947,7 @@ static float stbir__support_one(float s, void * user_data)
   return 1;
 }
 
-static float stbir__support_two(float s, void * user_data) 
+static float stbir__support_two(float s, void * user_data)
 {
   STBIR__UNUSED(s);
   STBIR__UNUSED(user_data);
@@ -2903,7 +2966,7 @@ static int stbir__get_filter_pixel_width(stbir__support_callback * support, floa
     return (int)STBIR_CEILF(support(scale,user_data) * 2.0f / scale);
 }
 
-// this is how many coefficents per run of the filter (which is different 
+// this is how many coefficients per run of the filter (which is different
 //   from the filter_pixel_width depending on if we are scattering or gathering)
 static int stbir__get_coefficient_width(stbir__sampler * samp, int is_gather, void * user_data)
 {
@@ -2924,7 +2987,7 @@ static int stbir__get_coefficient_width(stbir__sampler * samp, int is_gather, vo
   }
 }
 
-static int stbir__get_contributors(stbir__sampler * samp, int is_gather)  
+static int stbir__get_contributors(stbir__sampler * samp, int is_gather)
 {
   if (is_gather)
       return samp->scale_info.output_sub_size;
@@ -2954,7 +3017,7 @@ static int stbir__edge_reflect_full( int n, int max )
 {
   if (n < 0)
   {
-    if (n > -max)    
+    if (n > -max)
       return -n;
     else
       return max - 1;
@@ -3056,7 +3119,7 @@ static void stbir__get_extents( stbir__sampler * samp, stbir__extents * scanline
     left_margin = -min_n;
     min_n = 0;
   }
-  
+
   right_margin = 0;
   if ( max_n >= input_full_size )
   {
@@ -3081,7 +3144,7 @@ static void stbir__get_extents( stbir__sampler * samp, stbir__extents * scanline
   // don't have to do edge calc for zero clamp
   if ( edge == STBIR_EDGE_ZERO )
     return;
-  
+
   // convert margin pixels to the pixels within the input (min and max)
   for( j = -left_margin ; j < 0 ; j++ )
   {
@@ -3179,20 +3242,20 @@ static void stbir__calculate_in_pixel_range( int * first_pixel, int * last_pixel
   float out_pixel_influence_lowerbound = out_pixel_center - out_filter_radius;
   float out_pixel_influence_upperbound = out_pixel_center + out_filter_radius;
 
-  float in_pixel_influence_lowerbound = (out_pixel_influence_lowerbound + out_shift) * inv_scale; 
-  float in_pixel_influence_upperbound = (out_pixel_influence_upperbound + out_shift) * inv_scale; 
+  float in_pixel_influence_lowerbound = (out_pixel_influence_lowerbound + out_shift) * inv_scale;
+  float in_pixel_influence_upperbound = (out_pixel_influence_upperbound + out_shift) * inv_scale;
 
   first = (int)(STBIR_FLOORF(in_pixel_influence_lowerbound + 0.5f));
   last = (int)(STBIR_FLOORF(in_pixel_influence_upperbound - 0.5f));
 
   if ( edge == STBIR_EDGE_WRAP )
   {
-    if ( first <= -input_size )
-      first = -(input_size-1);
+    if ( first < -input_size )
+      first = -input_size;
     if ( last >= (input_size*2))
       last = (input_size*2) - 1;
   }
-  
+
   *first_pixel = first;
   *last_pixel = last;
 }
@@ -3213,10 +3276,10 @@ static void stbir__calculate_coefficients_for_gather_upsample( float out_filter_
     int i;
     int last_non_zero;
     float out_pixel_center = (float)n + 0.5f;
-    float in_center_of_out = (out_pixel_center + out_shift) * inv_scale;  
+    float in_center_of_out = (out_pixel_center + out_shift) * inv_scale;
 
     int in_first_pixel, in_last_pixel;
-    
+
     stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, out_pixel_center, out_filter_radius, inv_scale, out_shift, input_size, edge );
 
     last_non_zero = -1;
@@ -3229,7 +3292,7 @@ static void stbir__calculate_coefficients_for_gather_upsample( float out_filter_
       if ( ( ( coeff < stbir__small_float ) && ( coeff > -stbir__small_float ) ) )
       {
         if ( i == 0 )  // if we're at the front, just eat zero contributors
-        { 
+        {
           STBIR_ASSERT ( ( in_last_pixel - in_first_pixel ) != 0 ); // there should be at least one contrib
           ++in_first_pixel;
           i--;
@@ -3239,10 +3302,10 @@ static void stbir__calculate_coefficients_for_gather_upsample( float out_filter_
       }
       else
         last_non_zero = i;
-      
+
       coefficient_group[i] = coeff;
     }
-    
+
     in_last_pixel = last_non_zero+in_first_pixel; // kills trailing zeros
     contributors->n0 = in_first_pixel;
     contributors->n1 = in_last_pixel;
@@ -3354,7 +3417,7 @@ static void stbir__calculate_coefficients_for_gather_downsample( int start, int
         stbir__contributors * contribs = contributors + out;
 
         // is this the first time this output pixel has been seen?  Init it.
-        if ( out > first_out_inited ) 
+        if ( out > first_out_inited )
         {
           STBIR_ASSERT( out == ( first_out_inited + 1 ) ); // ensure we have only advanced one at time
           first_out_inited = out;
@@ -3362,7 +3425,7 @@ static void stbir__calculate_coefficients_for_gather_downsample( int start, int
           contribs->n1 = in_pixel;
           coeffs[0]  = coeff;
         }
-        else 
+        else
         {
           // insert on end (always in order)
           if ( coeffs[0] == 0.0f )  // if the first coefficent is zero, then zap it for this coeffs
@@ -3379,10 +3442,16 @@ static void stbir__calculate_coefficients_for_gather_downsample( int start, int
   }
 }
 
+#ifdef STBIR_RENORMALIZE_IN_FLOAT
+#define STBIR_RENORM_TYPE float
+#else
+#define STBIR_RENORM_TYPE double
+#endif
+
 static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter_extent_info* filter_info, stbir__scale_info * scale_info, int num_contributors, stbir__contributors* contributors, float * coefficient_group, int coefficient_width )
 {
   int input_size = scale_info->input_full_size;
-  int input_last_n1 = input_size - 1; 
+  int input_last_n1 = input_size - 1;
   int n, end;
   int lowest = 0x7fffffff;
   int highest = -0x7fffffff;
@@ -3400,14 +3469,14 @@ static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter
   for (n = 0; n < end; n++)
   {
     int i;
-    float filter_scale, total_filter = 0;
+    STBIR_RENORM_TYPE filter_scale, total_filter = 0;
     int e;
 
     // add all contribs
     e = contribs->n1 - contribs->n0;
     for( i = 0 ; i <= e ; i++ )
     {
-      total_filter += coeffs[i];
+      total_filter += (STBIR_RENORM_TYPE) coeffs[i];
       STBIR_ASSERT( ( coeffs[i] >= -2.0f ) && ( coeffs[i] <= 2.0f )  ); // check for wonky weights
     }
 
@@ -3423,10 +3492,11 @@ static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter
       // if the total isn't 1.0, rescale everything
       if ( ( total_filter < (1.0f-stbir__small_float) ) || ( total_filter > (1.0f+stbir__small_float) ) )
       {
-        filter_scale = 1.0f / total_filter;
+        filter_scale = ((STBIR_RENORM_TYPE)1.0) / total_filter;
+
         // scale them all
         for (i = 0; i <= e; i++)
-          coeffs[i] *= filter_scale;
+          coeffs[i] = (float) ( coeffs[i] * filter_scale );
       }
     }
     ++contribs;
@@ -3483,13 +3553,13 @@ static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter
     else if ( ( edge == STBIR_EDGE_CLAMP ) || ( edge == STBIR_EDGE_REFLECT ) )
     {
       // for clamp and reflect, calculate the true inbounds position (based on edge type) and just add that to the existing weight
-      
+
       // right hand side first
       if ( contribs->n1 > input_last_n1 )
       {
         int start = contribs->n0;
         int endi = contribs->n1;
-        contribs->n1 = input_last_n1;  
+        contribs->n1 = input_last_n1;
         for( i = input_size; i <= endi; i++ )
           stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( i, input_size ), coeffs[i-start] );
       }
@@ -3500,18 +3570,18 @@ static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter
         int save_n0;
         float save_n0_coeff;
         float * c = coeffs - ( contribs->n0 + 1 );
-        
+
         // reinsert the coeffs with it reflected or clamped (insert accumulates, if the coeffs exist)
-        for( i = -1 ; i > contribs->n0 ; i-- ) 
+        for( i = -1 ; i > contribs->n0 ; i-- )
           stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( i, input_size ), *c-- );
         save_n0 = contribs->n0;
         save_n0_coeff = c[0]; // save it, since we didn't do the final one (i==n0), because there might be too many coeffs to hold (before we resize)!
 
         // now slide all the coeffs down (since we have accumulated them in the positive contribs) and reset the first contrib
-        contribs->n0 = 0;  
+        contribs->n0 = 0;
         for(i = 0 ; i <= contribs->n1 ; i++ )
           coeffs[i] = coeffs[i-save_n0];
-        
+
         // now that we have shrunk down the contribs, we insert the first one safely
         stbir__insert_coeff( contribs, coeffs, stbir__edge_wrap_slow[edge]( save_n0, input_size ), save_n0_coeff );
       }
@@ -3547,7 +3617,9 @@ static void stbir__cleanup_gathered_coefficients( stbir_edge edge, stbir__filter
   filter_info->widest = widest;
 }
 
-static int stbir__pack_coefficients( int num_contributors, stbir__contributors* contributors, float * coefficents, int coefficient_width, int widest, int row_width )
+#undef STBIR_RENORM_TYPE
+
+static int stbir__pack_coefficients( int num_contributors, stbir__contributors* contributors, float * coefficents, int coefficient_width, int widest, int row0, int row1 )
 {
   #define STBIR_MOVE_1( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint32*)(dest))[0] = ((stbir_uint32*)(src))[0]; }
   #define STBIR_MOVE_2( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint64*)(dest))[0] = ((stbir_uint64*)(src))[0]; }
@@ -3556,6 +3628,10 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
   #else
   #define STBIR_MOVE_4( dest, src ) { STBIR_NO_UNROLL(dest); ((stbir_uint64*)(dest))[0] = ((stbir_uint64*)(src))[0]; ((stbir_uint64*)(dest))[1] = ((stbir_uint64*)(src))[1]; }
   #endif
+
+  int row_end = row1 + 1;
+  STBIR__UNUSED( row0 ); // only used in an assert
+
   if ( coefficient_width != widest )
   {
     float * pc = coefficents;
@@ -3564,6 +3640,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
     switch( widest )
     {
       case 1:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           STBIR_MOVE_1( pc, coeffs );
           ++pc;
@@ -3571,6 +3648,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
         } while ( pc < pc_end );
         break;
       case 2:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           STBIR_MOVE_2( pc, coeffs );
           pc += 2;
@@ -3578,6 +3656,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
         } while ( pc < pc_end );
         break;
       case 3:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           STBIR_MOVE_2( pc, coeffs );
           STBIR_MOVE_1( pc+2, coeffs+2 );
@@ -3586,6 +3665,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
         } while ( pc < pc_end );
         break;
       case 4:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           STBIR_MOVE_4( pc, coeffs );
           pc += 4;
@@ -3593,6 +3673,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
         } while ( pc < pc_end );
         break;
       case 5:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           STBIR_MOVE_4( pc, coeffs );
           STBIR_MOVE_1( pc+4, coeffs+4 );
@@ -3601,6 +3682,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
         } while ( pc < pc_end );
         break;
       case 6:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           STBIR_MOVE_4( pc, coeffs );
           STBIR_MOVE_2( pc+4, coeffs+4 );
@@ -3609,6 +3691,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
         } while ( pc < pc_end );
         break;
       case 7:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           STBIR_MOVE_4( pc, coeffs );
           STBIR_MOVE_2( pc+4, coeffs+4 );
@@ -3618,6 +3701,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
         } while ( pc < pc_end );
         break;
       case 8:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           STBIR_MOVE_4( pc, coeffs );
           STBIR_MOVE_4( pc+4, coeffs+4 );
@@ -3626,6 +3710,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
         } while ( pc < pc_end );
         break;
       case 9:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           STBIR_MOVE_4( pc, coeffs );
           STBIR_MOVE_4( pc+4, coeffs+4 );
@@ -3635,6 +3720,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
         } while ( pc < pc_end );
         break;
       case 10:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           STBIR_MOVE_4( pc, coeffs );
           STBIR_MOVE_4( pc+4, coeffs+4 );
@@ -3644,6 +3730,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
         } while ( pc < pc_end );
         break;
       case 11:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           STBIR_MOVE_4( pc, coeffs );
           STBIR_MOVE_4( pc+4, coeffs+4 );
@@ -3654,6 +3741,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
         } while ( pc < pc_end );
         break;
       case 12:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           STBIR_MOVE_4( pc, coeffs );
           STBIR_MOVE_4( pc+4, coeffs+4 );
@@ -3663,6 +3751,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
         } while ( pc < pc_end );
         break;
       default:
+        STBIR_NO_UNROLL_LOOP_START
         do {
           float * copy_end = pc + widest - 4;
           float * c = coeffs;
@@ -3673,6 +3762,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
             c += 4;
           } while ( pc <= copy_end );
           copy_end += 4;
+          STBIR_NO_UNROLL_LOOP_START
           while ( pc < copy_end )
           {
             STBIR_MOVE_1( pc, c );
@@ -3688,7 +3778,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
   coefficents[ widest * num_contributors ] = 8888.0f;
 
   // the minimum we might read for unrolled filters widths is 12. So, we need to
-  //   make sure we never read outside the decode buffer, by possibly moving 
+  //   make sure we never read outside the decode buffer, by possibly moving
   //   the sample area back into the scanline, and putting zeros weights first.
   // we start on the right edge and check until we're well past the possible
   //   clip area (2*widest).
@@ -3697,13 +3787,13 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
     float * coeffs = coefficents + widest * ( num_contributors - 1 );
 
     // go until no chance of clipping (this is usually less than 8 lops)
-    while ( ( ( contribs->n0 + widest*2 ) >= row_width ) && ( contribs >= contributors ) )
+    while ( ( contribs >= contributors ) && ( ( contribs->n0 + widest*2 ) >= row_end ) )
     {
       // might we clip??
-      if ( ( contribs->n0 + widest ) > row_width )
+      if ( ( contribs->n0 + widest ) > row_end )
       {
         int stop_range = widest;
-      
+
         // if range is larger than 12, it will be handled by generic loops that can terminate on the exact length
         //   of this contrib n1, instead of a fixed widest amount - so calculate this
         if ( widest > 12 )
@@ -3712,22 +3802,22 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
 
           // how far will be read in the n_coeff loop (which depends on the widest count mod4);
           mod = widest & 3;
-          stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod; 
+          stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod;
 
           // the n_coeff loops do a minimum amount of coeffs, so factor that in!
           if ( stop_range < ( 8 + mod ) ) stop_range = 8 + mod;
         }
 
         // now see if we still clip with the refined range
-        if ( ( contribs->n0 + stop_range ) > row_width )
+        if ( ( contribs->n0 + stop_range ) > row_end )
         {
-          int new_n0 = row_width - stop_range;
+          int new_n0 = row_end - stop_range;
           int num = contribs->n1 - contribs->n0 + 1;
           int backup = contribs->n0 - new_n0;
           float * from_co = coeffs + num - 1;
           float * to_co = from_co + backup;
 
-          STBIR_ASSERT( ( new_n0 >= 0 ) && ( new_n0 < contribs->n0 ) );
+          STBIR_ASSERT( ( new_n0 >= row0 ) && ( new_n0 < contribs->n0 ) );
 
           // move the coeffs over
           while( num )
@@ -3746,7 +3836,7 @@ static int stbir__pack_coefficients( int num_contributors, stbir__contributors*
 
             // how far will be read in the n_coeff loop (which depends on the widest count mod4);
             mod = widest & 3;
-            stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod; 
+            stop_range = ( ( ( contribs->n1 - contribs->n0 + 1 ) - mod + 3 ) & ~3 ) + mod;
 
             // the n_coeff loops do a minimum amount of coeffs, so factor that in!
             if ( stop_range < ( 8 + mod ) ) stop_range = 8 + mod;
@@ -3774,7 +3864,7 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot
   int input_full_size = samp->scale_info.input_full_size;
   int gather_num_contributors = samp->num_contributors;
   stbir__contributors* gather_contributors = samp->contributors;
-  float * gather_coeffs = samp->coefficients; 
+  float * gather_coeffs = samp->coefficients;
   int gather_coefficient_width = samp->coefficient_width;
 
   switch ( samp->is_gather )
@@ -3792,16 +3882,16 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot
     break;
 
     case 0: // scatter downsample (only on vertical)
-    case 2: // gather downsample  
+    case 2: // gather downsample
     {
       float in_pixels_radius = support(scale,user_data) * inv_scale;
       int filter_pixel_margin = samp->filter_pixel_margin;
       int input_end = input_full_size + filter_pixel_margin;
-      
+
       // if this is a scatter, we do a downsample gather to get the coeffs, and then pivot after
       if ( !samp->is_gather )
       {
-        // check if we are using the same gather downsample on the horizontal as this vertical, 
+        // check if we are using the same gather downsample on the horizontal as this vertical,
         //   if so, then we don't have to generate them, we can just pivot from the horizontal.
         if ( other_axis_for_pivot )
         {
@@ -3846,30 +3936,37 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot
           float * scatter_coeffs = samp->coefficients + ( gn0 + filter_pixel_margin ) * scatter_coefficient_width;
           float * g_coeffs = gather_coeffs;
           scatter_contributors = samp->contributors + ( gn0 + filter_pixel_margin );
-          
+
           for (k = gn0 ; k <= gn1 ; k++ )
           {
             float gc = *g_coeffs++;
-            if ( ( k > highest_set ) || ( scatter_contributors->n0 > scatter_contributors->n1 ) )
+
+            // skip zero and denormals - must skip zeros to avoid adding coeffs beyond scatter_coefficient_width
+            //   (which happens when pivoting from horizontal, which might have dummy zeros)
+            if ( ( ( gc >= stbir__small_float ) || ( gc <= -stbir__small_float ) ) )
             {
+              if ( ( k > highest_set ) || ( scatter_contributors->n0 > scatter_contributors->n1 ) )
               {
-                // if we are skipping over several contributors, we need to clear the skipped ones
-                stbir__contributors * clear_contributors = samp->contributors + ( highest_set + filter_pixel_margin + 1);
-                while ( clear_contributors < scatter_contributors )
                 {
-                  clear_contributors->n0 = 0; 
-                  clear_contributors->n1 = -1;
-                  ++clear_contributors;
+                  // if we are skipping over several contributors, we need to clear the skipped ones
+                  stbir__contributors * clear_contributors = samp->contributors + ( highest_set + filter_pixel_margin + 1);
+                  while ( clear_contributors < scatter_contributors )
+                  {
+                    clear_contributors->n0 = 0;
+                    clear_contributors->n1 = -1;
+                    ++clear_contributors;
+                  }
                 }
+                scatter_contributors->n0 = n;
+                scatter_contributors->n1 = n;
+                scatter_coeffs[0]  = gc;
+                highest_set = k;
               }
-              scatter_contributors->n0 = n;
-              scatter_contributors->n1 = n;
-              scatter_coeffs[0]  = gc;
-              highest_set = k;
-            }
-            else
-            {
-              stbir__insert_coeff( scatter_contributors, scatter_coeffs, n, gc );
+              else
+              {
+                stbir__insert_coeff( scatter_contributors, scatter_coeffs, n, gc );
+              }
+              STBIR_ASSERT( ( scatter_contributors->n1 - scatter_contributors->n0 + 1 ) <= scatter_coefficient_width );
             }
             ++scatter_contributors;
             scatter_coeffs += scatter_coefficient_width;
@@ -3908,11 +4005,11 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot
 
 #define stbir__decode_suffix BGRA
 #define stbir__decode_swizzle
-#define stbir__decode_order0  2 
+#define stbir__decode_order0  2
 #define stbir__decode_order1  1
 #define stbir__decode_order2  0
 #define stbir__decode_order3  3
-#define stbir__encode_order0  2 
+#define stbir__encode_order0  2
 #define stbir__encode_order1  1
 #define stbir__encode_order2  0
 #define stbir__encode_order3  3
@@ -3922,11 +4019,11 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot
 
 #define stbir__decode_suffix ARGB
 #define stbir__decode_swizzle
-#define stbir__decode_order0  1 
+#define stbir__decode_order0  1
 #define stbir__decode_order1  2
 #define stbir__decode_order2  3
 #define stbir__decode_order3  0
-#define stbir__encode_order0  3 
+#define stbir__encode_order0  3
 #define stbir__encode_order1  0
 #define stbir__encode_order2  1
 #define stbir__encode_order3  2
@@ -3936,11 +4033,11 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot
 
 #define stbir__decode_suffix ABGR
 #define stbir__decode_swizzle
-#define stbir__decode_order0  3 
+#define stbir__decode_order0  3
 #define stbir__decode_order1  2
 #define stbir__decode_order2  1
 #define stbir__decode_order3  0
-#define stbir__encode_order0  3 
+#define stbir__encode_order0  3
 #define stbir__encode_order1  2
 #define stbir__encode_order2  1
 #define stbir__encode_order3  0
@@ -3950,12 +4047,12 @@ static void stbir__calculate_filters( stbir__sampler * samp, stbir__sampler * ot
 
 #define stbir__decode_suffix AR
 #define stbir__decode_swizzle
-#define stbir__decode_order0  1 
-#define stbir__decode_order1  0 
+#define stbir__decode_order0  1
+#define stbir__decode_order1  0
 #define stbir__decode_order2  3
 #define stbir__decode_order3  2
-#define stbir__encode_order0  1 
-#define stbir__encode_order1  0 
+#define stbir__encode_order0  1
+#define stbir__encode_order1  0
 #define stbir__encode_order2  3
 #define stbir__encode_order3  2
 #define stbir__coder_min_num 2
@@ -3973,9 +4070,10 @@ static void stbir__fancy_alpha_weight_4ch( float * out_buffer, int width_times_c
   // fancy alpha is stored internally as R G B A Rpm Gpm Bpm
 
   #ifdef STBIR_SIMD
-  
+
   #ifdef STBIR_SIMD8
   decode += 16;
+  STBIR_NO_UNROLL_LOOP_START
   while ( decode <= end_decode )
   {
     stbir__simdf8 d0,d1,a0,a1,p0,p1;
@@ -3998,8 +4096,9 @@ static void stbir__fancy_alpha_weight_4ch( float * out_buffer, int width_times_c
     out += 28;
   }
   decode -= 16;
-  #else  
+  #else
   decode += 8;
+  STBIR_NO_UNROLL_LOOP_START
   while ( decode <= end_decode )
   {
     stbir__simdf d0,a0,d1,a1,p0,p1;
@@ -4022,12 +4121,14 @@ static void stbir__fancy_alpha_weight_4ch( float * out_buffer, int width_times_c
 
   // might be one last odd pixel
   #ifdef STBIR_SIMD8
+  STBIR_NO_UNROLL_LOOP_START
   while ( decode < end_decode )
   #else
   if ( decode < end_decode )
   #endif
   {
     stbir__simdf d,a,p;
+    STBIR_NO_UNROLL(decode);
     stbir__simdf_load( d, decode );
     stbir__simdf_0123to3333( a, d );
     stbir__simdf_mult( p, a, d );
@@ -4069,6 +4170,7 @@ static void stbir__fancy_alpha_weight_2ch( float * out_buffer, int width_times_c
   decode += 8;
   if ( decode <= end_decode )
   {
+    STBIR_NO_UNROLL_LOOP_START
     do {
       #ifdef STBIR_SIMD8
       stbir__simdf8 d0,a0,p0;
@@ -4077,11 +4179,11 @@ static void stbir__fancy_alpha_weight_2ch( float * out_buffer, int width_times_c
       stbir__simdf8_0123to11331133( p0, d0 );
       stbir__simdf8_0123to00220022( a0, d0 );
       stbir__simdf8_mult( p0, p0, a0 );
- 
+
       stbir__simdf_store2( out, stbir__if_simdf8_cast_to_simdf4( d0 ) );
       stbir__simdf_store( out+2, stbir__if_simdf8_cast_to_simdf4( p0 ) );
       stbir__simdf_store2h( out+3, stbir__if_simdf8_cast_to_simdf4( d0 ) );
-      
+
       stbir__simdf_store2( out+6, stbir__simdf8_gettop4( d0 ) );
       stbir__simdf_store( out+8, stbir__simdf8_gettop4( p0 ) );
       stbir__simdf_store2h( out+9, stbir__simdf8_gettop4( d0 ) );
@@ -4112,6 +4214,7 @@ static void stbir__fancy_alpha_weight_2ch( float * out_buffer, int width_times_c
   decode -= 8;
   #endif
 
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while( decode < end_decode )
   {
     float x = decode[0], y = decode[1];
@@ -4132,6 +4235,7 @@ static void stbir__fancy_alpha_unweight_4ch( float * encode_buffer, int width_ti
 
   // fancy RGBA is stored internally as R G B A Rpm Gpm Bpm
 
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
     float alpha = input[3];
 #ifdef STBIR_SIMD
@@ -4199,6 +4303,7 @@ static void stbir__simple_alpha_weight_4ch( float * decode_buffer, int width_tim
   #ifdef STBIR_SIMD
   {
     decode += 2 * stbir__simdfX_float_count;
+    STBIR_NO_UNROLL_LOOP_START
     while ( decode <= end_decode )
     {
       stbir__simdfX d0,a0,d1,a1;
@@ -4217,6 +4322,7 @@ static void stbir__simple_alpha_weight_4ch( float * decode_buffer, int width_tim
 
     // few last pixels remnants
     #ifdef STBIR_SIMD8
+    STBIR_NO_UNROLL_LOOP_START
     while ( decode < end_decode )
     #else
     if ( decode < end_decode )
@@ -4252,6 +4358,7 @@ static void stbir__simple_alpha_weight_2ch( float * decode_buffer, int width_tim
 
   #ifdef STBIR_SIMD
   decode += 2 * stbir__simdfX_float_count;
+  STBIR_NO_UNROLL_LOOP_START
   while ( decode <= end_decode )
   {
     stbir__simdfX d0,a0,d1,a1;
@@ -4269,6 +4376,7 @@ static void stbir__simple_alpha_weight_2ch( float * decode_buffer, int width_tim
   decode -= 2 * stbir__simdfX_float_count;
   #endif
 
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while( decode < end_decode )
   {
     float alpha = decode[1];
@@ -4283,6 +4391,7 @@ static void stbir__simple_alpha_unweight_4ch( float * encode_buffer, int width_t
   float STBIR_SIMD_STREAMOUT_PTR(*) encode = encode_buffer;
   float const * end_output = encode_buffer + width_times_channels;
 
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
     float alpha = encode[3];
 
@@ -4330,9 +4439,77 @@ static void stbir__simple_flip_3ch( float * decode_buffer, int width_times_chann
   float STBIR_STREAMOUT_PTR(*) decode = decode_buffer;
   float const * end_decode = decode_buffer + width_times_channels;
 
-  decode += 12;
+#ifdef STBIR_SIMD
+    #ifdef stbir__simdf_swiz2 // do we have two argument swizzles?
+      end_decode -= 12;
+      STBIR_NO_UNROLL_LOOP_START
+      while( decode <= end_decode )
+      {
+        // on arm64 8 instructions, no overlapping stores
+        stbir__simdf a,b,c,na,nb;
+        STBIR_SIMD_NO_UNROLL(decode);
+        stbir__simdf_load( a, decode );
+        stbir__simdf_load( b, decode+4 );
+        stbir__simdf_load( c, decode+8 );
+
+        na = stbir__simdf_swiz2( a, b, 2, 1, 0, 5 );
+        b  = stbir__simdf_swiz2( a, b, 4, 3, 6, 7 );
+        nb = stbir__simdf_swiz2( b, c, 0, 1, 4, 3 );
+        c  = stbir__simdf_swiz2( b, c, 2, 7, 6, 5 );
+
+        stbir__simdf_store( decode, na );
+        stbir__simdf_store( decode+4, nb );
+        stbir__simdf_store( decode+8, c );
+        decode += 12;
+      }
+      end_decode += 12;
+    #else
+      end_decode -= 24;
+      STBIR_NO_UNROLL_LOOP_START
+      while( decode <= end_decode )
+      {
+        // 26 instructions on x64
+        stbir__simdf a,b,c,d,e,f,g;
+        float i21, i23;
+        STBIR_SIMD_NO_UNROLL(decode);
+        stbir__simdf_load( a, decode );
+        stbir__simdf_load( b, decode+3 );
+        stbir__simdf_load( c, decode+6 );
+        stbir__simdf_load( d, decode+9 );
+        stbir__simdf_load( e, decode+12 );
+        stbir__simdf_load( f, decode+15 );
+        stbir__simdf_load( g, decode+18 );
+
+        a = stbir__simdf_swiz( a, 2, 1, 0, 3 );
+        b = stbir__simdf_swiz( b, 2, 1, 0, 3 );
+        c = stbir__simdf_swiz( c, 2, 1, 0, 3 );
+        d = stbir__simdf_swiz( d, 2, 1, 0, 3 );
+        e = stbir__simdf_swiz( e, 2, 1, 0, 3 );
+        f = stbir__simdf_swiz( f, 2, 1, 0, 3 );
+        g = stbir__simdf_swiz( g, 2, 1, 0, 3 );
+
+        // stores overlap, so they must be issued in order
+        stbir__simdf_store( decode,    a );
+        i21 = decode[21];
+        stbir__simdf_store( decode+3,  b );
+        i23 = decode[23];
+        stbir__simdf_store( decode+6,  c );
+        stbir__simdf_store( decode+9,  d );
+        stbir__simdf_store( decode+12, e );
+        stbir__simdf_store( decode+15, f );
+        stbir__simdf_store( decode+18, g );
+        decode[21] = i23;
+        decode[23] = i21;
+        decode += 24;
+      }
+      end_decode += 24;
+    #endif
+#else
+  end_decode -= 12;
+  STBIR_NO_UNROLL_LOOP_START
   while( decode <= end_decode )
   {
+    // 16 instructions
     float t0,t1,t2,t3;
     STBIR_NO_UNROLL(decode);
     t0 = decode[0]; t1 = decode[3]; t2 = decode[6]; t3 = decode[9];
@@ -4340,8 +4517,10 @@ static void stbir__simple_flip_3ch( float * decode_buffer, int width_times_chann
     decode[2] = t0; decode[5] = t1; decode[8] = t2; decode[11] = t3;
     decode += 12;
   }
-  decode -= 12;
+  end_decode += 12;
+#endif
 
+  STBIR_NO_UNROLL_LOOP_START
   while( decode < end_decode )
   {
     float t = decode[0];
@@ -4362,14 +4541,14 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
   stbir_edge edge_horizontal = stbir_info->horizontal.edge;
   stbir_edge edge_vertical = stbir_info->vertical.edge;
   int row = stbir__edge_wrap(edge_vertical, n, stbir_info->vertical.scale_info.input_full_size);
-  const void* input_plane_data = ( (char *) stbir_info->input_data ) + (ptrdiff_t)row * (ptrdiff_t) stbir_info->input_stride_bytes;
+  const void* input_plane_data = ( (char *) stbir_info->input_data ) + (size_t)row * (size_t) stbir_info->input_stride_bytes;
   stbir__span const * spans = stbir_info->scanline_extents.spans;
   float* full_decode_buffer = output_buffer - stbir_info->scanline_extents.conservative.n0 * effective_channels;
 
   // if we are on edge_zero, and we get in here with an out of bounds n, then the calculate filters has failed
   STBIR_ASSERT( !(edge_vertical == STBIR_EDGE_ZERO && (n < 0 || n >= stbir_info->vertical.scale_info.input_full_size)) );
 
-  do 
+  do
   {
     float * decode_buffer;
     void const * input_data;
@@ -4377,7 +4556,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     int width_times_channels;
     int width;
 
-    if ( spans->n1 < spans->n0 )    
+    if ( spans->n1 < spans->n0 )
       break;
 
     width = spans->n1 + 1 - spans->n0;
@@ -4394,7 +4573,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
       // call the callback with a temp buffer (that they can choose to use or not).  the temp is just right aligned memory in the decode_buffer itself
       input_data = stbir_info->in_pixels_cb( ( (char*) end_decode ) - ( width * input_sample_in_bytes ), input_plane_data, width, spans->pixel_offset_for_input, row, stbir_info->user_data );
     }
-    
+
     STBIR_PROFILE_START( decode );
     // convert the pixels info the float decode_buffer, (we index from end_decode, so that when channels<effective_channels, we are right justified in the buffer)
     stbir_info->decode_pixels( (float*)end_decode - width_times_channels, width_times_channels, input_data );
@@ -4418,7 +4597,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     // this code only runs if we're in edge_wrap, and we're doing the entire scanline
     int e, start_x[2];
     int input_full_size = stbir_info->horizontal.scale_info.input_full_size;
-    
+
     start_x[0] = -stbir_info->scanline_extents.edge_sizes[0];  // left edge start x
     start_x[1] =  input_full_size;                             // right edge
 
@@ -4447,7 +4626,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf tot,c;                \
     STBIR_SIMD_NO_UNROLL(decode);      \
     stbir__simdf_load1( c, hc );       \
-    stbir__simdf_mult1_mem( tot, c, decode ); 
+    stbir__simdf_mult1_mem( tot, c, decode );
 
 #define stbir__2_coeff_only()          \
     stbir__simdf tot,c,d;              \
@@ -4456,7 +4635,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_load2( d, decode );   \
     stbir__simdf_mult( tot, c, d );    \
     stbir__simdf_0123to1230( c, tot ); \
-    stbir__simdf_add1( tot, tot, c );          
+    stbir__simdf_add1( tot, tot, c );
 
 #define stbir__3_coeff_only()                  \
     stbir__simdf tot,c,t;                      \
@@ -4466,7 +4645,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_0123to1230( c, tot );         \
     stbir__simdf_0123to2301( t, tot );         \
     stbir__simdf_add1( tot, tot, c );          \
-    stbir__simdf_add1( tot, tot, t );    
+    stbir__simdf_add1( tot, tot, t );
 
 #define stbir__store_output_tiny()                \
     stbir__simdf_store1( output, tot );           \
@@ -4483,7 +4662,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
 #define stbir__4_coeff_continue_from_4( ofs )  \
     STBIR_SIMD_NO_UNROLL(decode);              \
     stbir__simdf_load( c, hc + (ofs) );        \
-    stbir__simdf_madd_mem( tot, tot, c, decode+(ofs) ); 
+    stbir__simdf_madd_mem( tot, tot, c, decode+(ofs) );
 
 #define stbir__1_coeff_remnant( ofs )          \
     { stbir__simdf d;                          \
@@ -4495,7 +4674,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     { stbir__simdf d;                          \
     stbir__simdf_load2z( c, hc+(ofs) );        \
     stbir__simdf_load2( d, decode+(ofs) );     \
-    stbir__simdf_madd( tot, tot, d, c ); }   
+    stbir__simdf_madd( tot, tot, d, c ); }
 
 #define stbir__3_coeff_setup()                 \
     stbir__simdf mask;                         \
@@ -4520,18 +4699,18 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
 
 #define stbir__1_coeff_only()  \
     float tot;                 \
-    tot = decode[0]*hc[0];     
+    tot = decode[0]*hc[0];
 
 #define stbir__2_coeff_only()  \
     float tot;                 \
     tot = decode[0] * hc[0];   \
-    tot += decode[1] * hc[1];    
+    tot += decode[1] * hc[1];
 
 #define stbir__3_coeff_only()  \
     float tot;                 \
     tot = decode[0] * hc[0];   \
     tot += decode[1] * hc[1];  \
-    tot += decode[2] * hc[2];    
+    tot += decode[2] * hc[2];
 
 #define stbir__store_output_tiny()                \
     output[0] = tot;                              \
@@ -4544,16 +4723,16 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     tot0 = decode[0] * hc[0];   \
     tot1 = decode[1] * hc[1];   \
     tot2 = decode[2] * hc[2];   \
-    tot3 = decode[3] * hc[3];     
+    tot3 = decode[3] * hc[3];
 
 #define stbir__4_coeff_continue_from_4( ofs )  \
     tot0 += decode[0+(ofs)] * hc[0+(ofs)];     \
     tot1 += decode[1+(ofs)] * hc[1+(ofs)];     \
     tot2 += decode[2+(ofs)] * hc[2+(ofs)];     \
-    tot3 += decode[3+(ofs)] * hc[3+(ofs)];     
+    tot3 += decode[3+(ofs)] * hc[3+(ofs)];
 
 #define stbir__1_coeff_remnant( ofs )        \
-    tot0 += decode[0+(ofs)] * hc[0+(ofs)];   
+    tot0 += decode[0+(ofs)] * hc[0+(ofs)];
 
 #define stbir__2_coeff_remnant( ofs )        \
     tot0 += decode[0+(ofs)] * hc[0+(ofs)];   \
@@ -4562,7 +4741,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
 #define stbir__3_coeff_remnant( ofs )        \
     tot0 += decode[0+(ofs)] * hc[0+(ofs)];   \
     tot1 += decode[1+(ofs)] * hc[1+(ofs)];   \
-    tot2 += decode[2+(ofs)] * hc[2+(ofs)];   
+    tot2 += decode[2+(ofs)] * hc[2+(ofs)];
 
 #define stbir__store_output()                     \
     output[0] = (tot0+tot2)+(tot1+tot3);          \
@@ -4570,7 +4749,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     ++horizontal_contributors;                    \
     output += 1;
 
-#endif  
+#endif
 
 #define STBIR__horizontal_channels 1
 #define STB_IMAGE_RESIZE_DO_HORIZONTALS
@@ -4588,14 +4767,14 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_load1z( c, hc );     \
     stbir__simdf_0123to0011( c, c );  \
     stbir__simdf_load2( d, decode );  \
-    stbir__simdf_mult( tot, d, c ); 
+    stbir__simdf_mult( tot, d, c );
 
 #define stbir__2_coeff_only()         \
     stbir__simdf tot,c;               \
     STBIR_SIMD_NO_UNROLL(decode);     \
     stbir__simdf_load2( c, hc );      \
     stbir__simdf_0123to0011( c, c );  \
-    stbir__simdf_mult_mem( tot, c, decode ); 
+    stbir__simdf_mult_mem( tot, c, decode );
 
 #define stbir__3_coeff_only()                \
     stbir__simdf tot,c,cs,d;                 \
@@ -4605,7 +4784,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_mult_mem( tot, c, decode ); \
     stbir__simdf_0123to2222( c, cs );        \
     stbir__simdf_load2z( d, decode+4 );      \
-    stbir__simdf_madd( tot, tot, d, c );   
+    stbir__simdf_madd( tot, tot, d, c );
 
 #define stbir__store_output_tiny()                \
     stbir__simdf_0123to2301( c, tot );            \
@@ -4628,7 +4807,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     STBIR_SIMD_NO_UNROLL(decode);                    \
     stbir__simdf8_load4b( cs, hc + (ofs) );          \
     stbir__simdf8_0123to00112233( c, cs );           \
-    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); 
+    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*2 );
 
 #define stbir__1_coeff_remnant( ofs )                \
     { stbir__simdf t;                                \
@@ -4649,13 +4828,13 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf8_load4b( cs, hc + (ofs) );          \
     stbir__simdf8_0123to00112233( c, cs );           \
     stbir__simdf8_load6z( d, decode+(ofs)*2 );       \
-    stbir__simdf8_madd( tot0, tot0, c, d ); }               
+    stbir__simdf8_madd( tot0, tot0, c, d ); }
 
 #define stbir__store_output()                     \
-    { stbir__simdf t,c;                           \
+    { stbir__simdf t,d;                           \
     stbir__simdf8_add4halves( t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0 );    \
-    stbir__simdf_0123to2301( c, t );              \
-    stbir__simdf_add( t, t, c );                  \
+    stbir__simdf_0123to2301( d, t );              \
+    stbir__simdf_add( t, t, d );                  \
     stbir__simdf_store2( output, t );             \
     horizontal_coefficients += coefficient_width; \
     ++horizontal_contributors;                    \
@@ -4670,7 +4849,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_0123to0011( c, cs );            \
     stbir__simdf_mult_mem( tot0, c, decode );    \
     stbir__simdf_0123to2233( c, cs );            \
-    stbir__simdf_mult_mem( tot1, c, decode+4 );   
+    stbir__simdf_mult_mem( tot1, c, decode+4 );
 
 #define stbir__4_coeff_continue_from_4( ofs )                \
     STBIR_SIMD_NO_UNROLL(decode);                            \
@@ -4678,7 +4857,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_0123to0011( c, cs );                        \
     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 );  \
     stbir__simdf_0123to2233( c, cs );                        \
-    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*2+4 );   
+    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*2+4 );
 
 #define stbir__1_coeff_remnant( ofs )            \
     { stbir__simdf d;                            \
@@ -4690,7 +4869,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
 #define stbir__2_coeff_remnant( ofs )                      \
     stbir__simdf_load2( cs, hc + (ofs) );                  \
     stbir__simdf_0123to0011( c, cs );                      \
-    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 );       
+    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 );
 
 #define stbir__3_coeff_remnant( ofs )                       \
     { stbir__simdf d;                                       \
@@ -4699,7 +4878,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*2 ); \
     stbir__simdf_0123to2222( c, cs );                       \
     stbir__simdf_load2z( d, decode + (ofs) * 2 + 4 );       \
-    stbir__simdf_madd( tot1, tot1, d, c ); }  
+    stbir__simdf_madd( tot1, tot1, d, c ); }
 
 #define stbir__store_output()                     \
     stbir__simdf_add( tot0, tot0, tot1 );         \
@@ -4718,7 +4897,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     float tota,totb,c;         \
     c = hc[0];                 \
     tota = decode[0]*c;        \
-    totb = decode[1]*c;     
+    totb = decode[1]*c;
 
 #define stbir__2_coeff_only()  \
     float tota,totb,c;         \
@@ -4727,7 +4906,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     totb = decode[1]*c;        \
     c = hc[1];                 \
     tota += decode[2]*c;       \
-    totb += decode[3]*c;     
+    totb += decode[3]*c;
 
 // this weird order of add matches the simd
 #define stbir__3_coeff_only()  \
@@ -4740,7 +4919,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     totb += decode[5]*c;       \
     c = hc[1];                 \
     tota += decode[2]*c;       \
-    totb += decode[3]*c;     
+    totb += decode[3]*c;
 
 #define stbir__store_output_tiny()                \
     output[0] = tota;                             \
@@ -4762,7 +4941,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     totb2 = decode[5]*c;            \
     c = hc[3];                      \
     tota3 = decode[6]*c;            \
-    totb3 = decode[7]*c;     
+    totb3 = decode[7]*c;
 
 #define stbir__4_coeff_continue_from_4( ofs )  \
     c = hc[0+(ofs)];                           \
@@ -4776,12 +4955,12 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     totb2 += decode[5+(ofs)*2]*c;              \
     c = hc[3+(ofs)];                           \
     tota3 += decode[6+(ofs)*2]*c;              \
-    totb3 += decode[7+(ofs)*2]*c;     
+    totb3 += decode[7+(ofs)*2]*c;
 
 #define stbir__1_coeff_remnant( ofs )  \
     c = hc[0+(ofs)];                   \
     tota0 += decode[0+(ofs)*2] * c;    \
-    totb0 += decode[1+(ofs)*2] * c;   
+    totb0 += decode[1+(ofs)*2] * c;
 
 #define stbir__2_coeff_remnant( ofs )  \
     c = hc[0+(ofs)];                   \
@@ -4789,7 +4968,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     totb0 += decode[1+(ofs)*2] * c;    \
     c = hc[1+(ofs)];                   \
     tota1 += decode[2+(ofs)*2] * c;    \
-    totb1 += decode[3+(ofs)*2] * c;   
+    totb1 += decode[3+(ofs)*2] * c;
 
 #define stbir__3_coeff_remnant( ofs )  \
     c = hc[0+(ofs)];                   \
@@ -4800,7 +4979,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     totb1 += decode[3+(ofs)*2] * c;    \
     c = hc[2+(ofs)];                   \
     tota2 += decode[4+(ofs)*2] * c;    \
-    totb2 += decode[5+(ofs)*2] * c;    
+    totb2 += decode[5+(ofs)*2] * c;
 
 #define stbir__store_output()                     \
     output[0] = (tota0+tota2)+(tota1+tota3);      \
@@ -4809,7 +4988,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     ++horizontal_contributors;                    \
     output += 2;
 
-#endif  
+#endif
 
 #define STBIR__horizontal_channels 2
 #define STB_IMAGE_RESIZE_DO_HORIZONTALS
@@ -4827,7 +5006,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_load1z( c, hc );     \
     stbir__simdf_0123to0001( c, c );  \
     stbir__simdf_load( d, decode );   \
-    stbir__simdf_mult( tot, d, c ); 
+    stbir__simdf_mult( tot, d, c );
 
 #define stbir__2_coeff_only()         \
     stbir__simdf tot,c,cs,d;          \
@@ -4838,7 +5017,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_mult( tot, d, c );   \
     stbir__simdf_0123to1111( c, cs ); \
     stbir__simdf_load( d, decode+3 ); \
-    stbir__simdf_madd( tot, tot, d, c ); 
+    stbir__simdf_madd( tot, tot, d, c );
 
 #define stbir__3_coeff_only()            \
     stbir__simdf tot,c,d,cs;             \
@@ -4852,7 +5031,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_madd( tot, tot, d, c ); \
     stbir__simdf_0123to2222( c, cs );    \
     stbir__simdf_load( d, decode+6 );    \
-    stbir__simdf_madd( tot, tot, d, c ); 
+    stbir__simdf_madd( tot, tot, d, c );
 
 #define stbir__store_output_tiny()                \
     stbir__simdf_store2( output, tot );           \
@@ -4872,7 +5051,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf8_0123to00001111( c, cs );         \
     stbir__simdf8_mult_mem( tot0, c, decode - 1 ); \
     stbir__simdf8_0123to22223333( c, cs );         \
-    stbir__simdf8_mult_mem( tot1, c, decode+6 - 1 );    
+    stbir__simdf8_mult_mem( tot1, c, decode+6 - 1 );
 
 #define stbir__4_coeff_continue_from_4( ofs )      \
     STBIR_SIMD_NO_UNROLL(decode);                  \
@@ -4880,26 +5059,26 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf8_0123to00001111( c, cs );         \
     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); \
     stbir__simdf8_0123to22223333( c, cs );         \
-    stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*3 + 6 - 1 );    
+    stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*3 + 6 - 1 );
 
 #define stbir__1_coeff_remnant( ofs )                          \
     STBIR_SIMD_NO_UNROLL(decode);                              \
     stbir__simdf_load1rep4( t, hc + (ofs) );                   \
-    stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*3 - 1 ); 
+    stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*3 - 1 );
 
 #define stbir__2_coeff_remnant( ofs )                          \
     STBIR_SIMD_NO_UNROLL(decode);                              \
     stbir__simdf8_load4b( cs, hc + (ofs) - 2 );                \
     stbir__simdf8_0123to22223333( c, cs );                     \
-    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 );   
- 
+    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 );
+
  #define stbir__3_coeff_remnant( ofs )                           \
     STBIR_SIMD_NO_UNROLL(decode);                                \
     stbir__simdf8_load4b( cs, hc + (ofs) );                      \
     stbir__simdf8_0123to00001111( c, cs );                       \
     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*3 - 1 ); \
     stbir__simdf8_0123to2222( t, cs );                           \
-    stbir__simdf8_madd_mem4( tot1, tot1, t, decode+(ofs)*3 + 6 - 1 ); 
+    stbir__simdf8_madd_mem4( tot1, tot1, t, decode+(ofs)*3 + 6 - 1 );
 
 #define stbir__store_output()                       \
     stbir__simdf8_add( tot0, tot0, tot1 );          \
@@ -4930,7 +5109,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_0123to1122( c, cs );           \
     stbir__simdf_mult_mem( tot1, c, decode+4 ); \
     stbir__simdf_0123to2333( c, cs );           \
-    stbir__simdf_mult_mem( tot2, c, decode+8 ); 
+    stbir__simdf_mult_mem( tot2, c, decode+8 );
 
 #define stbir__4_coeff_continue_from_4( ofs )                 \
     STBIR_SIMD_NO_UNROLL(decode);                             \
@@ -4940,13 +5119,13 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_0123to1122( c, cs );                         \
     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*3+4 ); \
     stbir__simdf_0123to2333( c, cs );                         \
-    stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*3+8 );   
+    stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*3+8 );
 
 #define stbir__1_coeff_remnant( ofs )         \
     STBIR_SIMD_NO_UNROLL(decode);             \
     stbir__simdf_load1z( c, hc + (ofs) );     \
     stbir__simdf_0123to0001( c, c );          \
-    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 );   
+    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 );
 
 #define stbir__2_coeff_remnant( ofs )                       \
     { stbir__simdf d;                                       \
@@ -4956,7 +5135,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*3 ); \
     stbir__simdf_0123to1122( c, cs );                       \
     stbir__simdf_load2z( d, decode+(ofs)*3+4 );             \
-    stbir__simdf_madd( tot1, tot1, c, d ); }                 
+    stbir__simdf_madd( tot1, tot1, c, d ); }
 
 #define stbir__3_coeff_remnant( ofs )                         \
     { stbir__simdf d;                                         \
@@ -4968,7 +5147,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*3+4 ); \
     stbir__simdf_0123to2222( c, cs );                         \
     stbir__simdf_load1z( d, decode+(ofs)*3+8 );               \
-    stbir__simdf_madd( tot2, tot2, c, d );  }                
+    stbir__simdf_madd( tot2, tot2, c, d );  }
 
 #define stbir__store_output()                       \
     stbir__simdf_0123ABCDto3ABx( c, tot0, tot1 );   \
@@ -4999,7 +5178,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     c = hc[0];                 \
     tot0 = decode[0]*c;        \
     tot1 = decode[1]*c;        \
-    tot2 = decode[2]*c;              
+    tot2 = decode[2]*c;
 
 #define stbir__2_coeff_only()  \
     float tot0, tot1, tot2, c; \
@@ -5010,7 +5189,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     c = hc[1];                 \
     tot0 += decode[3]*c;       \
     tot1 += decode[4]*c;       \
-    tot2 += decode[5]*c;              
+    tot2 += decode[5]*c;
 
 #define stbir__3_coeff_only()  \
     float tot0, tot1, tot2, c; \
@@ -5025,7 +5204,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     c = hc[2];                 \
     tot0 += decode[6]*c;       \
     tot1 += decode[7]*c;       \
-    tot2 += decode[8]*c;              
+    tot2 += decode[8]*c;
 
 #define stbir__store_output_tiny()                \
     output[0] = tot0;                             \
@@ -5052,7 +5231,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     c = hc[3];                      \
     totd0 = decode[9]*c;            \
     totd1 = decode[10]*c;           \
-    totd2 = decode[11]*c;            
+    totd2 = decode[11]*c;
 
 #define stbir__4_coeff_continue_from_4( ofs )  \
     c = hc[0+(ofs)];                           \
@@ -5070,7 +5249,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     c = hc[3+(ofs)];                           \
     totd0 += decode[9+(ofs)*3]*c;              \
     totd1 += decode[10+(ofs)*3]*c;             \
-    totd2 += decode[11+(ofs)*3]*c;              
+    totd2 += decode[11+(ofs)*3]*c;
 
 #define stbir__1_coeff_remnant( ofs )  \
     c = hc[0+(ofs)];                   \
@@ -5100,7 +5279,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     c = hc[2+(ofs)];                   \
     totc0 += decode[6+(ofs)*3]*c;      \
     totc1 += decode[7+(ofs)*3]*c;      \
-    totc2 += decode[8+(ofs)*3]*c;              
+    totc2 += decode[8+(ofs)*3]*c;
 
 #define stbir__store_output()                     \
     output[0] = (tota0+totc0)+(totb0+totd0);      \
@@ -5110,7 +5289,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     ++horizontal_contributors;                    \
     output += 3;
 
-#endif  
+#endif
 
 #define STBIR__horizontal_channels 3
 #define STB_IMAGE_RESIZE_DO_HORIZONTALS
@@ -5126,7 +5305,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     STBIR_SIMD_NO_UNROLL(decode);         \
     stbir__simdf_load1( c, hc );          \
     stbir__simdf_0123to0000( c, c );      \
-    stbir__simdf_mult_mem( tot, c, decode ); 
+    stbir__simdf_mult_mem( tot, c, decode );
 
 #define stbir__2_coeff_only()                       \
     stbir__simdf tot,c,cs;                          \
@@ -5135,7 +5314,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_0123to0000( c, cs );               \
     stbir__simdf_mult_mem( tot, c, decode );        \
     stbir__simdf_0123to1111( c, cs );               \
-    stbir__simdf_madd_mem( tot, tot, c, decode+4 ); 
+    stbir__simdf_madd_mem( tot, tot, c, decode+4 );
 
 #define stbir__3_coeff_only()                       \
     stbir__simdf tot,c,cs;                          \
@@ -5146,7 +5325,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_0123to1111( c, cs );               \
     stbir__simdf_madd_mem( tot, tot, c, decode+4 ); \
     stbir__simdf_0123to2222( c, cs );               \
-    stbir__simdf_madd_mem( tot, tot, c, decode+8 ); 
+    stbir__simdf_madd_mem( tot, tot, c, decode+8 );
 
 #define stbir__store_output_tiny()                \
     stbir__simdf_store( output, tot );            \
@@ -5163,7 +5342,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf8_0123to00001111( c, cs );         \
     stbir__simdf8_mult_mem( tot0, c, decode );     \
     stbir__simdf8_0123to22223333( c, cs );         \
-    stbir__simdf8_madd_mem( tot0, tot0, c, decode+8 );    
+    stbir__simdf8_madd_mem( tot0, tot0, c, decode+8 );
 
 #define stbir__4_coeff_continue_from_4( ofs )                  \
     STBIR_SIMD_NO_UNROLL(decode);                              \
@@ -5171,26 +5350,26 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf8_0123to00001111( c, cs );                     \
     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 );   \
     stbir__simdf8_0123to22223333( c, cs );                     \
-    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 );    
+    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 );
 
 #define stbir__1_coeff_remnant( ofs )                          \
     STBIR_SIMD_NO_UNROLL(decode);                              \
     stbir__simdf_load1rep4( t, hc + (ofs) );                   \
-    stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4 ); 
+    stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4 );
 
 #define stbir__2_coeff_remnant( ofs )                          \
     STBIR_SIMD_NO_UNROLL(decode);                              \
     stbir__simdf8_load4b( cs, hc + (ofs) - 2 );                \
     stbir__simdf8_0123to22223333( c, cs );                     \
-    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 );   
- 
+    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 );
+
  #define stbir__3_coeff_remnant( ofs )                         \
     STBIR_SIMD_NO_UNROLL(decode);                              \
     stbir__simdf8_load4b( cs, hc + (ofs) );                    \
     stbir__simdf8_0123to00001111( c, cs );                     \
     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*4 );   \
     stbir__simdf8_0123to2222( t, cs );                         \
-    stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4+8 ); 
+    stbir__simdf8_madd_mem4( tot0, tot0, t, decode+(ofs)*4+8 );
 
 #define stbir__store_output()                      \
     stbir__simdf8_add4halves( t, stbir__if_simdf8_cast_to_simdf4(tot0), tot0 );     \
@@ -5199,7 +5378,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     ++horizontal_contributors;                     \
     output += 4;
 
-#else    
+#else
 
 #define stbir__4_coeff_start()                        \
     stbir__simdf tot0,tot1,c,cs;                      \
@@ -5212,7 +5391,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_0123to2222( c, cs );                 \
     stbir__simdf_madd_mem( tot0, tot0, c, decode+8 ); \
     stbir__simdf_0123to3333( c, cs );                 \
-    stbir__simdf_madd_mem( tot1, tot1, c, decode+12 ); 
+    stbir__simdf_madd_mem( tot1, tot1, c, decode+12 );
 
 #define stbir__4_coeff_continue_from_4( ofs )                  \
     STBIR_SIMD_NO_UNROLL(decode);                              \
@@ -5224,13 +5403,13 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_0123to2222( c, cs );                          \
     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4+8 );  \
     stbir__simdf_0123to3333( c, cs );                          \
-    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+12 ); 
+    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+12 );
 
 #define stbir__1_coeff_remnant( ofs )                       \
     STBIR_SIMD_NO_UNROLL(decode);                           \
     stbir__simdf_load1( c, hc + (ofs) );                    \
     stbir__simdf_0123to0000( c, c );                        \
-    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 ); 
+    stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 );
 
 #define stbir__2_coeff_remnant( ofs )                         \
     STBIR_SIMD_NO_UNROLL(decode);                             \
@@ -5238,8 +5417,8 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_0123to0000( c, cs );                         \
     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*4 );   \
     stbir__simdf_0123to1111( c, cs );                         \
-    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 ); 
-  
+    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*4+4 );
+
 #define stbir__3_coeff_remnant( ofs )                          \
     STBIR_SIMD_NO_UNROLL(decode);                              \
     stbir__simdf_load( cs, hc + (ofs) );                       \
@@ -5365,7 +5544,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     x0 += decode[0+(ofs)*4] * c;      \
     x1 += decode[1+(ofs)*4] * c;      \
     x2 += decode[2+(ofs)*4] * c;      \
-    x3 += decode[3+(ofs)*4] * c;      
+    x3 += decode[3+(ofs)*4] * c;
 
 #define stbir__2_coeff_remnant( ofs ) \
     STBIR_SIMD_NO_UNROLL(decode);     \
@@ -5378,8 +5557,8 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     y0 += decode[4+(ofs)*4] * c;      \
     y1 += decode[5+(ofs)*4] * c;      \
     y2 += decode[6+(ofs)*4] * c;      \
-    y3 += decode[7+(ofs)*4] * c;    
-  
+    y3 += decode[7+(ofs)*4] * c;
+
 #define stbir__3_coeff_remnant( ofs ) \
     STBIR_SIMD_NO_UNROLL(decode);     \
     c = hc[0+(ofs)];                  \
@@ -5396,7 +5575,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     x0 += decode[8+(ofs)*4] * c;      \
     x1 += decode[9+(ofs)*4] * c;      \
     x2 += decode[10+(ofs)*4] * c;     \
-    x3 += decode[11+(ofs)*4] * c;     
+    x3 += decode[11+(ofs)*4] * c;
 
 #define stbir__store_output()                     \
     output[0] = x0 + y0;                          \
@@ -5407,7 +5586,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     ++horizontal_contributors;                    \
     output += 4;
 
-#endif  
+#endif
 
 #define STBIR__horizontal_channels 4
 #define STB_IMAGE_RESIZE_DO_HORIZONTALS
@@ -5426,7 +5605,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_load1( c, hc );                \
     stbir__simdf_0123to0000( c, c );            \
     stbir__simdf_mult_mem( tot0, c, decode );   \
-    stbir__simdf_mult_mem( tot1, c, decode+3 ); 
+    stbir__simdf_mult_mem( tot1, c, decode+3 );
 
 #define stbir__2_coeff_only()                         \
     stbir__simdf tot0,tot1,c,cs;                      \
@@ -5437,7 +5616,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_mult_mem( tot1, c, decode+3 );       \
     stbir__simdf_0123to1111( c, cs );                 \
     stbir__simdf_madd_mem( tot0, tot0, c, decode+7 ); \
-    stbir__simdf_madd_mem( tot1, tot1, c,decode+10 ); 
+    stbir__simdf_madd_mem( tot1, tot1, c,decode+10 );
 
 #define stbir__3_coeff_only()                           \
     stbir__simdf tot0,tot1,c,cs;                        \
@@ -5451,7 +5630,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_madd_mem( tot1, tot1, c, decode+10 );  \
     stbir__simdf_0123to2222( c, cs );                   \
     stbir__simdf_madd_mem( tot0, tot0, c, decode+14 );  \
-    stbir__simdf_madd_mem( tot1, tot1, c, decode+17 );  
+    stbir__simdf_madd_mem( tot1, tot1, c, decode+17 );
 
 #define stbir__store_output_tiny()                \
     stbir__simdf_store( output+3, tot1 );         \
@@ -5473,7 +5652,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf8_0123to22222222( c, cs );         \
     stbir__simdf8_madd_mem( tot0, tot0, c, decode+14 );  \
     stbir__simdf8_0123to33333333( c, cs );         \
-    stbir__simdf8_madd_mem( tot1, tot1, c, decode+21 );  
+    stbir__simdf8_madd_mem( tot1, tot1, c, decode+21 );
 
 #define stbir__4_coeff_continue_from_4( ofs )                   \
     STBIR_SIMD_NO_UNROLL(decode);                               \
@@ -5485,19 +5664,19 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf8_0123to22222222( c, cs );                      \
     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); \
     stbir__simdf8_0123to33333333( c, cs );                      \
-    stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+21 ); 
+    stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+21 );
 
 #define stbir__1_coeff_remnant( ofs )                           \
     STBIR_SIMD_NO_UNROLL(decode);                               \
     stbir__simdf8_load1b( c, hc + (ofs) );                      \
-    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 );    
+    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 );
 
 #define stbir__2_coeff_remnant( ofs )                           \
     STBIR_SIMD_NO_UNROLL(decode);                               \
     stbir__simdf8_load1b( c, hc + (ofs) );                      \
     stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7 );    \
     stbir__simdf8_load1b( c, hc + (ofs)+1 );                    \
-    stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 );   
+    stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 );
 
 #define stbir__3_coeff_remnant( ofs )                           \
     STBIR_SIMD_NO_UNROLL(decode);                               \
@@ -5507,7 +5686,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf8_0123to11111111( c, cs );                      \
     stbir__simdf8_madd_mem( tot1, tot1, c, decode+(ofs)*7+7 );  \
     stbir__simdf8_0123to22222222( c, cs );                      \
-    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 ); 
+    stbir__simdf8_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 );
 
 #define stbir__store_output()                     \
     stbir__simdf8_add( tot0, tot0, tot1 );        \
@@ -5540,7 +5719,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_madd_mem( tot1, tot1, c, decode+17 );  \
     stbir__simdf_0123to3333( c, cs );                   \
     stbir__simdf_madd_mem( tot2, tot2, c, decode+21 );  \
-    stbir__simdf_madd_mem( tot3, tot3, c, decode+24 );         
+    stbir__simdf_madd_mem( tot3, tot3, c, decode+24 );
 
 #define stbir__4_coeff_continue_from_4( ofs )                   \
     STBIR_SIMD_NO_UNROLL(decode);                               \
@@ -5556,7 +5735,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 );  \
     stbir__simdf_0123to3333( c, cs );                           \
     stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+21 );  \
-    stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+24 );   
+    stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+24 );
 
 #define stbir__1_coeff_remnant( ofs )                           \
     STBIR_SIMD_NO_UNROLL(decode);                               \
@@ -5573,8 +5752,8 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+3 );   \
     stbir__simdf_0123to1111( c, cs );                           \
     stbir__simdf_madd_mem( tot2, tot2, c, decode+(ofs)*7+7 );   \
-    stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 );  
-  
+    stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 );
+
 #define stbir__3_coeff_remnant( ofs )                           \
     STBIR_SIMD_NO_UNROLL(decode);                               \
     stbir__simdf_load( cs, hc + (ofs) );                        \
@@ -5586,7 +5765,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     stbir__simdf_madd_mem( tot3, tot3, c, decode+(ofs)*7+10 );  \
     stbir__simdf_0123to2222( c, cs );                           \
     stbir__simdf_madd_mem( tot0, tot0, c, decode+(ofs)*7+14 );  \
-    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 );  
+    stbir__simdf_madd_mem( tot1, tot1, c, decode+(ofs)*7+17 );
 
 #define stbir__store_output()                     \
     stbir__simdf_add( tot0, tot0, tot2 );         \
@@ -5610,7 +5789,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     tot3 = decode[3]*c;              \
     tot4 = decode[4]*c;              \
     tot5 = decode[5]*c;              \
-    tot6 = decode[6]*c;              
+    tot6 = decode[6]*c;
 
 #define stbir__2_coeff_only()        \
     float tot0, tot1, tot2, tot3, tot4, tot5, tot6, c; \
@@ -5704,7 +5883,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     y3 += decode[24] * c;         \
     y4 += decode[25] * c;         \
     y5 += decode[26] * c;         \
-    y6 += decode[27] * c; 
+    y6 += decode[27] * c;
 
 #define stbir__4_coeff_continue_from_4( ofs ) \
     STBIR_SIMD_NO_UNROLL(decode);  \
@@ -5739,7 +5918,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     y3 += decode[24+(ofs)*7] * c;  \
     y4 += decode[25+(ofs)*7] * c;  \
     y5 += decode[26+(ofs)*7] * c;  \
-    y6 += decode[27+(ofs)*7] * c; 
+    y6 += decode[27+(ofs)*7] * c;
 
 #define stbir__1_coeff_remnant( ofs ) \
     STBIR_SIMD_NO_UNROLL(decode);  \
@@ -5770,7 +5949,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     y4 += decode[11+(ofs)*7] * c;  \
     y5 += decode[12+(ofs)*7] * c;  \
     y6 += decode[13+(ofs)*7] * c;  \
-  
+
 #define stbir__3_coeff_remnant( ofs ) \
     STBIR_SIMD_NO_UNROLL(decode);  \
     c = hc[0+(ofs)];               \
@@ -5810,7 +5989,7 @@ static void stbir__decode_scanline(stbir__info const * stbir_info, int n, float
     ++horizontal_contributors;                    \
     output += 7;
 
-#endif  
+#endif
 
 #define STBIR__horizontal_channels 7
 #define STB_IMAGE_RESIZE_DO_HORIZONTALS
@@ -5937,7 +6116,7 @@ static void stbir__encode_scanline( stbir__info const * stbir_info, void *output
   // if we have an output callback, we first convert the decode buffer in place (and then hand that to the callback)
   if ( stbir_info->out_pixels_cb )
     output_buffer = encode_buffer;
-  
+
   STBIR_PROFILE_START( encode );
   // convert into the output buffer
   stbir_info->encode_pixels( output_buffer, width_times_channels, encode_buffer );
@@ -5945,7 +6124,7 @@ static void stbir__encode_scanline( stbir__info const * stbir_info, void *output
 
   // if we have an output callback, call it to send the data
   if ( stbir_info->out_pixels_cb )
-    stbir_info->out_pixels_cb( output_buffer_data, num_pixels, row, stbir_info->user_data );
+    stbir_info->out_pixels_cb( output_buffer, num_pixels, row, stbir_info->user_data );
 }
 
 
@@ -6015,7 +6194,7 @@ static void stbir__resample_vertical_gather(stbir__info const * stbir_info, stbi
     stbir__resample_horizontal_gather(stbir_info, encode_buffer, decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
   }
 
-  stbir__encode_scanline( stbir_info, ( (char *) stbir_info->output_data ) + ((ptrdiff_t)n * (ptrdiff_t)stbir_info->output_stride_bytes), 
+  stbir__encode_scanline( stbir_info, ( (char *) stbir_info->output_data ) + ((size_t)n * (size_t)stbir_info->output_stride_bytes),
                           encode_buffer, n  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
 }
 
@@ -6030,7 +6209,7 @@ static void stbir__decode_and_resample_for_vertical_gather_loop(stbir__info cons
   // update new end scanline
   split_info->ring_buffer_last_scanline = n;
 
-  // get ring buffer 
+  // get ring buffer
   ring_buffer_index = (split_info->ring_buffer_begin_index + (split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline)) % stbir_info->ring_buffer_num_entries;
   ring_buffer = stbir__get_ring_buffer_entry(stbir_info, split_info, ring_buffer_index);
 
@@ -6056,7 +6235,7 @@ static void stbir__vertical_gather_loop( stbir__info const * stbir_info, stbir__
 
   // initialize the ring buffer for gathering
   split_info->ring_buffer_begin_index = 0;
-  split_info->ring_buffer_first_scanline = stbir_info->vertical.extent_info.lowest;  
+  split_info->ring_buffer_first_scanline = vertical_contributors->n0;
   split_info->ring_buffer_last_scanline = split_info->ring_buffer_first_scanline - 1; // means "empty"
 
   for (y = start_output_y; y < end_output_y; y++)
@@ -6080,12 +6259,12 @@ static void stbir__vertical_gather_loop( stbir__info const * stbir_info, stbir__
         split_info->ring_buffer_first_scanline++;
         split_info->ring_buffer_begin_index++;
       }
-      
+
       if ( stbir_info->vertical_first )
       {
         float * ring_buffer = stbir__get_ring_buffer_scanline( stbir_info, split_info, ++split_info->ring_buffer_last_scanline );
         // Decode the nth scanline from the source image into the decode buffer.
-        stbir__decode_scanline( stbir_info, split_info->ring_buffer_last_scanline, ring_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); 
+        stbir__decode_scanline( stbir_info, split_info->ring_buffer_last_scanline, ring_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
       }
       else
       {
@@ -6108,10 +6287,10 @@ static void stbir__encode_first_scanline_from_scatter(stbir__info const * stbir_
 {
   // evict a scanline out into the output buffer
   float* ring_buffer_entry = stbir__get_ring_buffer_entry(stbir_info, split_info, split_info->ring_buffer_begin_index );
-  
+
   // dump the scanline out
-  stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (ptrdiff_t)split_info->ring_buffer_first_scanline * (ptrdiff_t)stbir_info->output_stride_bytes ), ring_buffer_entry, split_info->ring_buffer_first_scanline  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
-  
+  stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (size_t)split_info->ring_buffer_first_scanline * (size_t)stbir_info->output_stride_bytes ), ring_buffer_entry, split_info->ring_buffer_first_scanline  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
+
   // mark it as empty
   ring_buffer_entry[ 0 ] = STBIR__FLOAT_EMPTY_MARKER;
 
@@ -6129,10 +6308,10 @@ static void stbir__horizontal_resample_and_encode_first_scanline_from_scatter(st
 
   // Now resample it into the buffer.
   stbir__resample_horizontal_gather( stbir_info, split_info->vertical_buffer, ring_buffer_entry  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
-  
+
   // dump the scanline out
-  stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (ptrdiff_t)split_info->ring_buffer_first_scanline * (ptrdiff_t)stbir_info->output_stride_bytes ), split_info->vertical_buffer, split_info->ring_buffer_first_scanline  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
-  
+  stbir__encode_scanline( stbir_info, ( (char *)stbir_info->output_data ) + ( (size_t)split_info->ring_buffer_first_scanline * (size_t)stbir_info->output_stride_bytes ), split_info->vertical_buffer, split_info->ring_buffer_first_scanline  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
+
   // mark it as empty
   ring_buffer_entry[ 0 ] = STBIR__FLOAT_EMPTY_MARKER;
 
@@ -6172,7 +6351,7 @@ static void stbir__resample_vertical_scatter(stbir__info const * stbir_info, stb
   STBIR_PROFILE_END( vertical );
 }
 
-typedef void stbir__handle_scanline_for_scatter_func(stbir__info const * stbir_info, stbir__per_split_info* split_info); 
+typedef void stbir__handle_scanline_for_scatter_func(stbir__info const * stbir_info, stbir__per_split_info* split_info);
 
 static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir__per_split_info* split_info, int split_count )
 {
@@ -6193,7 +6372,7 @@ static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir_
   end_input_y = split_info[split_count-1].end_input_y;
 
   // adjust for starting offset start_input_y
-  y = start_input_y + stbir_info->vertical.filter_pixel_margin; 
+  y = start_input_y + stbir_info->vertical.filter_pixel_margin;
   vertical_contributors += y ;
   vertical_coefficients += stbir_info->vertical.coefficient_width * y;
 
@@ -6240,7 +6419,7 @@ static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir_
         split_info->start_input_y = y;
       on_first_input_y = 0;
 
-      // clip the region 
+      // clip the region
       if ( out_first_scanline < start_output_y )
       {
         vc += start_output_y - out_first_scanline;
@@ -6253,11 +6432,11 @@ static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir_
       // if very first scanline, init the index
       if (split_info->ring_buffer_begin_index < 0)
         split_info->ring_buffer_begin_index = out_first_scanline - start_output_y;
-      
+
       STBIR_ASSERT( split_info->ring_buffer_begin_index <= out_first_scanline );
 
       // Decode the nth scanline from the source image into the decode buffer.
-      stbir__decode_scanline( stbir_info, y, split_info->decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO ); 
+      stbir__decode_scanline( stbir_info, y, split_info->decode_buffer  STBIR_ONLY_PROFILE_SET_SPLIT_INFO );
 
       // When horizontal first, we resample horizontally into the vertical buffer before we scatter it out
       if ( !stbir_info->vertical_first )
@@ -6269,7 +6448,7 @@ static void stbir__vertical_scatter_loop( stbir__info const * stbir_info, stbir_
       if ( ( ( split_info->ring_buffer_last_scanline - split_info->ring_buffer_first_scanline + 1 ) == stbir_info->ring_buffer_num_entries ) &&
            ( out_last_scanline > split_info->ring_buffer_last_scanline ) )
         handle_scanline_for_scatter( stbir_info, split_info );
-    
+
       // Now the horizontal buffer is ready to write to all ring buffer rows, so do it.
       stbir__resample_vertical_scatter(stbir_info, split_info, out_first_scanline, out_last_scanline, vc, (float*)scanline_scatter_buffer, (float*)scanline_scatter_buffer_end );
 
@@ -6305,7 +6484,7 @@ static void stbir__set_sampler(stbir__sampler * samp, stbir_filter filter, stbir
     if (scale_info->scale >= ( 1.0f - stbir__small_float ) )
     {
       if ( (scale_info->scale <= ( 1.0f + stbir__small_float ) ) && ( STBIR_CEILF(scale_info->pixel_shift) == scale_info->pixel_shift ) )
-        filter = STBIR_FILTER_POINT_SAMPLE;  
+        filter = STBIR_FILTER_POINT_SAMPLE;
       else
         filter = STBIR_DEFAULT_FILTER_UPSAMPLE;
     }
@@ -6313,7 +6492,7 @@ static void stbir__set_sampler(stbir__sampler * samp, stbir_filter filter, stbir
   samp->filter_enum = filter;
 
   STBIR_ASSERT(samp->filter_enum != 0);
-  STBIR_ASSERT((unsigned)samp->filter_enum < STBIR_FILTER_OTHER); 
+  STBIR_ASSERT((unsigned)samp->filter_enum < STBIR_FILTER_OTHER);
   samp->filter_kernel = stbir__builtin_kernels[ filter ];
   samp->filter_support = stbir__builtin_supports[ filter ];
 
@@ -6339,15 +6518,31 @@ static void stbir__set_sampler(stbir__sampler * samp, stbir_filter filter, stbir
   // pre calculate stuff based on the above
   samp->coefficient_width = stbir__get_coefficient_width(samp, samp->is_gather, user_data);
 
+  // filter_pixel_width is the conservative size, in input pixels, of the region that affects an
+  //   output pixel. In rare cases (only with 2 pix to 1 pix with the default filters), the filter
+  //   can extend before or after the scanline by more than one extra entire copy of the
+  //   scanline (we would hit the edge twice). We don't let you do that, so we clamp the total
+  //   width to 3x the input pixel count (once for the scanline, once for the left side
+  //   overhang, and once for the right side). We only do this for wrap edge mode, since the other
+  //   modes can just re-edge clamp back in again.
   if ( edge == STBIR_EDGE_WRAP )
-    if ( samp->filter_pixel_width > ( scale_info->input_full_size * 2 ) )  // this can only happen when shrinking to a single pixel
-      samp->filter_pixel_width = scale_info->input_full_size * 2;
+    if ( samp->filter_pixel_width > ( scale_info->input_full_size * 3 ) )
+      samp->filter_pixel_width = scale_info->input_full_size * 3;
 
   // This is how much to expand buffers to account for filters seeking outside
   // the image boundaries.
   samp->filter_pixel_margin = samp->filter_pixel_width / 2;
+
+  // filter_pixel_margin is the amount that this filter can overhang on just one side of the
+  //   scanline (the left or the right). Since we only allow you to overhang 1 scanline's
+  //   worth of pixels, we clamp this one-sided overhang to the input scanline size. Again,
+  //   this clamping only happens in rare cases with the default filters (2 pix to 1 pix).
+  if ( edge == STBIR_EDGE_WRAP )
+    if ( samp->filter_pixel_margin > scale_info->input_full_size )
+      samp->filter_pixel_margin = scale_info->input_full_size;
 
   samp->num_contributors = stbir__get_contributors(samp, samp->is_gather);
+
   samp->contributors_size = samp->num_contributors * sizeof(stbir__contributors);
   samp->coefficients_size = samp->num_contributors * samp->coefficient_width * sizeof(float) + sizeof(float); // extra sizeof(float) is padding
 
@@ -6397,8 +6592,8 @@ static void stbir__get_conservative_extents( stbir__sampler * samp, stbir__contr
     range->n0 = in_first_pixel;
     stbir__calculate_in_pixel_range( &in_first_pixel, &in_last_pixel, (float)output_sub_size, 0, inv_scale, out_shift, input_full_size, edge );
     range->n1 = in_last_pixel;
-     
-    // now go through the margin to the start of area to find bottom 
+
+    // now go through the margin to the start of area to find bottom
     n = range->n0 + 1;
     input_end = -filter_pixel_margin;
     while( n >= input_end )
@@ -6413,7 +6608,7 @@ static void stbir__get_conservative_extents( stbir__sampler * samp, stbir__contr
       --n;
     }
 
-    // now go through the end of the area through the margin to find top 
+    // now go through the end of the area through the margin to find top
     n = range->n1 - 1;
     input_end = n + 1 + filter_pixel_margin;
     while( n <= input_end )
@@ -6462,7 +6657,7 @@ static void stbir__get_split_info( stbir__per_split_info* split_info, int splits
   cur = 0;
   for( i = 0 ; i < splits ; i++ )
   {
-    int each; 
+    int each;
     split_info[i].start_output_y = cur;
     each = left / ( splits - i );
     split_info[i].end_output_y = cur + each;
@@ -6478,7 +6673,7 @@ static void stbir__get_split_info( stbir__per_split_info* split_info, int splits
 static void stbir__free_internal_mem( stbir__info *info )
 {
   #define STBIR__FREE_AND_CLEAR( ptr ) { if ( ptr ) { void * p = (ptr); (ptr) = 0; STBIR_FREE( p, info->user_data); } }
-  
+
   if ( info )
   {
   #ifndef STBIR__SEPARATE_ALLOCATIONS
@@ -6496,16 +6691,16 @@ static void stbir__free_internal_mem( stbir__info *info )
       for( j = 0 ; j < info->alloc_ring_buffer_num_entries ; j++ )
       {
         #ifdef STBIR_SIMD8
-        if ( info->effective_channels == 3 ) 
+        if ( info->effective_channels == 3 )
           --info->split_info[i].ring_buffers[j]; // avx in 3 channel mode needs one float at the start of the buffer
-        #endif  
+        #endif
         STBIR__FREE_AND_CLEAR( info->split_info[i].ring_buffers[j] );
       }
 
       #ifdef STBIR_SIMD8
-      if ( info->effective_channels == 3 ) 
+      if ( info->effective_channels == 3 )
         --info->split_info[i].decode_buffer; // avx in 3 channel mode needs one float at the start of the buffer
-      #endif  
+      #endif
       STBIR__FREE_AND_CLEAR( info->split_info[i].decode_buffer );
       STBIR__FREE_AND_CLEAR( info->split_info[i].ring_buffers );
       STBIR__FREE_AND_CLEAR( info->split_info[i].vertical_buffer );
@@ -6522,7 +6717,7 @@ static void stbir__free_internal_mem( stbir__info *info )
     STBIR__FREE_AND_CLEAR( info );
   #endif
   }
-  
+
   #undef STBIR__FREE_AND_CLEAR
 }
 
@@ -6534,20 +6729,20 @@ static int stbir__get_max_split( int splits, int height )
   for( i = 0 ; i < splits ; i++ )
   {
     int each = height / ( splits - i );
-    if ( each > max ) 
+    if ( each > max )
       max = each;
     height -= each;
   }
   return max;
 }
 
-static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_n_coeffs_funcs[8] = 
-{ 
+static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_n_coeffs_funcs[8] =
+{
   0, stbir__horizontal_gather_1_channels_with_n_coeffs_funcs, stbir__horizontal_gather_2_channels_with_n_coeffs_funcs, stbir__horizontal_gather_3_channels_with_n_coeffs_funcs, stbir__horizontal_gather_4_channels_with_n_coeffs_funcs, 0,0, stbir__horizontal_gather_7_channels_with_n_coeffs_funcs
 };
 
-static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_channels_funcs[8] = 
-{ 
+static stbir__horizontal_gather_channels_func ** stbir__horizontal_gather_channels_funcs[8] =
+{
   0, stbir__horizontal_gather_1_channels_funcs, stbir__horizontal_gather_2_channels_funcs, stbir__horizontal_gather_3_channels_funcs, stbir__horizontal_gather_4_channels_funcs, 0,0, stbir__horizontal_gather_7_channels_funcs
 };
 
@@ -6622,28 +6817,28 @@ static STBIR__V_FIRST_INFO STBIR__V_FIRST_INFO_BUFFER = {0};
 #endif
 
 // Figure out whether to scale along the horizontal or vertical first.
-//   This only *super* important when you are scaling by a massively 
-//   different amount in the vertical vs the horizontal (for example, if 
-//   you are scaling by 2x in the width, and 0.5x in the height, then you 
-//   want to do the vertical scale first, because it's around 3x faster 
+//   This is only *super* important when you are scaling by a massively
+//   different amount in the vertical vs the horizontal (for example, if
+//   you are scaling by 2x in the width, and 0.5x in the height, then you
+//   want to do the vertical scale first, because it's around 3x faster
 //   in that order.
 //
-//   In more normal circumstances, this makes a 20-40% differences, so 
+//   In more normal circumstances, this makes a 20-40% difference, so
 //     it's good to get right, but not critical. The normal way that you
-//     decide which direction goes first is just figuring out which 
-//     direction does more multiplies. But with modern CPUs with their 
+//     decide which direction goes first is just figuring out which
+//     direction does more multiplies. But modern CPUs have their
 //     fancy caches and SIMD and high IPC abilities, so there's just a lot
-//     more that goes into it. 
+//     more that goes into it.
 //
-//   My handwavy sort of solution is to have an app that does a whole 
+//   My handwavy sort of solution is to have an app that does a whole
 //     bunch of timing for both vertical and horizontal first modes,
 //     and then another app that can read lots of these timing files
 //     and try to search for the best weights to use. Dotimings.c
 //     is the app that does a bunch of timings, and vf_train.c is the
-//     app that solves for the best weights (and shows how well it 
+//     app that solves for the best weights (and shows how well it
 //     does currently).
 
-static int stbir__should_do_vertical_first( float weights_table[STBIR_RESIZE_CLASSIFICATIONS][4], int horizontal_filter_pixel_width, float horizontal_scale, int horizontal_output_size, int vertical_filter_pixel_width, float vertical_scale, int vertical_output_size, int is_gather, STBIR__V_FIRST_INFO * info )    
+static int stbir__should_do_vertical_first( float weights_table[STBIR_RESIZE_CLASSIFICATIONS][4], int horizontal_filter_pixel_width, float horizontal_scale, int horizontal_output_size, int vertical_filter_pixel_width, float vertical_scale, int vertical_output_size, int is_gather, STBIR__V_FIRST_INFO * info )
 {
   double v_cost, h_cost;
   float * weights;
@@ -6655,15 +6850,15 @@ static int stbir__should_do_vertical_first( float weights_table[STBIR_RESIZE_CLA
     v_classification = ( vertical_output_size < horizontal_output_size ) ? 6 : 7;
   else if ( vertical_scale <= 1.0f )
     v_classification = ( is_gather ) ? 1 : 0;
-  else if ( vertical_scale <= 2.0f) 
+  else if ( vertical_scale <= 2.0f)
     v_classification = 2;
-  else if ( vertical_scale <= 3.0f) 
+  else if ( vertical_scale <= 3.0f)
     v_classification = 3;
-  else if ( vertical_scale <= 4.0f) 
+  else if ( vertical_scale <= 4.0f)
     v_classification = 5;
-  else 
+  else
     v_classification = 6;
-  
+
   // use the right weights
   weights = weights_table[ v_classification ];
 
@@ -6684,10 +6879,10 @@ static int stbir__should_do_vertical_first( float weights_table[STBIR_RESIZE_CLA
     info->is_gather = is_gather;
   }
 
-  // and this allows us to override everything for testing (see dotiming.c) 
-  if ( ( info ) && ( info->control_v_first ) ) 
+  // and this allows us to override everything for testing (see dotiming.c)
+  if ( ( info ) && ( info->control_v_first ) )
     vertical_first = ( info->control_v_first == 2 ) ? 1 : 0;
-  
+
   return vertical_first;
 }
 
@@ -6699,9 +6894,9 @@ static unsigned char stbir__pixel_channels[] = {
 };
 
 // the internal pixel layout enums are in a different order, so we can easily do range comparisons of types
-//   the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible 
+//   the public pixel layout is ordered in a way that if you cast num_channels (1-4) to the enum, you get something sensible
 static stbir_internal_pixel_layout stbir__pixel_layout_convert_public_to_internal[] = {
-  STBIRI_BGR, STBIRI_1CHANNEL, STBIRI_2CHANNEL, STBIRI_RGB, STBIRI_RGBA, 
+  STBIRI_BGR, STBIRI_1CHANNEL, STBIRI_2CHANNEL, STBIRI_RGB, STBIRI_RGBA,
   STBIRI_4CHANNEL, STBIRI_BGRA, STBIRI_ARGB, STBIRI_ABGR, STBIRI_RA, STBIRI_AR,
   STBIRI_RGBA_PM, STBIRI_BGRA_PM, STBIRI_ARGB_PM, STBIRI_ABGR_PM, STBIRI_RA_PM, STBIRI_AR_PM,
 };
@@ -6712,17 +6907,17 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
 
   stbir__info * info = 0;
   void * alloced = 0;
-  int alloced_total = 0;
+  size_t alloced_total = 0;
   int vertical_first;
   int decode_buffer_size, ring_buffer_length_bytes, ring_buffer_size, vertical_buffer_size, alloc_ring_buffer_num_entries;
 
   int alpha_weighting_type = 0; // 0=none, 1=simple, 2=fancy
-  int conservative_split_output_size = stbir__get_max_split( splits, vertical->scale_info.output_sub_size ); 
-  stbir_internal_pixel_layout input_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ input_pixel_layout_public ];  
+  int conservative_split_output_size = stbir__get_max_split( splits, vertical->scale_info.output_sub_size );
+  stbir_internal_pixel_layout input_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ input_pixel_layout_public ];
   stbir_internal_pixel_layout output_pixel_layout = stbir__pixel_layout_convert_public_to_internal[ output_pixel_layout_public ];
-  int channels = stbir__pixel_channels[ input_pixel_layout ];      
+  int channels = stbir__pixel_channels[ input_pixel_layout ];
   int effective_channels = channels;
-  
+
   // first figure out what type of alpha weighting to use (if any)
   if ( ( horizontal->filter_enum != STBIR_FILTER_POINT_SAMPLE ) || ( vertical->filter_enum != STBIR_FILTER_POINT_SAMPLE ) ) // no alpha weighting on point sampling
   {
@@ -6760,11 +6955,11 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
 
   // sometimes read one float off in some of the unrolled loops (with a weight of zero coeff, so it doesn't have an effect)
   decode_buffer_size = ( conservative->n1 - conservative->n0 + 1 ) * effective_channels * sizeof(float) + sizeof(float); // extra float for padding
-  
+
 #if defined( STBIR__SEPARATE_ALLOCATIONS ) && defined(STBIR_SIMD8)
   if ( effective_channels == 3 )
     decode_buffer_size += sizeof(float); // avx in 3 channel mode needs one float at the start of the buffer (only with separate allocations)
-#endif  
+#endif
 
   ring_buffer_length_bytes = horizontal->scale_info.output_sub_size * effective_channels * sizeof(float) + sizeof(float); // extra float for padding
 
@@ -6803,9 +6998,9 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
     #define STBIR__NEXT_PTR( ptr, size, ntype ) advance_mem = (void*) ( ( ((size_t)advance_mem) + 15 ) & ~15 ); if ( alloced ) ptr = (ntype*)advance_mem; advance_mem = ((char*)advance_mem) + (size);
 #endif
 
-    STBIR__NEXT_PTR( info, sizeof( stbir__info ), stbir__info );      
+    STBIR__NEXT_PTR( info, sizeof( stbir__info ), stbir__info );
 
-    STBIR__NEXT_PTR( info->split_info, sizeof( stbir__per_split_info ) * splits, stbir__per_split_info );      
+    STBIR__NEXT_PTR( info->split_info, sizeof( stbir__per_split_info ) * splits, stbir__per_split_info );
 
     if ( info )
     {
@@ -6820,39 +7015,39 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
 
       info->channels = channels;
       info->effective_channels = effective_channels;
-  
+
       info->offset_x = new_x;
       info->offset_y = new_y;
       info->alloc_ring_buffer_num_entries = alloc_ring_buffer_num_entries;
-      info->ring_buffer_num_entries = 0;  
+      info->ring_buffer_num_entries = 0;
       info->ring_buffer_length_bytes = ring_buffer_length_bytes;
       info->splits = splits;
       info->vertical_first = vertical_first;
 
-      info->input_pixel_layout_internal = input_pixel_layout;  
+      info->input_pixel_layout_internal = input_pixel_layout;
       info->output_pixel_layout_internal = output_pixel_layout;
 
       // setup alpha weight functions
       info->alpha_weight = 0;
       info->alpha_unweight = 0;
-    
+
       // handle alpha weighting functions and overrides
       if ( alpha_weighting_type == 2 )
       {
         // high quality alpha multiplying on the way in, dividing on the way out
-        info->alpha_weight = fancy_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];  
+        info->alpha_weight = fancy_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];
         info->alpha_unweight = fancy_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ];
       }
       else if ( alpha_weighting_type == 4 )
       {
         // fast alpha multiplying on the way in, dividing on the way out
-        info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];  
+        info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];
         info->alpha_unweight = simple_alpha_unweights[ output_pixel_layout - STBIRI_RGBA ];
       }
       else if ( alpha_weighting_type == 1 )
       {
         // fast alpha on the way in, leave in premultiplied form on way out
-        info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ]; 
+        info->alpha_weight = simple_alpha_weights[ input_pixel_layout - STBIRI_RGBA ];
       }
       else if ( alpha_weighting_type == 3 )
       {
@@ -6871,7 +7066,7 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
           info->alpha_weight = stbir__simple_flip_3ch;
       }
 
-    }        
+    }
 
     // get all the per-split buffers
     for( i = 0 ; i < splits ; i++ )
@@ -6883,7 +7078,7 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
       #ifdef STBIR_SIMD8
       if ( ( info ) && ( effective_channels == 3 ) )
         ++info->split_info[i].decode_buffer; // avx in 3 channel mode needs one float at the start of the buffer
-      #endif  
+      #endif
 
       STBIR__NEXT_PTR( info->split_info[i].ring_buffers, alloc_ring_buffer_num_entries * sizeof(float*), float* );
       {
@@ -6894,7 +7089,7 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
           #ifdef STBIR_SIMD8
           if ( ( info ) && ( effective_channels == 3 ) )
             ++info->split_info[i].ring_buffers[j]; // avx in 3 channel mode needs one float at the start of the buffer
-          #endif  
+          #endif
         }
       }
 #else
@@ -6922,10 +7117,10 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
 #endif
       if ( temp_mem_amt >= both )
       {
-        if ( info ) 
-        { 
-          vertical->gather_prescatter_contributors = (stbir__contributors*)info->split_info[0].decode_buffer; 
-          vertical->gather_prescatter_coefficients = (float*) ( ( (char*)info->split_info[0].decode_buffer ) + vertical->gather_prescatter_contributors_size ); 
+        if ( info )
+        {
+          vertical->gather_prescatter_contributors = (stbir__contributors*)info->split_info[0].decode_buffer;
+          vertical->gather_prescatter_coefficients = (float*) ( ( (char*)info->split_info[0].decode_buffer ) + vertical->gather_prescatter_contributors_size );
         }
       }
       else
@@ -6948,7 +7143,7 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
       if ( diff_shift < 0.0f ) diff_shift = -diff_shift;
       if ( ( diff_scale <= stbir__small_float ) && ( diff_shift <= stbir__small_float ) )
       {
-        if ( horizontal->is_gather == vertical->is_gather ) 
+        if ( horizontal->is_gather == vertical->is_gather )
         {
           copy_horizontal = 1;
           goto no_vert_alloc;
@@ -6975,16 +7170,16 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
       // but if the number of coeffs <= 12, use another set of special cases. <=12 coeffs is any enlarging resize, or shrinking resize down to about 1/3 size
       if ( horizontal->extent_info.widest <= 12 )
         info->horizontal_gather_channels = stbir__horizontal_gather_channels_funcs[ effective_channels ][ horizontal->extent_info.widest - 1 ];
-      
+
       info->scanline_extents.conservative.n0 = conservative->n0;
       info->scanline_extents.conservative.n1 = conservative->n1;
-      
+
       // get exact extents
       stbir__get_extents( horizontal, &info->scanline_extents );
 
       // pack the horizontal coeffs
-      horizontal->coefficient_width = stbir__pack_coefficients(horizontal->num_contributors, horizontal->contributors, horizontal->coefficients, horizontal->coefficient_width, horizontal->extent_info.widest, info->scanline_extents.conservative.n1 + 1 );
-      
+      horizontal->coefficient_width = stbir__pack_coefficients(horizontal->num_contributors, horizontal->contributors, horizontal->coefficients, horizontal->coefficient_width, horizontal->extent_info.widest, info->scanline_extents.conservative.n0, info->scanline_extents.conservative.n1 );
+
       STBIR_MEMCPY( &info->horizontal, horizontal, sizeof( stbir__sampler ) );
 
       STBIR_PROFILE_BUILD_END( horizontal );
@@ -7014,36 +7209,27 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
         info->ring_buffer_num_entries = conservative_split_output_size;
       STBIR_ASSERT( info->ring_buffer_num_entries <= info->alloc_ring_buffer_num_entries );
 
-      // a few of the horizontal gather functions read one dword past the end (but mask it out), so put in a normal value so no snans or denormals accidentally sneak in
+      // a few of the horizontal gather functions read past the end of the decode buffer (but mask
+      //   it out), so put in normal values so no snans or denormals accidentally sneak in (also, in
+      //   the ring buffers for vertical first)
       for( i = 0 ; i < splits ; i++ )
       {
-        int width, ofs;
-        
-        // find the right most span
-        if ( info->scanline_extents.spans[0].n1 > info->scanline_extents.spans[1].n1 )
-          width = info->scanline_extents.spans[0].n1 - info->scanline_extents.spans[0].n0;
-        else
-          width = info->scanline_extents.spans[1].n1 - info->scanline_extents.spans[1].n0;
-        
-        // this calc finds the exact end of the decoded scanline for all filter modes.
-        //   usually this is just the width * effective channels.  But we have to account 
-        //   for the area to the left of the scanline for wrap filtering and alignment, this 
-        //   is stored as a negative value in info->scanline_extents.conservative.n0. Next,
-        //   we need to skip the exact size of the right hand size filter area (again for
-        //   wrap mode), this is in info->scanline_extents.edge_sizes[1]).
-        ofs = ( width + 1 - info->scanline_extents.conservative.n0 + info->scanline_extents.edge_sizes[1] ) * effective_channels;
-        
-        // place a known, but numerically valid value in the decode buffer
-        info->split_info[i].decode_buffer[ ofs ] = 9999.0f;
+        int t, ofs, start;
+
+        ofs = decode_buffer_size / 4;
+        start = ofs - 4;
+        if ( start < 0 ) start = 0;
+
+        for( t = start ; t < ofs; t++ )
+          info->split_info[i].decode_buffer[ t ] = 9999.0f;
 
-        // if vertical filtering first, place a known, but numerically valid value in the all
-        //   of the ring buffer accumulators
         if ( vertical_first )
         {
-          int j;  
+          int j;
           for( j = 0; j < info->ring_buffer_num_entries ; j++ )
           {
-            stbir__get_ring_buffer_entry( info, info->split_info + i, j )[ ofs ] = 9999.0f;
+            for( t = start ; t < ofs; t++ )
+              stbir__get_ring_buffer_entry( info, info->split_info + i, j )[ t ] = 9999.0f;
           }
         }
       }
@@ -7055,7 +7241,7 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
     // is this the first time through loop?
     if ( info == 0 )
     {
-      alloced_total = (int) ( 15 + (size_t)advance_mem );
+      alloced_total = ( 15 + (size_t)advance_mem );
       alloced = STBIR_MALLOC( alloced_total, user_data );
       if ( alloced == 0 )
         return 0;
@@ -7065,7 +7251,7 @@ static stbir__info * stbir__alloc_internal_mem_and_build_samplers( stbir__sample
   }
 }
 
-static int stbir__perform_resize( stbir__info const * info, int split_start, int split_count ) 
+static int stbir__perform_resize( stbir__info const * info, int split_start, int split_count )
 {
   stbir__per_split_info * split_info = info->split_info + split_start;
 
@@ -7085,7 +7271,7 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r
 {
   static stbir__decode_pixels_func * decode_simple[STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
   {
-    /* 1ch-4ch */ stbir__decode_uint8_srgb, stbir__decode_uint8_srgb, 0, stbir__decode_float_linear, stbir__decode_half_float_linear, 
+    /* 1ch-4ch */ stbir__decode_uint8_srgb, stbir__decode_uint8_srgb, 0, stbir__decode_float_linear, stbir__decode_half_float_linear,
   };
 
   static stbir__decode_pixels_func * decode_alphas[STBIRI_AR-STBIRI_RGBA+1][STBIR_TYPE_HALF_FLOAT-STBIR_TYPE_UINT8_SRGB+1]=
@@ -7148,7 +7334,7 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r
   stbir_datatype input_type, output_type;
 
   input_type = resize->input_data_type;
-  output_type = resize->output_data_type; 
+  output_type = resize->output_data_type;
   info->input_data = resize->input_pixels;
   info->input_stride_bytes = resize->input_stride_in_bytes;
   info->output_stride_bytes = resize->output_stride_in_bytes;
@@ -7156,7 +7342,7 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r
   // if we're completely point sampling, then we can turn off SRGB
   if ( ( info->horizontal.filter_enum == STBIR_FILTER_POINT_SAMPLE ) && ( info->vertical.filter_enum == STBIR_FILTER_POINT_SAMPLE ) )
   {
-    if ( ( ( input_type  == STBIR_TYPE_UINT8_SRGB ) || ( input_type  == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) && 
+    if ( ( ( input_type  == STBIR_TYPE_UINT8_SRGB ) || ( input_type  == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) &&
          ( ( output_type == STBIR_TYPE_UINT8_SRGB ) || ( output_type == STBIR_TYPE_UINT8_SRGB_ALPHA ) ) )
     {
       input_type = STBIR_TYPE_UINT8;
@@ -7164,7 +7350,7 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r
     }
   }
 
-  // recalc the output and input strides  
+  // recalc the output and input strides
   if ( info->input_stride_bytes == 0 )
     info->input_stride_bytes = info->channels * info->horizontal.scale_info.input_full_size * stbir__type_size[input_type];
 
@@ -7172,7 +7358,7 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r
     info->output_stride_bytes = info->channels * info->horizontal.scale_info.output_sub_size * stbir__type_size[output_type];
 
   // calc offset
-  info->output_data = ( (char*) resize->output_pixels ) + ( (ptrdiff_t) info->offset_y * (ptrdiff_t) resize->output_stride_in_bytes ) + ( info->offset_x * info->channels * stbir__type_size[output_type] );
+  info->output_data = ( (char*) resize->output_pixels ) + ( (size_t) info->offset_y * (size_t) resize->output_stride_in_bytes ) + ( info->offset_x * info->channels * stbir__type_size[output_type] );
 
   info->in_pixels_cb = resize->input_cb;
   info->user_data = resize->user_data;
@@ -7205,7 +7391,7 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r
   if ( ( output_type == STBIR_TYPE_UINT8 ) || ( output_type == STBIR_TYPE_UINT16 ) )
   {
     int non_scaled = 0;
-    
+
     // check if we can run unscaled - 0-255.0/0-65535.0 instead of 0-1.0 (which is a tiny bit faster when doing linear 8->8 or 16->16)
     if ( ( !info->alpha_weight ) && ( !info->alpha_unweight ) ) // don't short circuit when alpha weighting (get everything to 0-1.0 as usual)
       if ( ( ( input_type == STBIR_TYPE_UINT8 ) && ( output_type == STBIR_TYPE_UINT8 ) ) || ( ( input_type == STBIR_TYPE_UINT16 ) && ( output_type == STBIR_TYPE_UINT16 ) ) )
@@ -7225,16 +7411,16 @@ static void stbir__update_info_from_resize( stbir__info * info, STBIR_RESIZE * r
   }
 
   info->input_type = input_type;
-  info->output_type = output_type; 
+  info->output_type = output_type;
   info->decode_pixels = decode_pixels;
-  info->encode_pixels = encode_pixels; 
+  info->encode_pixels = encode_pixels;
 }
 
 static void stbir__clip( int * outx, int * outsubw, int outw, double * u0, double * u1 )
 {
   double per, adj;
   int over;
-  
+
   // do left/top edge
   if ( *outx < 0 )
   {
@@ -7253,7 +7439,7 @@ static void stbir__clip( int * outx, int * outsubw, int outw, double * u0, doubl
     *u1 += adj; // decrease u1
     *outsubw = outw - *outx;
   }
-}    
+}
 
 // converts a double to a rational that has less than one float bit of error (returns 0 if unable to do so)
 static int stbir__double_to_rational(double f, stbir_uint32 limit, stbir_uint32 *numer, stbir_uint32 *denom, int limit_denom ) // limit_denom (1) or limit numer (0)
@@ -7270,7 +7456,7 @@ static int stbir__double_to_rational(double f, stbir_uint32 limit, stbir_uint32
   bot = 1 << 25;
 
   // keep refining, but usually stops in a few loops - usually 5 for bad cases
-  for(;;)  
+  for(;;)
   {
     stbir_uint64 est, temp;
 
@@ -7303,13 +7489,13 @@ static int stbir__double_to_rational(double f, stbir_uint32 limit, stbir_uint32
     bot = temp;
 
     // move remainders
-    temp = est * denom_estimate + denom_last; 
-    denom_last = denom_estimate; 
+    temp = est * denom_estimate + denom_last;
+    denom_last = denom_estimate;
     denom_estimate = temp;
 
     // move remainders
-    temp = est * numer_estimate + numer_last; 
-    numer_last = numer_estimate; 
+    temp = est * numer_estimate + numer_last;
+    numer_last = numer_estimate;
     numer_estimate = temp;
   }
 
@@ -7353,11 +7539,11 @@ static int stbir__calculate_region_transform( stbir__scale_info * scale_info, in
 
   output_s = ( (double)output_sub_range) / output_range;
 
-  // figure out the scaling to use 
-  ratio = output_s / input_s; 
+  // figure out the scaling to use
+  ratio = output_s / input_s;
 
   // save scale before clipping
-  scale = ( output_range / input_range ) * ratio; 
+  scale = ( output_range / input_range ) * ratio;
   scale_info->scale = (float)scale;
   scale_info->inv_scale = (float)( 1.0 / scale );
 
@@ -7368,11 +7554,11 @@ static int stbir__calculate_region_transform( stbir__scale_info * scale_info, in
   input_s = input_s1 - input_s0;
 
   // after clipping do we have zero input area?
-  if ( input_s <= stbir__small_float ) 
+  if ( input_s <= stbir__small_float )
     return 0;
 
-  // calculate and store the starting source offsets in output pixel space 
-  scale_info->pixel_shift = (float) ( input_s0 * ratio * output_range ); 
+  // calculate and store the starting source offsets in output pixel space
+  scale_info->pixel_shift = (float) ( input_s0 * ratio * output_range );
 
   scale_info->scale_is_rational = stbir__double_to_rational( scale, ( scale <= 1.0 ) ? output_full_range : input_full_range, &scale_info->scale_numerator, &scale_info->scale_denominator, ( scale >= 1.0 ) );
 
@@ -7389,7 +7575,6 @@ static void stbir__init_and_set_layout( STBIR_RESIZE * resize, stbir_pixel_layou
   resize->output_cb = 0;
   resize->user_data = resize;
   resize->samplers = 0;
-  resize->needs_rebuild = 1;
   resize->called_alloc = 0;
   resize->horizontal_filter = STBIR_FILTER_DEFAULT;
   resize->horizontal_filter_kernel = 0; resize->horizontal_filter_support = 0;
@@ -7403,9 +7588,10 @@ static void stbir__init_and_set_layout( STBIR_RESIZE * resize, stbir_pixel_layou
   resize->output_data_type = data_type;
   resize->input_pixel_layout_public = pixel_layout;
   resize->output_pixel_layout_public = pixel_layout;
+  resize->needs_rebuild = 1;
 }
 
-STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize, 
+STBIRDEF void stbir_resize_init( STBIR_RESIZE * resize,
                                  const void *input_pixels,  int input_w,  int input_h, int input_stride_in_bytes, // stride can be zero
                                        void *output_pixels, int output_w, int output_h, int output_stride_in_bytes, // stride can be zero
                                  stbir_pixel_layout pixel_layout, stbir_datatype data_type )
@@ -7428,17 +7614,27 @@ STBIRDEF void stbir_set_datatypes( STBIR_RESIZE * resize, stbir_datatype input_t
 {
   resize->input_data_type = input_type;
   resize->output_data_type = output_type;
+  if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
+    stbir__update_info_from_resize( resize->samplers, resize );
 }
 
 STBIRDEF void stbir_set_pixel_callbacks( STBIR_RESIZE * resize, stbir_input_callback * input_cb, stbir_output_callback * output_cb )   // no callbacks by default
 {
   resize->input_cb = input_cb;
   resize->output_cb = output_cb;
+
+  if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
+  {
+    resize->samplers->in_pixels_cb = input_cb;
+    resize->samplers->out_pixels_cb = output_cb;
+  }
 }
 
 STBIRDEF void stbir_set_user_data( STBIR_RESIZE * resize, void * user_data )                                     // pass back STBIR_RESIZE* by default
 {
   resize->user_data = user_data;
+  if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
+    resize->samplers->user_data = user_data;
 }
 
 STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_pixels, int input_stride_in_bytes, void * output_pixels, int output_stride_in_bytes )
@@ -7447,6 +7643,8 @@ STBIRDEF void stbir_set_buffer_ptrs( STBIR_RESIZE * resize, const void * input_p
   resize->input_stride_in_bytes = input_stride_in_bytes;
   resize->output_pixels = output_pixels;
   resize->output_stride_in_bytes = output_stride_in_bytes;
+  if ( ( resize->samplers ) && ( !resize->needs_rebuild ) )
+    stbir__update_info_from_resize( resize->samplers, resize );
 }
 
 
@@ -7549,7 +7747,7 @@ STBIRDEF int stbir_set_pixel_subrect( STBIR_RESIZE * resize, int subx, int suby,
   return 1;
 }
 
-static int stbir__perform_build( STBIR_RESIZE * resize, int splits ) 
+static int stbir__perform_build( STBIR_RESIZE * resize, int splits )
 {
   stbir__contributors conservative = { 0, 0 };
   stbir__sampler horizontal, vertical;
@@ -7563,13 +7761,13 @@ static int stbir__perform_build( STBIR_RESIZE * resize, int splits )
   // have we already built the samplers?
   if ( resize->samplers )
     return 0;
-  
+
   #define STBIR_RETURN_ERROR_AND_ASSERT( exp )  STBIR_ASSERT( !(exp) ); if (exp) return 0;
   STBIR_RETURN_ERROR_AND_ASSERT( (unsigned)resize->horizontal_filter >= STBIR_FILTER_OTHER)
   STBIR_RETURN_ERROR_AND_ASSERT( (unsigned)resize->vertical_filter >= STBIR_FILTER_OTHER)
   #undef STBIR_RETURN_ERROR_AND_ASSERT
 
-  if ( splits <= 0 ) 
+  if ( splits <= 0 )
     return 0;
 
   STBIR_PROFILE_BUILD_FIRST_START( build );
@@ -7593,9 +7791,9 @@ static int stbir__perform_build( STBIR_RESIZE * resize, int splits )
   stbir__get_conservative_extents( &horizontal, &conservative, resize->user_data );
   stbir__set_sampler(&vertical, resize->vertical_filter, resize->horizontal_filter_kernel, resize->vertical_filter_support, resize->vertical_edge, &vertical.scale_info, 0, resize->user_data );
 
-  if ( ( vertical.scale_info.output_sub_size / splits ) < 4 ) // each split should be a minimum of 4 scanlines (handwavey choice)
+  if ( ( vertical.scale_info.output_sub_size / splits ) < STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS ) // each split should be a minimum of STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS scanlines (handwavey choice)
   {
-    splits = vertical.scale_info.output_sub_size / 4;
+    splits = vertical.scale_info.output_sub_size / STBIR_FORCE_MINIMUM_SCANLINES_FOR_SPLITS;
     if ( splits == 0 ) splits = 1;
   }
 
@@ -7603,7 +7801,7 @@ static int stbir__perform_build( STBIR_RESIZE * resize, int splits )
   out_info = stbir__alloc_internal_mem_and_build_samplers( &horizontal, &vertical, &conservative, resize->input_pixel_layout_public, resize->output_pixel_layout_public, splits, new_output_subx, new_output_suby, resize->fast_alpha, resize->user_data STBIR_ONLY_PROFILE_BUILD_SET_INFO );
   STBIR_PROFILE_BUILD_END( alloc );
   STBIR_PROFILE_BUILD_END( build );
- 
+
   if ( out_info )
   {
     resize->splits = splits;
@@ -7612,6 +7810,10 @@ static int stbir__perform_build( STBIR_RESIZE * resize, int splits )
     #ifdef STBIR_PROFILE
       STBIR_MEMCPY( &out_info->profile, &profile_infod.profile, sizeof( out_info->profile ) );
     #endif
+
+    // update anything that can be changed without recalcing samplers
+    stbir__update_info_from_resize( out_info, resize );
+
     return splits;
   }
 
@@ -7640,7 +7842,7 @@ STBIRDEF int stbir_build_samplers_with_splits( STBIR_RESIZE * resize, int splits
   }
 
   STBIR_PROFILE_BUILD_CLEAR( resize->samplers );
-  
+
   return 1;
 }
 
@@ -7652,7 +7854,7 @@ STBIRDEF int stbir_build_samplers( STBIR_RESIZE * resize )
 STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize )
 {
   int result;
-  
+
   if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) )
   {
     int alloc_state = resize->called_alloc;  // remember allocated state
@@ -7665,10 +7867,10 @@ STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize )
 
     if ( !stbir_build_samplers( resize ) )
       return 0;
-    
+
     resize->called_alloc = alloc_state;
 
-    // if build_samplers succeeded (above), but there are no samplers set, then 
+    // if build_samplers succeeded (above), but there are no samplers set, then
     //   the area to stretch into was zero pixels, so don't do anything and return
     //   success
     if ( resize->samplers == 0 )
@@ -7680,10 +7882,6 @@ STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize )
     STBIR_PROFILE_BUILD_CLEAR( resize->samplers );
   }
 
-
-  // update anything that can be changed without recalcing samplers
-  stbir__update_info_from_resize( resize->samplers, resize );
-
   // do resize
   result = stbir__perform_resize( resize->samplers, 0, resize->splits );
 
@@ -7692,7 +7890,7 @@ STBIRDEF int stbir_resize_extended( STBIR_RESIZE * resize )
   {
     stbir_free_samplers( resize );
     resize->samplers = 0;
-  } 
+  }
 
   return result;
 }
@@ -7707,14 +7905,11 @@ STBIRDEF int stbir_resize_extended_split( STBIR_RESIZE * resize, int split_start
 
   // you **must** build samplers first when using split resize
   if ( ( resize->samplers == 0 ) || ( resize->needs_rebuild ) )
-    return 0; 
-    
+    return 0;
+
   if ( ( split_start >= resize->splits ) || ( split_start < 0 ) || ( ( split_start + split_count ) > resize->splits ) || ( split_count <= 0 ) )
     return 0;
-  
-  // update anything that can be changed without recalcing samplers
-  stbir__update_info_from_resize( resize->samplers, resize );
- 
+
   // do resize
   return stbir__perform_resize( resize->samplers, split_start, split_count );
 }
@@ -7735,7 +7930,7 @@ static int stbir__check_output_stuff( void ** ret_ptr, int * ret_pitch, void * o
   if ( output_stride_in_bytes < pitch )
     return 0;
 
-  size = output_stride_in_bytes * output_h;
+  size = (size_t)output_stride_in_bytes * (size_t)output_h;
   if ( size == 0 )
     return 0;
 
@@ -7752,7 +7947,7 @@ static int stbir__check_output_stuff( void ** ret_ptr, int * ret_pitch, void * o
     *ret_pitch = pitch;
   }
 
-  return 1;  
+  return 1;
 }
 
 
@@ -7767,9 +7962,9 @@ STBIRDEF unsigned char * stbir_resize_uint8_linear( const unsigned char *input_p
   if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( unsigned char ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
     return 0;
 
-  stbir_resize_init( &resize, 
-                     input_pixels,  input_w,  input_h,  input_stride_in_bytes, 
-                     (optr) ? optr : output_pixels, output_w, output_h, opitch, 
+  stbir_resize_init( &resize,
+                     input_pixels,  input_w,  input_h,  input_stride_in_bytes,
+                     (optr) ? optr : output_pixels, output_w, output_h, opitch,
                      pixel_layout, STBIR_TYPE_UINT8 );
 
   if ( !stbir_resize_extended( &resize ) )
@@ -7793,9 +7988,9 @@ STBIRDEF unsigned char * stbir_resize_uint8_srgb( const unsigned char *input_pix
   if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( unsigned char ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
     return 0;
 
-  stbir_resize_init( &resize, 
-                     input_pixels,  input_w,  input_h,  input_stride_in_bytes, 
-                     (optr) ? optr : output_pixels, output_w, output_h, opitch, 
+  stbir_resize_init( &resize,
+                     input_pixels,  input_w,  input_h,  input_stride_in_bytes,
+                     (optr) ? optr : output_pixels, output_w, output_h, opitch,
                      pixel_layout, STBIR_TYPE_UINT8_SRGB );
 
   if ( !stbir_resize_extended( &resize ) )
@@ -7820,9 +8015,9 @@ STBIRDEF float * stbir_resize_float_linear( const float *input_pixels , int inpu
   if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, sizeof( float ), output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
     return 0;
 
-  stbir_resize_init( &resize, 
-                     input_pixels,  input_w,  input_h,  input_stride_in_bytes, 
-                     (optr) ? optr : output_pixels, output_w, output_h, opitch, 
+  stbir_resize_init( &resize,
+                     input_pixels,  input_w,  input_h,  input_stride_in_bytes,
+                     (optr) ? optr : output_pixels, output_w, output_h, opitch,
                      pixel_layout, STBIR_TYPE_FLOAT );
 
   if ( !stbir_resize_extended( &resize ) )
@@ -7838,7 +8033,7 @@ STBIRDEF float * stbir_resize_float_linear( const float *input_pixels , int inpu
 
 STBIRDEF void * stbir_resize( const void *input_pixels , int input_w , int input_h, int input_stride_in_bytes,
                                     void *output_pixels, int output_w, int output_h, int output_stride_in_bytes,
-                              stbir_pixel_layout pixel_layout, stbir_datatype data_type, 
+                              stbir_pixel_layout pixel_layout, stbir_datatype data_type,
                               stbir_edge edge, stbir_filter filter )
 {
   STBIR_RESIZE resize;
@@ -7848,9 +8043,9 @@ STBIRDEF void * stbir_resize( const void *input_pixels , int input_w , int input
   if ( !stbir__check_output_stuff( (void**)&optr, &opitch, output_pixels, stbir__type_size[data_type], output_w, output_h, output_stride_in_bytes, stbir__pixel_layout_convert_public_to_internal[ pixel_layout ] ) )
     return 0;
 
-  stbir_resize_init( &resize, 
-                     input_pixels,  input_w,  input_h,  input_stride_in_bytes, 
-                     (optr) ? optr : output_pixels, output_w, output_h, output_stride_in_bytes, 
+  stbir_resize_init( &resize,
+                     input_pixels,  input_w,  input_h,  input_stride_in_bytes,
+                     (optr) ? optr : output_pixels, output_w, output_h, output_stride_in_bytes,
                      pixel_layout, data_type );
 
   resize.horizontal_edge = edge;
@@ -7958,7 +8153,7 @@ STBIRDEF void stbir_resize_extended_profile_info( STBIR_PROFILE_INFO * info, STB
 #else  // STB_IMAGE_RESIZE_HORIZONTALS&STB_IMAGE_RESIZE_DO_VERTICALS
 
 // we reinclude the header file to define all the horizontal functions
-//   specializing each function for the number of coeffs is 20-40% faster *OVERALL* 
+//   specializing each function for the number of coeffs is 20-40% faster *OVERALL*
 
 // by including the header file again this way, we can still debug the functions
 
@@ -7991,16 +8186,16 @@ STBIRDEF void stbir_resize_extended_profile_info( STBIR_PROFILE_INFO * info, STB
 #define stbir__encode_order2 2
 #define stbir__encode_order3 3
 #define stbir__decode_simdf8_flip(reg)
-#define stbir__decode_simdf4_flip(reg) 
+#define stbir__decode_simdf4_flip(reg)
 #define stbir__encode_simdf8_unflip(reg)
-#define stbir__encode_simdf4_unflip(reg) 
+#define stbir__encode_simdf4_unflip(reg)
 #endif
 
 #ifdef STBIR_SIMD8
 #define stbir__encode_simdfX_unflip  stbir__encode_simdf8_unflip
 #else
 #define stbir__encode_simdfX_unflip  stbir__encode_simdf4_unflip
-#endif 
+#endif
 
 static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * decodep, int width_times_channels, void const * inputp )
 {
@@ -8013,6 +8208,7 @@ static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * deco
   if ( width_times_channels >= 16 )
   {
     decode_end -= 16;
+    STBIR_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       #ifdef STBIR_SIMD8
@@ -8054,7 +8250,7 @@ static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * deco
       #endif
       decode += 16;
       input += 16;
-      if ( decode <= decode_end ) 
+      if ( decode <= decode_end )
         continue;
       if ( decode == ( decode_end + 16 ) )
         break;
@@ -8068,6 +8264,7 @@ static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * deco
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   decode += 4;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while( decode <= decode_end )
   {
     STBIR_SIMD_NO_UNROLL(decode);
@@ -8083,6 +8280,7 @@ static void STBIR__CODER_NAME( stbir__decode_uint8_linear_scaled )( float * deco
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( decode < decode_end )
   {
     STBIR_NO_UNROLL(decode);
@@ -8109,6 +8307,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outpu
   {
     float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
     end_output -= stbir__simdfX_float_count*2;
+    STBIR_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       stbir__simdfX e0, e1;
@@ -8119,15 +8318,15 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outpu
       stbir__encode_simdfX_unflip( e0 );
       stbir__encode_simdfX_unflip( e1 );
       #ifdef STBIR_SIMD8
-      stbir__simdf8_pack_to_16bytes( i, e0, e1 ); 
+      stbir__simdf8_pack_to_16bytes( i, e0, e1 );
       stbir__simdi_store( output, i );
       #else
-      stbir__simdf_pack_to_8bytes( i, e0, e1 ); 
+      stbir__simdf_pack_to_8bytes( i, e0, e1 );
       stbir__simdi_store2( output, i );
       #endif
       encode += stbir__simdfX_float_count*2;
       output += stbir__simdfX_float_count*2;
-      if ( output <= end_output ) 
+      if ( output <= end_output )
         continue;
       if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
         break;
@@ -8140,6 +8339,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outpu
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   output += 4;
+  STBIR_NO_UNROLL_LOOP_START
   while( output <= end_output )
   {
     stbir__simdf e0;
@@ -8158,9 +8358,10 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outpu
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( output < end_output )
   {
-    stbir__simdf e0; 
+    stbir__simdf e0;
     STBIR_NO_UNROLL(encode);
     stbir__simdf_madd1_mem( e0, STBIR__CONSTF(STBIR_simd_point5), STBIR__CONSTF(STBIR_max_uint8_as_float), encode+stbir__encode_order0 ); output[0] = stbir__simdf_convert_float_to_uint8( e0 );
     #if stbir__coder_min_num >= 2
@@ -8173,7 +8374,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outpu
     encode += stbir__coder_min_num;
   }
   #endif
-  
+
   #else
 
   // try to do blocks of 4 when you can
@@ -8194,6 +8395,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear_scaled )( void * outpu
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( output < end_output )
   {
     float f;
@@ -8223,6 +8425,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_linear)( float * decodep, int
   if ( width_times_channels >= 16 )
   {
     decode_end -= 16;
+    STBIR_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       #ifdef STBIR_SIMD8
@@ -8258,7 +8461,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_linear)( float * decodep, int
 #endif
       decode += 16;
       input += 16;
-      if ( decode <= decode_end ) 
+      if ( decode <= decode_end )
         continue;
       if ( decode == ( decode_end + 16 ) )
         break;
@@ -8272,6 +8475,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_linear)( float * decodep, int
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   decode += 4;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while( decode <= decode_end )
   {
     STBIR_SIMD_NO_UNROLL(decode);
@@ -8287,6 +8491,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_linear)( float * decodep, int
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( decode < decode_end )
   {
     STBIR_NO_UNROLL(decode);
@@ -8313,6 +8518,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear )( void * outputp, int
   {
     float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
     end_output -= stbir__simdfX_float_count*2;
+    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       stbir__simdfX e0, e1;
@@ -8323,15 +8529,15 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear )( void * outputp, int
       stbir__encode_simdfX_unflip( e0 );
       stbir__encode_simdfX_unflip( e1 );
       #ifdef STBIR_SIMD8
-      stbir__simdf8_pack_to_16bytes( i, e0, e1 ); 
+      stbir__simdf8_pack_to_16bytes( i, e0, e1 );
       stbir__simdi_store( output, i );
       #else
-      stbir__simdf_pack_to_8bytes( i, e0, e1 ); 
+      stbir__simdf_pack_to_8bytes( i, e0, e1 );
       stbir__simdi_store2( output, i );
       #endif
       encode += stbir__simdfX_float_count*2;
       output += stbir__simdfX_float_count*2;
-      if ( output <= end_output ) 
+      if ( output <= end_output )
         continue;
       if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
         break;
@@ -8344,6 +8550,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear )( void * outputp, int
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   output += 4;
+  STBIR_NO_UNROLL_LOOP_START
   while( output <= end_output )
   {
     stbir__simdf e0;
@@ -8382,6 +8589,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_linear )( void * outputp, int
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( output < end_output )
   {
     float f;
@@ -8422,6 +8630,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_srgb)( float * decodep, int wi
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( decode < decode_end )
   {
     STBIR_NO_UNROLL(decode);
@@ -8441,7 +8650,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_srgb)( float * decodep, int wi
 #define stbir__min_max_shift20( i, f ) \
     stbir__simdf_max( f, f, stbir_simdf_casti(STBIR__CONSTI( STBIR_almost_zero )) ); \
     stbir__simdf_min( f, f, stbir_simdf_casti(STBIR__CONSTI( STBIR_almost_one  )) ); \
-    stbir__simdi_32shr( i, stbir_simdi_castf( f ), 20 ); 
+    stbir__simdi_32shr( i, stbir_simdi_castf( f ), 20 );
 
 #define stbir__scale_and_convert( i, f ) \
     stbir__simdf_madd( f, STBIR__CONSTF( STBIR_simd_point5 ), STBIR__CONSTF( STBIR_max_uint8_as_float ), f ); \
@@ -8468,7 +8677,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_srgb)( float * decodep, int wi
   temp1.m128i_u32[0] = table[temp1.m128i_i32[0]]; temp1.m128i_u32[1] = table[temp1.m128i_i32[1]]; temp1.m128i_u32[2] = table[temp1.m128i_i32[2]]; temp1.m128i_u32[3] = table[temp1.m128i_i32[3]]; \
   v0 = temp0.m128i_i128; \
   v1 = temp1.m128i_i128; \
-} 
+}
 
 #define stbir__simdi_table_lookup3( v0,v1,v2, table ) \
 { \
@@ -8499,7 +8708,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_srgb)( float * decodep, int wi
   v1 = temp1.m128i_i128; \
   v2 = temp2.m128i_i128; \
   v3 = temp3.m128i_i128; \
-} 
+}
 
 static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int width_times_channels, float const * encode )
 {
@@ -8507,16 +8716,16 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int w
   unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels;
 
   #ifdef STBIR_SIMD
-  stbir_uint32 const * to_srgb = fp32_to_srgb8_tab4 - (127-13)*8;
 
   if ( width_times_channels >= 16 )
   {
     float const * end_encode_m16 = encode + width_times_channels - 16;
     end_output -= 16;
+    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       stbir__simdf f0, f1, f2, f3;
-      stbir__simdi i0, i1, i2, i3; 
+      stbir__simdi i0, i1, i2, i3;
       STBIR_SIMD_NO_UNROLL(encode);
 
       stbir__simdf_load4_transposed( f0, f1, f2, f3, encode );
@@ -8525,9 +8734,9 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int w
       stbir__min_max_shift20( i1, f1 );
       stbir__min_max_shift20( i2, f2 );
       stbir__min_max_shift20( i3, f3 );
-      
-      stbir__simdi_table_lookup4( i0, i1, i2, i3, to_srgb );
-     
+
+      stbir__simdi_table_lookup4( i0, i1, i2, i3, ( fp32_to_srgb8_tab4 - (127-13)*8 ) );
+
       stbir__linear_to_srgb_finish( i0, f0 );
       stbir__linear_to_srgb_finish( i1, f1 );
       stbir__linear_to_srgb_finish( i2, f2 );
@@ -8537,7 +8746,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int w
 
       encode += 16;
       output += 16;
-      if ( output <= end_output ) 
+      if ( output <= end_output )
         continue;
       if ( output == ( end_output + 16 ) )
         break;
@@ -8551,6 +8760,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int w
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   output += 4;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while ( output <= end_output )
   {
     STBIR_SIMD_NO_UNROLL(encode);
@@ -8568,7 +8778,8 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb )( void * outputp, int w
 
   // do the remnants
   #if stbir__coder_min_num < 4
-  while( output < end_output ) 
+  STBIR_NO_UNROLL_LOOP_START
+  while( output < end_output )
   {
     STBIR_NO_UNROLL(encode);
     output[0] = stbir__linear_to_srgb_uchar( encode[stbir__encode_order0] );
@@ -8608,12 +8819,12 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb4_linearalpha )( void * o
   unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels;
 
   #ifdef STBIR_SIMD
-  stbir_uint32 const * to_srgb = fp32_to_srgb8_tab4 - (127-13)*8;
 
   if ( width_times_channels >= 16 )
   {
     float const * end_encode_m16 = encode + width_times_channels - 16;
     end_output -= 16;
+    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       stbir__simdf f0, f1, f2, f3;
@@ -8625,10 +8836,10 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb4_linearalpha )( void * o
       stbir__min_max_shift20( i0, f0 );
       stbir__min_max_shift20( i1, f1 );
       stbir__min_max_shift20( i2, f2 );
-      stbir__scale_and_convert( i3, f3 ); 
-      
-      stbir__simdi_table_lookup3( i0, i1, i2, to_srgb );
-     
+      stbir__scale_and_convert( i3, f3 );
+
+      stbir__simdi_table_lookup3( i0, i1, i2, ( fp32_to_srgb8_tab4 - (127-13)*8 ) );
+
       stbir__linear_to_srgb_finish( i0, f0 );
       stbir__linear_to_srgb_finish( i1, f1 );
       stbir__linear_to_srgb_finish( i2, f2 );
@@ -8638,7 +8849,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb4_linearalpha )( void * o
       output += 16;
       encode += 16;
 
-      if ( output <= end_output ) 
+      if ( output <= end_output )
         continue;
       if ( output == ( end_output + 16 ) )
         break;
@@ -8649,9 +8860,10 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb4_linearalpha )( void * o
   }
   #endif
 
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
     float f;
-    STBIR_SIMD_NO_UNROLL(encode);                                        
+    STBIR_SIMD_NO_UNROLL(encode);
 
     output[stbir__decode_order0] = stbir__linear_to_srgb_uchar( encode[0] );
     output[stbir__decode_order1] = stbir__linear_to_srgb_uchar( encode[1] );
@@ -8686,7 +8898,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint8_srgb2_linearalpha)( float * de
     decode += 4;
   }
   decode -= 4;
-  if( decode < decode_end ) 
+  if( decode < decode_end )
   {
     decode[0] = stbir__srgb_uchar_to_linear_float[ stbir__decode_order0 ];
     decode[1] = ( (float) input[stbir__decode_order1] ) * stbir__max_uint8_as_float_inverted;
@@ -8699,16 +8911,16 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb2_linearalpha )( void * o
   unsigned char * end_output = ( (unsigned char*) output ) + width_times_channels;
 
   #ifdef STBIR_SIMD
-  stbir_uint32 const * to_srgb = fp32_to_srgb8_tab4 - (127-13)*8;
 
   if ( width_times_channels >= 16 )
   {
     float const * end_encode_m16 = encode + width_times_channels - 16;
     end_output -= 16;
+    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       stbir__simdf f0, f1, f2, f3;
-      stbir__simdi i0, i1, i2, i3; 
+      stbir__simdi i0, i1, i2, i3;
 
       STBIR_SIMD_NO_UNROLL(encode);
       stbir__simdf_load4_transposed( f0, f1, f2, f3, encode );
@@ -8717,9 +8929,9 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb2_linearalpha )( void * o
       stbir__scale_and_convert( i1, f1 );
       stbir__min_max_shift20( i2, f2 );
       stbir__scale_and_convert( i3, f3 );
-      
-      stbir__simdi_table_lookup2( i0, i2, to_srgb );
-     
+
+      stbir__simdi_table_lookup2( i0, i2, ( fp32_to_srgb8_tab4 - (127-13)*8 ) );
+
       stbir__linear_to_srgb_finish( i0, f0 );
       stbir__linear_to_srgb_finish( i2, f2 );
 
@@ -8727,7 +8939,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb2_linearalpha )( void * o
 
       output += 16;
       encode += 16;
-      if ( output <= end_output ) 
+      if ( output <= end_output )
         continue;
       if ( output == ( end_output + 16 ) )
         break;
@@ -8738,6 +8950,7 @@ static void STBIR__CODER_NAME( stbir__encode_uint8_srgb2_linearalpha )( void * o
   }
   #endif
 
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
     float f;
     STBIR_SIMD_NO_UNROLL(encode);
@@ -8766,6 +8979,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)( float * decod
   if ( width_times_channels >= 8 )
   {
     decode_end -= 8;
+    STBIR_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       #ifdef STBIR_SIMD8
@@ -8793,9 +9007,9 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)( float * decod
       stbir__simdf_store( decode + 0,  of0 );
       stbir__simdf_store( decode + 4,  of1 );
       #endif
-      decode += 8;  
+      decode += 8;
       input += 8;
-      if ( decode <= decode_end ) 
+      if ( decode <= decode_end )
         continue;
       if ( decode == ( decode_end + 8 ) )
         break;
@@ -8809,6 +9023,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)( float * decod
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   decode += 4;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while( decode <= decode_end )
   {
     STBIR_SIMD_NO_UNROLL(decode);
@@ -8824,6 +9039,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear_scaled)( float * decod
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( decode < decode_end )
   {
     STBIR_NO_UNROLL(decode);
@@ -8852,6 +9068,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * output
     {
       float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
       end_output -= stbir__simdfX_float_count*2;
+      STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
       for(;;)
       {
         stbir__simdfX e0, e1;
@@ -8865,7 +9082,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * output
         stbir__simdiX_store( output, i );
         encode += stbir__simdfX_float_count*2;
         output += stbir__simdfX_float_count*2;
-        if ( output <= end_output ) 
+        if ( output <= end_output )
           continue;
         if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
           break;
@@ -8879,6 +9096,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * output
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   output += 4;
+  STBIR_NO_UNROLL_LOOP_START
   while( output <= end_output )
   {
     stbir__simdf e;
@@ -8897,6 +9115,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * output
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( output < end_output )
   {
     stbir__simdf e;
@@ -8912,12 +9131,13 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * output
     encode += stbir__coder_min_num;
   }
   #endif
-  
+
   #else
 
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   output += 4;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while( output <= end_output )
   {
     float f;
@@ -8934,6 +9154,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear_scaled)( void * output
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( output < end_output )
   {
     float f;
@@ -8963,6 +9184,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear)( float * decodep, int
   if ( width_times_channels >= 8 )
   {
     decode_end -= 8;
+    STBIR_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       #ifdef STBIR_SIMD8
@@ -8989,7 +9211,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear)( float * decodep, int
       #endif
       decode += 8;
       input += 8;
-      if ( decode <= decode_end ) 
+      if ( decode <= decode_end )
         continue;
       if ( decode == ( decode_end + 8 ) )
         break;
@@ -9003,6 +9225,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear)( float * decodep, int
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   decode += 4;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while( decode <= decode_end )
   {
     STBIR_SIMD_NO_UNROLL(decode);
@@ -9018,6 +9241,7 @@ static void STBIR__CODER_NAME(stbir__decode_uint16_linear)( float * decodep, int
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( decode < decode_end )
   {
     STBIR_NO_UNROLL(decode);
@@ -9045,6 +9269,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int
     {
       float const * end_encode_m8 = encode + width_times_channels - stbir__simdfX_float_count*2;
       end_output -= stbir__simdfX_float_count*2;
+      STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
       for(;;)
       {
         stbir__simdfX e0, e1;
@@ -9058,7 +9283,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int
         stbir__simdiX_store( output, i );
         encode += stbir__simdfX_float_count*2;
         output += stbir__simdfX_float_count*2;
-        if ( output <= end_output ) 
+        if ( output <= end_output )
           continue;
         if ( output == ( end_output + stbir__simdfX_float_count*2 ) )
           break;
@@ -9072,6 +9297,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   output += 4;
+  STBIR_NO_UNROLL_LOOP_START
   while( output <= end_output )
   {
     stbir__simdf e;
@@ -9093,6 +9319,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int
   // try to do blocks of 4 when you can
   #if  stbir__coder_min_num != 3 // doesn't divide cleanly by four
   output += 4;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while( output <= end_output )
   {
     float f;
@@ -9111,6 +9338,7 @@ static void STBIR__CODER_NAME(stbir__encode_uint16_linear)( void * outputp, int
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( output < end_output )
   {
     float f;
@@ -9139,6 +9367,7 @@ static void STBIR__CODER_NAME(stbir__decode_half_float_linear)( float * decodep,
   {
     stbir__FP16 const * end_input_m8 = input + width_times_channels - 8;
     decode_end -= 8;
+    STBIR_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       STBIR_NO_UNROLL(decode);
@@ -9166,7 +9395,7 @@ static void STBIR__CODER_NAME(stbir__decode_half_float_linear)( float * decodep,
       #endif
       decode += 8;
       input += 8;
-      if ( decode <= decode_end ) 
+      if ( decode <= decode_end )
         continue;
       if ( decode == ( decode_end + 8 ) )
         break;
@@ -9180,6 +9409,7 @@ static void STBIR__CODER_NAME(stbir__decode_half_float_linear)( float * decodep,
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   decode += 4;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while( decode <= decode_end )
   {
     STBIR_SIMD_NO_UNROLL(decode);
@@ -9195,6 +9425,7 @@ static void STBIR__CODER_NAME(stbir__decode_half_float_linear)( float * decodep,
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( decode < decode_end )
   {
     STBIR_NO_UNROLL(decode);
@@ -9221,6 +9452,7 @@ static void STBIR__CODER_NAME( stbir__encode_half_float_linear )( void * outputp
   {
     float const * end_encode_m8 = encode + width_times_channels - 8;
     end_output -= 8;
+    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       STBIR_SIMD_NO_UNROLL(encode);
@@ -9247,7 +9479,7 @@ static void STBIR__CODER_NAME( stbir__encode_half_float_linear )( void * outputp
       #endif
       encode += 8;
       output += 8;
-      if ( output <= end_output ) 
+      if ( output <= end_output )
         continue;
       if ( output == ( end_output + 8 ) )
         break;
@@ -9261,6 +9493,7 @@ static void STBIR__CODER_NAME( stbir__encode_half_float_linear )( void * outputp
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   output += 4;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while( output <= end_output )
   {
     STBIR_SIMD_NO_UNROLL(output);
@@ -9276,6 +9509,7 @@ static void STBIR__CODER_NAME( stbir__encode_half_float_linear )( void * outputp
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( output < end_output )
   {
     STBIR_NO_UNROLL(output);
@@ -9304,6 +9538,7 @@ static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int
   {
     float const * end_input_m16 = input + width_times_channels - 16;
     decode_end -= 16;
+    STBIR_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       STBIR_NO_UNROLL(decode);
@@ -9338,7 +9573,7 @@ static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int
       #endif
       decode += 16;
       input += 16;
-      if ( decode <= decode_end ) 
+      if ( decode <= decode_end )
         continue;
       if ( decode == ( decode_end + 16 ) )
         break;
@@ -9352,6 +9587,7 @@ static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   decode += 4;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while( decode <= decode_end )
   {
     STBIR_SIMD_NO_UNROLL(decode);
@@ -9367,6 +9603,7 @@ static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( decode < decode_end )
   {
     STBIR_NO_UNROLL(decode);
@@ -9383,10 +9620,10 @@ static void STBIR__CODER_NAME(stbir__decode_float_linear)( float * decodep, int
   #endif
 
   #else
-  
+
   if ( (void*)decodep != inputp )
     STBIR_MEMCPY( decodep, inputp, width_times_channels * sizeof( float ) );
-  
+
   #endif
 }
 
@@ -9426,6 +9663,7 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int
   {
     float const * end_encode_m8 = encode + width_times_channels - ( stbir__simdfX_float_count * 2 );
     end_output -= ( stbir__simdfX_float_count * 2 );
+    STBIR_SIMD_NO_UNROLL_LOOP_START_INF_FOR
     for(;;)
     {
       stbir__simdfX e0, e1;
@@ -9435,18 +9673,18 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int
 #ifdef STBIR_FLOAT_HIGH_CLAMP
       stbir__simdfX_min( e0, e0, high_clamp );
       stbir__simdfX_min( e1, e1, high_clamp );
-#endif      
+#endif
 #ifdef STBIR_FLOAT_LOW_CLAMP
       stbir__simdfX_max( e0, e0, low_clamp );
       stbir__simdfX_max( e1, e1, low_clamp );
-#endif      
+#endif
       stbir__encode_simdfX_unflip( e0 );
       stbir__encode_simdfX_unflip( e1 );
       stbir__simdfX_store( output, e0 );
       stbir__simdfX_store( output+stbir__simdfX_float_count, e1 );
       encode += stbir__simdfX_float_count * 2;
       output += stbir__simdfX_float_count * 2;
-      if ( output < end_output ) 
+      if ( output < end_output )
         continue;
       if ( output == ( end_output + ( stbir__simdfX_float_count * 2 ) ) )
         break;
@@ -9459,6 +9697,7 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   output += 4;
+  STBIR_NO_UNROLL_LOOP_START
   while( output <= end_output )
   {
     stbir__simdf e0;
@@ -9466,10 +9705,10 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int
     stbir__simdf_load( e0, encode );
 #ifdef STBIR_FLOAT_HIGH_CLAMP
     stbir__simdf_min( e0, e0, high_clamp );
-#endif      
+#endif
 #ifdef STBIR_FLOAT_LOW_CLAMP
     stbir__simdf_max( e0, e0, low_clamp );
-#endif      
+#endif
     stbir__encode_simdf4_unflip( e0 );
     stbir__simdf_store( output-4, e0 );
     output += 4;
@@ -9483,6 +9722,7 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int
   // try to do blocks of 4 when you can
   #if stbir__coder_min_num != 3 // doesn't divide cleanly by four
   output += 4;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   while( output <= end_output )
   {
     float e;
@@ -9502,6 +9742,7 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int
 
   // do the remnants
   #if stbir__coder_min_num < 4
+  STBIR_NO_UNROLL_LOOP_START
   while( output < end_output )
   {
     float e;
@@ -9517,18 +9758,18 @@ static void STBIR__CODER_NAME( stbir__encode_float_linear )( void * outputp, int
     encode += stbir__coder_min_num;
   }
   #endif
-  
+
   #endif
 }
 
-#undef stbir__decode_suffix 
+#undef stbir__decode_suffix
 #undef stbir__decode_simdf8_flip
 #undef stbir__decode_simdf4_flip
-#undef stbir__decode_order0 
+#undef stbir__decode_order0
 #undef stbir__decode_order1
 #undef stbir__decode_order2
 #undef stbir__decode_order3
-#undef stbir__encode_order0 
+#undef stbir__encode_order0
 #undef stbir__encode_order1
 #undef stbir__encode_order2
 #undef stbir__encode_order3
@@ -9612,7 +9853,8 @@ static void STBIR_chans( stbir__vertical_scatter_with_,_coeffs)( float ** output
     stbIF5(stbir__simdfX c5 = stbir__simdf_frepX( c5s ); )
     stbIF6(stbir__simdfX c6 = stbir__simdf_frepX( c6s ); )
     stbIF7(stbir__simdfX c7 = stbir__simdf_frepX( c7s ); )
-    while ( ( (char*)input_end - (char*) input ) >= (16*stbir__simdfX_float_count) ) 
+    STBIR_SIMD_NO_UNROLL_LOOP_START
+    while ( ( (char*)input_end - (char*) input ) >= (16*stbir__simdfX_float_count) )
     {
       stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3;
       STBIR_SIMD_NO_UNROLL(output0);
@@ -9621,52 +9863,53 @@ static void STBIR_chans( stbir__vertical_scatter_with_,_coeffs)( float ** output
 
       #ifdef STB_IMAGE_RESIZE_VERTICAL_CONTINUE
       stbIF0( stbir__simdfX_load( o0, output0 );     stbir__simdfX_load( o1, output0+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output0+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output0+(3*stbir__simdfX_float_count) );
-              stbir__simdfX_madd( o0, o0, r0, c0 );  stbir__simdfX_madd( o1, o1, r1, c0 );  stbir__simdfX_madd( o2, o2, r2, c0 );   stbir__simdfX_madd( o3, o3, r3, c0 );           
+              stbir__simdfX_madd( o0, o0, r0, c0 );  stbir__simdfX_madd( o1, o1, r1, c0 );  stbir__simdfX_madd( o2, o2, r2, c0 );   stbir__simdfX_madd( o3, o3, r3, c0 );
               stbir__simdfX_store( output0, o0 );    stbir__simdfX_store( output0+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output0+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output0+(3*stbir__simdfX_float_count), o3 ); )
       stbIF1( stbir__simdfX_load( o0, output1 );     stbir__simdfX_load( o1, output1+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output1+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output1+(3*stbir__simdfX_float_count) );
-              stbir__simdfX_madd( o0, o0, r0, c1 );  stbir__simdfX_madd( o1, o1, r1, c1 );  stbir__simdfX_madd( o2, o2, r2, c1 );   stbir__simdfX_madd( o3, o3, r3, c1 );             
+              stbir__simdfX_madd( o0, o0, r0, c1 );  stbir__simdfX_madd( o1, o1, r1, c1 );  stbir__simdfX_madd( o2, o2, r2, c1 );   stbir__simdfX_madd( o3, o3, r3, c1 );
               stbir__simdfX_store( output1, o0 );    stbir__simdfX_store( output1+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output1+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output1+(3*stbir__simdfX_float_count), o3 ); )
       stbIF2( stbir__simdfX_load( o0, output2 );     stbir__simdfX_load( o1, output2+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output2+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output2+(3*stbir__simdfX_float_count) );
-              stbir__simdfX_madd( o0, o0, r0, c2 );  stbir__simdfX_madd( o1, o1, r1, c2 );  stbir__simdfX_madd( o2, o2, r2, c2 );   stbir__simdfX_madd( o3, o3, r3, c2 );             
+              stbir__simdfX_madd( o0, o0, r0, c2 );  stbir__simdfX_madd( o1, o1, r1, c2 );  stbir__simdfX_madd( o2, o2, r2, c2 );   stbir__simdfX_madd( o3, o3, r3, c2 );
               stbir__simdfX_store( output2, o0 );    stbir__simdfX_store( output2+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output2+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output2+(3*stbir__simdfX_float_count), o3 ); )
       stbIF3( stbir__simdfX_load( o0, output3 );     stbir__simdfX_load( o1, output3+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output3+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output3+(3*stbir__simdfX_float_count) );
-              stbir__simdfX_madd( o0, o0, r0, c3 );  stbir__simdfX_madd( o1, o1, r1, c3 );  stbir__simdfX_madd( o2, o2, r2, c3 );   stbir__simdfX_madd( o3, o3, r3, c3 );             
+              stbir__simdfX_madd( o0, o0, r0, c3 );  stbir__simdfX_madd( o1, o1, r1, c3 );  stbir__simdfX_madd( o2, o2, r2, c3 );   stbir__simdfX_madd( o3, o3, r3, c3 );
               stbir__simdfX_store( output3, o0 );    stbir__simdfX_store( output3+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output3+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output3+(3*stbir__simdfX_float_count), o3 ); )
       stbIF4( stbir__simdfX_load( o0, output4 );     stbir__simdfX_load( o1, output4+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output4+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output4+(3*stbir__simdfX_float_count) );
-              stbir__simdfX_madd( o0, o0, r0, c4 );  stbir__simdfX_madd( o1, o1, r1, c4 );  stbir__simdfX_madd( o2, o2, r2, c4 );   stbir__simdfX_madd( o3, o3, r3, c4 );             
+              stbir__simdfX_madd( o0, o0, r0, c4 );  stbir__simdfX_madd( o1, o1, r1, c4 );  stbir__simdfX_madd( o2, o2, r2, c4 );   stbir__simdfX_madd( o3, o3, r3, c4 );
               stbir__simdfX_store( output4, o0 );    stbir__simdfX_store( output4+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output4+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output4+(3*stbir__simdfX_float_count), o3 ); )
       stbIF5( stbir__simdfX_load( o0, output5 );     stbir__simdfX_load( o1, output5+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output5+(2*stbir__simdfX_float_count));    stbir__simdfX_load( o3, output5+(3*stbir__simdfX_float_count) );
-              stbir__simdfX_madd( o0, o0, r0, c5 );  stbir__simdfX_madd( o1, o1, r1, c5 );  stbir__simdfX_madd( o2, o2, r2, c5 );   stbir__simdfX_madd( o3, o3, r3, c5 );             
+              stbir__simdfX_madd( o0, o0, r0, c5 );  stbir__simdfX_madd( o1, o1, r1, c5 );  stbir__simdfX_madd( o2, o2, r2, c5 );   stbir__simdfX_madd( o3, o3, r3, c5 );
               stbir__simdfX_store( output5, o0 );    stbir__simdfX_store( output5+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output5+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output5+(3*stbir__simdfX_float_count), o3 ); )
       stbIF6( stbir__simdfX_load( o0, output6 );     stbir__simdfX_load( o1, output6+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output6+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output6+(3*stbir__simdfX_float_count) );
-              stbir__simdfX_madd( o0, o0, r0, c6 );  stbir__simdfX_madd( o1, o1, r1, c6 );  stbir__simdfX_madd( o2, o2, r2, c6 );   stbir__simdfX_madd( o3, o3, r3, c6 );             
+              stbir__simdfX_madd( o0, o0, r0, c6 );  stbir__simdfX_madd( o1, o1, r1, c6 );  stbir__simdfX_madd( o2, o2, r2, c6 );   stbir__simdfX_madd( o3, o3, r3, c6 );
               stbir__simdfX_store( output6, o0 );    stbir__simdfX_store( output6+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output6+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output6+(3*stbir__simdfX_float_count), o3 ); )
       stbIF7( stbir__simdfX_load( o0, output7 );     stbir__simdfX_load( o1, output7+stbir__simdfX_float_count );   stbir__simdfX_load( o2, output7+(2*stbir__simdfX_float_count) );    stbir__simdfX_load( o3, output7+(3*stbir__simdfX_float_count) );
-              stbir__simdfX_madd( o0, o0, r0, c7 );  stbir__simdfX_madd( o1, o1, r1, c7 );  stbir__simdfX_madd( o2, o2, r2, c7 );   stbir__simdfX_madd( o3, o3, r3, c7 );             
+              stbir__simdfX_madd( o0, o0, r0, c7 );  stbir__simdfX_madd( o1, o1, r1, c7 );  stbir__simdfX_madd( o2, o2, r2, c7 );   stbir__simdfX_madd( o3, o3, r3, c7 );
               stbir__simdfX_store( output7, o0 );    stbir__simdfX_store( output7+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output7+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output7+(3*stbir__simdfX_float_count), o3 ); )
       #else
-      stbIF0( stbir__simdfX_mult( o0, r0, c0 );      stbir__simdfX_mult( o1, r1, c0 );      stbir__simdfX_mult( o2, r2, c0 );       stbir__simdfX_mult( o3, r3, c0 );  
+      stbIF0( stbir__simdfX_mult( o0, r0, c0 );      stbir__simdfX_mult( o1, r1, c0 );      stbir__simdfX_mult( o2, r2, c0 );       stbir__simdfX_mult( o3, r3, c0 );
               stbir__simdfX_store( output0, o0 );    stbir__simdfX_store( output0+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output0+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output0+(3*stbir__simdfX_float_count), o3 ); )
-      stbIF1( stbir__simdfX_mult( o0, r0, c1 );      stbir__simdfX_mult( o1, r1, c1 );      stbir__simdfX_mult( o2, r2, c1 );       stbir__simdfX_mult( o3, r3, c1 );  
+      stbIF1( stbir__simdfX_mult( o0, r0, c1 );      stbir__simdfX_mult( o1, r1, c1 );      stbir__simdfX_mult( o2, r2, c1 );       stbir__simdfX_mult( o3, r3, c1 );
               stbir__simdfX_store( output1, o0 );    stbir__simdfX_store( output1+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output1+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output1+(3*stbir__simdfX_float_count), o3 ); )
-      stbIF2( stbir__simdfX_mult( o0, r0, c2 );      stbir__simdfX_mult( o1, r1, c2 );      stbir__simdfX_mult( o2, r2, c2 );       stbir__simdfX_mult( o3, r3, c2 );  
+      stbIF2( stbir__simdfX_mult( o0, r0, c2 );      stbir__simdfX_mult( o1, r1, c2 );      stbir__simdfX_mult( o2, r2, c2 );       stbir__simdfX_mult( o3, r3, c2 );
               stbir__simdfX_store( output2, o0 );    stbir__simdfX_store( output2+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output2+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output2+(3*stbir__simdfX_float_count), o3 ); )
-      stbIF3( stbir__simdfX_mult( o0, r0, c3 );      stbir__simdfX_mult( o1, r1, c3 );      stbir__simdfX_mult( o2, r2, c3 );       stbir__simdfX_mult( o3, r3, c3 );  
+      stbIF3( stbir__simdfX_mult( o0, r0, c3 );      stbir__simdfX_mult( o1, r1, c3 );      stbir__simdfX_mult( o2, r2, c3 );       stbir__simdfX_mult( o3, r3, c3 );
               stbir__simdfX_store( output3, o0 );    stbir__simdfX_store( output3+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output3+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output3+(3*stbir__simdfX_float_count), o3 ); )
-      stbIF4( stbir__simdfX_mult( o0, r0, c4 );      stbir__simdfX_mult( o1, r1, c4 );      stbir__simdfX_mult( o2, r2, c4 );       stbir__simdfX_mult( o3, r3, c4 );  
+      stbIF4( stbir__simdfX_mult( o0, r0, c4 );      stbir__simdfX_mult( o1, r1, c4 );      stbir__simdfX_mult( o2, r2, c4 );       stbir__simdfX_mult( o3, r3, c4 );
               stbir__simdfX_store( output4, o0 );    stbir__simdfX_store( output4+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output4+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output4+(3*stbir__simdfX_float_count), o3 ); )
-      stbIF5( stbir__simdfX_mult( o0, r0, c5 );      stbir__simdfX_mult( o1, r1, c5 );      stbir__simdfX_mult( o2, r2, c5 );       stbir__simdfX_mult( o3, r3, c5 );  
+      stbIF5( stbir__simdfX_mult( o0, r0, c5 );      stbir__simdfX_mult( o1, r1, c5 );      stbir__simdfX_mult( o2, r2, c5 );       stbir__simdfX_mult( o3, r3, c5 );
               stbir__simdfX_store( output5, o0 );    stbir__simdfX_store( output5+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output5+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output5+(3*stbir__simdfX_float_count), o3 ); )
-      stbIF6( stbir__simdfX_mult( o0, r0, c6 );      stbir__simdfX_mult( o1, r1, c6 );      stbir__simdfX_mult( o2, r2, c6 );       stbir__simdfX_mult( o3, r3, c6 );  
+      stbIF6( stbir__simdfX_mult( o0, r0, c6 );      stbir__simdfX_mult( o1, r1, c6 );      stbir__simdfX_mult( o2, r2, c6 );       stbir__simdfX_mult( o3, r3, c6 );
               stbir__simdfX_store( output6, o0 );    stbir__simdfX_store( output6+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output6+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output6+(3*stbir__simdfX_float_count), o3 ); )
-      stbIF7( stbir__simdfX_mult( o0, r0, c7 );      stbir__simdfX_mult( o1, r1, c7 );      stbir__simdfX_mult( o2, r2, c7 );       stbir__simdfX_mult( o3, r3, c7 );  
+      stbIF7( stbir__simdfX_mult( o0, r0, c7 );      stbir__simdfX_mult( o1, r1, c7 );      stbir__simdfX_mult( o2, r2, c7 );       stbir__simdfX_mult( o3, r3, c7 );
               stbir__simdfX_store( output7, o0 );    stbir__simdfX_store( output7+stbir__simdfX_float_count, o1 );  stbir__simdfX_store( output7+(2*stbir__simdfX_float_count), o2 );   stbir__simdfX_store( output7+(3*stbir__simdfX_float_count), o3 ); )
       #endif
 
       input += (4*stbir__simdfX_float_count);
       stbIF0( output0 += (4*stbir__simdfX_float_count); ) stbIF1( output1 += (4*stbir__simdfX_float_count); ) stbIF2( output2 += (4*stbir__simdfX_float_count); ) stbIF3( output3 += (4*stbir__simdfX_float_count); ) stbIF4( output4 += (4*stbir__simdfX_float_count); ) stbIF5( output5 += (4*stbir__simdfX_float_count); ) stbIF6( output6 += (4*stbir__simdfX_float_count); ) stbIF7( output7 += (4*stbir__simdfX_float_count); )
     }
-    while ( ( (char*)input_end - (char*) input ) >= 16 ) 
+    STBIR_SIMD_NO_UNROLL_LOOP_START
+    while ( ( (char*)input_end - (char*) input ) >= 16 )
     {
       stbir__simdf o0, r0;
       STBIR_SIMD_NO_UNROLL(output0);
@@ -9692,13 +9935,14 @@ static void STBIR_chans( stbir__vertical_scatter_with_,_coeffs)( float ** output
       stbIF6( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c6 ) );   stbir__simdf_store( output6, o0 ); )
       stbIF7( stbir__simdf_mult( o0, r0, stbir__if_simdf8_cast_to_simdf4( c7 ) );   stbir__simdf_store( output7, o0 ); )
       #endif
-      
+
       input += 4;
       stbIF0( output0 += 4; ) stbIF1( output1 += 4; ) stbIF2( output2 += 4; ) stbIF3( output3 += 4; ) stbIF4( output4 += 4; ) stbIF5( output5 += 4; ) stbIF6( output6 += 4; ) stbIF7( output7 += 4; )
     }
   }
   #else
-  while ( ( (char*)input_end - (char*) input ) >= 16 ) 
+  STBIR_NO_UNROLL_LOOP_START
+  while ( ( (char*)input_end - (char*) input ) >= 16 )
   {
     float r0, r1, r2, r3;
     STBIR_NO_UNROLL(input);
@@ -9729,7 +9973,8 @@ static void STBIR_chans( stbir__vertical_scatter_with_,_coeffs)( float ** output
     stbIF0( output0 += 4; ) stbIF1( output1 += 4; ) stbIF2( output2 += 4; ) stbIF3( output3 += 4; ) stbIF4( output4 += 4; ) stbIF5( output5 += 4; ) stbIF6( output6 += 4; ) stbIF7( output7 += 4; )
   }
   #endif
-  while ( input < input_end ) 
+  STBIR_NO_UNROLL_LOOP_START
+  while ( input < input_end )
   {
     float r = input[0];
     STBIR_NO_UNROLL(output0);
@@ -9779,7 +10024,7 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp,
     STBIR_MEMCPY( output, input0, (char*)input0_end - (char*)input0 );
     return;
   }
-#endif  
+#endif
 
   #ifdef STBIR_SIMD
   {
@@ -9791,14 +10036,15 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp,
     stbIF5(stbir__simdfX c5 = stbir__simdf_frepX( c5s ); )
     stbIF6(stbir__simdfX c6 = stbir__simdf_frepX( c6s ); )
     stbIF7(stbir__simdfX c7 = stbir__simdf_frepX( c7s ); )
-    
-    while ( ( (char*)input0_end - (char*) input0 ) >= (16*stbir__simdfX_float_count) ) 
+
+    STBIR_SIMD_NO_UNROLL_LOOP_START
+    while ( ( (char*)input0_end - (char*) input0 ) >= (16*stbir__simdfX_float_count) )
     {
       stbir__simdfX o0, o1, o2, o3, r0, r1, r2, r3;
       STBIR_SIMD_NO_UNROLL(output);
 
       // prefetch four loop iterations ahead (doesn't affect much for small resizes, but helps with big ones)
-      stbIF0( stbir__prefetch( input0 + (16*stbir__simdfX_float_count) ); ) 
+      stbIF0( stbir__prefetch( input0 + (16*stbir__simdfX_float_count) ); )
       stbIF1( stbir__prefetch( input1 + (16*stbir__simdfX_float_count) ); )
       stbIF2( stbir__prefetch( input2 + (16*stbir__simdfX_float_count) ); )
       stbIF3( stbir__prefetch( input3 + (16*stbir__simdfX_float_count) ); )
@@ -9836,7 +10082,8 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp,
       stbIF0( input0 += (4*stbir__simdfX_float_count); ) stbIF1( input1 += (4*stbir__simdfX_float_count); ) stbIF2( input2 += (4*stbir__simdfX_float_count); ) stbIF3( input3 += (4*stbir__simdfX_float_count); ) stbIF4( input4 += (4*stbir__simdfX_float_count); ) stbIF5( input5 += (4*stbir__simdfX_float_count); ) stbIF6( input6 += (4*stbir__simdfX_float_count); ) stbIF7( input7 += (4*stbir__simdfX_float_count); )
     }
 
-    while ( ( (char*)input0_end - (char*) input0 ) >= 16 ) 
+    STBIR_SIMD_NO_UNROLL_LOOP_START
+    while ( ( (char*)input0_end - (char*) input0 ) >= 16 )
     {
       stbir__simdf o0, r0;
       STBIR_SIMD_NO_UNROLL(output);
@@ -9860,7 +10107,8 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp,
     }
   }
   #else
-  while ( ( (char*)input0_end - (char*) input0 ) >= 16 ) 
+  STBIR_NO_UNROLL_LOOP_START
+  while ( ( (char*)input0_end - (char*) input0 ) >= 16 )
   {
     float o0, o1, o2, o3;
     STBIR_NO_UNROLL(output);
@@ -9881,7 +10129,8 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp,
     stbIF0( input0 += 4; ) stbIF1( input1 += 4; ) stbIF2( input2 += 4; ) stbIF3( input3 += 4; ) stbIF4( input4 += 4; ) stbIF5( input5 += 4; ) stbIF6( input6 += 4; ) stbIF7( input7 += 4; )
   }
   #endif
-  while ( input0 < input0_end ) 
+  STBIR_NO_UNROLL_LOOP_START
+  while ( input0 < input0_end )
   {
     float o0;
     STBIR_NO_UNROLL(output);
@@ -9897,7 +10146,7 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp,
     stbIF5( o0 += input5[0] * c5s; )
     stbIF6( o0 += input6[0] * c6s; )
     stbIF7( o0 += input7[0] * c7s; )
-    output[0] = o0; 
+    output[0] = o0;
     ++output;
     stbIF0( ++input0; ) stbIF1( ++input1; ) stbIF2( ++input2; ) stbIF3( ++input3; ) stbIF4( ++input4; ) stbIF5( ++input5; ) stbIF6( ++input6; ) stbIF7( ++input7; )
   }
@@ -9928,25 +10177,25 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp,
 #ifndef stbir__2_coeff_only
 #define stbir__2_coeff_only()             \
     stbir__1_coeff_only();                \
-    stbir__1_coeff_remnant(1);            
+    stbir__1_coeff_remnant(1);
 #endif
 
 #ifndef stbir__2_coeff_remnant
 #define stbir__2_coeff_remnant( ofs )     \
     stbir__1_coeff_remnant(ofs);          \
-    stbir__1_coeff_remnant((ofs)+1);      
+    stbir__1_coeff_remnant((ofs)+1);
 #endif
-    
+
 #ifndef stbir__3_coeff_only
 #define stbir__3_coeff_only()             \
     stbir__2_coeff_only();                \
-    stbir__1_coeff_remnant(2);            
+    stbir__1_coeff_remnant(2);
 #endif
-    
+
 #ifndef stbir__3_coeff_remnant
 #define stbir__3_coeff_remnant( ofs )     \
     stbir__2_coeff_remnant(ofs);          \
-    stbir__1_coeff_remnant((ofs)+2);      
+    stbir__1_coeff_remnant((ofs)+2);
 #endif
 
 #ifndef stbir__3_coeff_setup
@@ -9956,13 +10205,13 @@ static void STBIR_chans( stbir__vertical_gather_with_,_coeffs)( float * outputp,
 #ifndef stbir__4_coeff_start
 #define stbir__4_coeff_start()            \
     stbir__2_coeff_only();                \
-    stbir__2_coeff_remnant(2);            
+    stbir__2_coeff_remnant(2);
 #endif
-    
+
 #ifndef stbir__4_coeff_continue_from_4
 #define stbir__4_coeff_continue_from_4( ofs )     \
     stbir__2_coeff_remnant(ofs);                  \
-    stbir__2_coeff_remnant((ofs)+2);      
+    stbir__2_coeff_remnant((ofs)+2);
 #endif
 
 #ifndef stbir__store_output_tiny
@@ -9973,8 +10222,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_1_coeff)( floa
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
     float const * hc = horizontal_coefficients;
     stbir__1_coeff_only();
     stbir__store_output_tiny();
@@ -9985,8 +10235,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_2_coeffs)( flo
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
     float const * hc = horizontal_coefficients;
     stbir__2_coeff_only();
     stbir__store_output_tiny();
@@ -9997,8 +10248,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_3_coeffs)( flo
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
     float const * hc = horizontal_coefficients;
     stbir__3_coeff_only();
     stbir__store_output_tiny();
@@ -10009,8 +10261,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_4_coeffs)( flo
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
     float const * hc = horizontal_coefficients;
     stbir__4_coeff_start();
     stbir__store_output();
@@ -10021,8 +10274,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_5_coeffs)( flo
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
     float const * hc = horizontal_coefficients;
     stbir__4_coeff_start();
     stbir__1_coeff_remnant(4);
@@ -10034,8 +10288,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_6_coeffs)( flo
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
     float const * hc = horizontal_coefficients;
     stbir__4_coeff_start();
     stbir__2_coeff_remnant(4);
@@ -10048,10 +10303,11 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_7_coeffs)( flo
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
   stbir__3_coeff_setup();
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
     float const * hc = horizontal_coefficients;
-  
+
     stbir__4_coeff_start();
     stbir__3_coeff_remnant(4);
     stbir__store_output();
@@ -10062,8 +10318,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_8_coeffs)( flo
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
     float const * hc = horizontal_coefficients;
     stbir__4_coeff_start();
     stbir__4_coeff_continue_from_4(4);
@@ -10075,8 +10332,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_9_coeffs)( flo
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
     float const * hc = horizontal_coefficients;
     stbir__4_coeff_start();
     stbir__4_coeff_continue_from_4(4);
@@ -10089,8 +10347,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_10_coeffs)( fl
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
     float const * hc = horizontal_coefficients;
     stbir__4_coeff_start();
     stbir__4_coeff_continue_from_4(4);
@@ -10104,8 +10363,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_11_coeffs)( fl
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
   stbir__3_coeff_setup();
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
     float const * hc = horizontal_coefficients;
     stbir__4_coeff_start();
     stbir__4_coeff_continue_from_4(4);
@@ -10118,8 +10378,9 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_12_coeffs)( fl
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
     float const * hc = horizontal_coefficients;
     stbir__4_coeff_start();
     stbir__4_coeff_continue_from_4(4);
@@ -10132,12 +10393,14 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod0
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
-    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 4 + 3 ) >> 2; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
+    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 4 + 3 ) >> 2;
     float const * hc = horizontal_coefficients;
 
     stbir__4_coeff_start();
+    STBIR_SIMD_NO_UNROLL_LOOP_START
     do {
       hc += 4;
       decode += STBIR__horizontal_channels * 4;
@@ -10152,19 +10415,21 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod1
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
-    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 5 + 3 ) >> 2; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
+    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 5 + 3 ) >> 2;
     float const * hc = horizontal_coefficients;
 
     stbir__4_coeff_start();
+    STBIR_SIMD_NO_UNROLL_LOOP_START
     do {
       hc += 4;
       decode += STBIR__horizontal_channels * 4;
       stbir__4_coeff_continue_from_4( 0 );
       --n;
     } while ( n > 0 );
-    stbir__1_coeff_remnant( 4 ); 
+    stbir__1_coeff_remnant( 4 );
     stbir__store_output();
   } while ( output < output_end );
 }
@@ -10173,19 +10438,21 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod2
 {
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
-    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 6 + 3 ) >> 2; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
+    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 6 + 3 ) >> 2;
     float const * hc = horizontal_coefficients;
 
     stbir__4_coeff_start();
+    STBIR_SIMD_NO_UNROLL_LOOP_START
     do {
       hc += 4;
       decode += STBIR__horizontal_channels * 4;
       stbir__4_coeff_continue_from_4( 0 );
       --n;
     } while ( n > 0 );
-    stbir__2_coeff_remnant( 4 ); 
+    stbir__2_coeff_remnant( 4 );
 
     stbir__store_output();
   } while ( output < output_end );
@@ -10196,19 +10463,21 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod3
   float const * output_end = output_buffer + output_sub_size * STBIR__horizontal_channels;
   float STBIR_SIMD_STREAMOUT_PTR( * ) output = output_buffer;
   stbir__3_coeff_setup();
+  STBIR_SIMD_NO_UNROLL_LOOP_START
   do {
-    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels; 
-    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 7 + 3 ) >> 2; 
+    float const * decode = decode_buffer + horizontal_contributors->n0 * STBIR__horizontal_channels;
+    int n = ( ( horizontal_contributors->n1 - horizontal_contributors->n0 + 1 ) - 7 + 3 ) >> 2;
     float const * hc = horizontal_coefficients;
 
     stbir__4_coeff_start();
+    STBIR_SIMD_NO_UNROLL_LOOP_START
     do {
       hc += 4;
       decode += STBIR__horizontal_channels * 4;
       stbir__4_coeff_continue_from_4( 0 );
       --n;
     } while ( n > 0 );
-    stbir__3_coeff_remnant( 4 ); 
+    stbir__3_coeff_remnant( 4 );
 
     stbir__store_output();
   } while ( output < output_end );
@@ -10216,26 +10485,26 @@ static void STBIR_chans( stbir__horizontal_gather_,_channels_with_n_coeffs_mod3
 
 static stbir__horizontal_gather_channels_func * STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_funcs)[4]=
 {
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod0),  
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod1),  
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod2),  
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod3),  
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod0),
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod1),
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod2),
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_n_coeffs_mod3),
 };
 
 static stbir__horizontal_gather_channels_func * STBIR_chans(stbir__horizontal_gather_,_channels_funcs)[12]=
 {
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_1_coeff),  
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_2_coeffs),  
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_1_coeff),
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_2_coeffs),
   STBIR_chans(stbir__horizontal_gather_,_channels_with_3_coeffs),
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_4_coeffs),  
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_5_coeffs),  
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_6_coeffs),  
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_4_coeffs),
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_5_coeffs),
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_6_coeffs),
   STBIR_chans(stbir__horizontal_gather_,_channels_with_7_coeffs),
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_8_coeffs),  
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_9_coeffs),  
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_10_coeffs),  
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_11_coeffs),  
-  STBIR_chans(stbir__horizontal_gather_,_channels_with_12_coeffs),  
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_8_coeffs),
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_9_coeffs),
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_10_coeffs),
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_11_coeffs),
+  STBIR_chans(stbir__horizontal_gather_,_channels_with_12_coeffs),
 };
 
 #undef STBIR__horizontal_channels
@@ -10266,38 +10535,38 @@ This software is available under 2 licenses -- choose whichever you prefer.
 ------------------------------------------------------------------------------
 ALTERNATIVE A - MIT License
 Copyright (c) 2017 Sean Barrett
-Permission is hereby granted, free of charge, to any person obtaining a copy of 
-this software and associated documentation files (the "Software"), to deal in 
-the Software without restriction, including without limitation the rights to 
-use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies 
-of the Software, and to permit persons to whom the Software is furnished to do 
+Permission is hereby granted, free of charge, to any person obtaining a copy of
+this software and associated documentation files (the "Software"), to deal in
+the Software without restriction, including without limitation the rights to
+use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies
+of the Software, and to permit persons to whom the Software is furnished to do
 so, subject to the following conditions:
-The above copyright notice and this permission notice shall be included in all 
+The above copyright notice and this permission notice shall be included in all
 copies or substantial portions of the Software.
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 
-AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER 
-LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, 
-OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE 
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
 SOFTWARE.
 ------------------------------------------------------------------------------
 ALTERNATIVE B - Public Domain (www.unlicense.org)
 This is free and unencumbered software released into the public domain.
-Anyone is free to copy, modify, publish, use, compile, sell, or distribute this 
-software, either in source code form or as a compiled binary, for any purpose, 
+Anyone is free to copy, modify, publish, use, compile, sell, or distribute this
+software, either in source code form or as a compiled binary, for any purpose,
 commercial or non-commercial, and by any means.
-In jurisdictions that recognize copyright laws, the author or authors of this 
-software dedicate any and all copyright interest in the software to the public 
-domain. We make this dedication for the benefit of the public at large and to 
-the detriment of our heirs and successors. We intend this dedication to be an 
-overt act of relinquishment in perpetuity of all present and future rights to 
+In jurisdictions that recognize copyright laws, the author or authors of this
+software dedicate any and all copyright interest in the software to the public
+domain. We make this dedication for the benefit of the public at large and to
+the detriment of our heirs and successors. We intend this dedication to be an
+overt act of relinquishment in perpetuity of all present and future rights to
 this software under copyright law.
-THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR 
-IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, 
-FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE 
-AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN 
-ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION 
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN
+ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION
 WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
 ------------------------------------------------------------------------------
 */
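
Reviewer note: the n_coeffs_mod0..mod3 hunks above all follow one pattern: four taps up front, then n groups of four inside a do/while, then a remnant of 0 to 3 taps, with the new loop-start marker placed directly before each do. The sketch below illustrates that pattern for a single channel in scalar C. It is not the library's code: SKETCH_NO_UNROLL_LOOP_START, sketch_gather_reference and sketch_gather_blocked are hypothetical names, and the pragma-based macro is only an assumption about how such a marker could be defined, not the definition of STBIR_NO_UNROLL_LOOP_START.

```c
#include <assert.h>

/* Hypothetical "do not unroll the next loop" marker (an assumption for this
   sketch, not the library's macro). Clang is tested first because it also
   defines __GNUC__. */
#if defined(__clang__)
#define SKETCH_NO_UNROLL_LOOP_START _Pragma("clang loop unroll(disable)")
#elif defined(__GNUC__) && (__GNUC__ >= 8)
#define SKETCH_NO_UNROLL_LOOP_START _Pragma("GCC unroll 1")
#else
#define SKETCH_NO_UNROLL_LOOP_START
#endif

/* Plain reference: weighted sum of `width` samples. */
static float sketch_gather_reference(const float *decode, const float *hc, int width)
{
  float sum = 0.0f;
  int i;
  for (i = 0; i < width; i++)
    sum += decode[i] * hc[i];
  return sum;
}

/* Blocked form: 4 taps, then n groups of 4, then (width & 3) remnant taps.
   Requires width >= 8 so the do/while body runs at least once. */
static float sketch_gather_blocked(const float *decode, const float *hc, int width)
{
  int remnant = width & 3;
  /* Same shape as the diff's n: ((count) - (4 + remnant) + 3) >> 2 */
  int n = ((width - (4 + remnant)) + 3) >> 2;
  float sum = 0.0f;
  int i;

  assert(width >= 8);

  /* First four taps (the role played by stbir__4_coeff_start). */
  sum += decode[0] * hc[0];
  sum += decode[1] * hc[1];
  sum += decode[2] * hc[2];
  sum += decode[3] * hc[3];

  /* Middle groups of four; pointers advance before each group is consumed. */
  SKETCH_NO_UNROLL_LOOP_START
  do {
    hc += 4;
    decode += 4;
    sum += decode[0] * hc[0];
    sum += decode[1] * hc[1];
    sum += decode[2] * hc[2];
    sum += decode[3] * hc[3];
    --n;
  } while (n > 0);

  /* Remnant taps sit at offset 4 from the last group, as in the diff. */
  for (i = 0; i < remnant; i++)
    sum += decode[4 + i] * hc[4 + i];

  return sum;
}
```

For any width of 8 or more, sketch_gather_blocked should visit the same taps in the same order as sketch_gather_reference. How the real library chooses between the 12-entry table and the 4-entry mod table is not shown in this part of the diff, so the sketch deliberately leaves that dispatch out.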