Enable better vectorization for generic convolution

Break the single dependence chain into two parallel sub-chains so that
the compiler can vectorize the loop more effectively. This gives a 2-4%
performance uplift, measured on modern ARM systems, when the generic
codepath is used.
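
The diff itself is not shown here, so the following is only a minimal sketch of the general technique, not the code from this commit. It assumes a plain integer multiply-accumulate loop; the function names `mac_single_chain` and `mac_two_chains` are hypothetical. The first version carries one accumulator through every iteration, so each addition waits on the previous one. The second splits the work across two independent accumulators that are combined at the end.

```c
#include <stddef.h>
#include <stdint.h>

/* Baseline (hypothetical): every multiply-accumulate depends on the
 * previous value of acc, forming one serial dependence chain. */
int32_t mac_single_chain(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)x[i] * h[i];
    return acc;
}

/* Same sum with two independent accumulators. acc0 takes the even
 * indices and acc1 the odd ones, so the two chains do not depend on
 * each other until the final add. Inputs are assumed small enough
 * that the 32-bit sums do not overflow. */
int32_t mac_two_chains(const int16_t *x, const int16_t *h, size_t n)
{
    int32_t acc0 = 0;
    int32_t acc1 = 0;
    size_t i = 0;

    for (; i + 1 < n; i += 2) {
        acc0 += (int32_t)x[i] * h[i];
        acc1 += (int32_t)x[i + 1] * h[i + 1];
    }
    /* Handle the last element when n is odd. */
    if (i < n)
        acc0 += (int32_t)x[i] * h[i];

    return acc0 + acc1;
}
```

With integer arithmetic and no overflow, both versions return the same value. With floating-point data the split changes the order of the additions, so results may differ in the last bits.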
Mahesh Madhav 2024-04-24 20:32:08 +00:00
parent 49ab34dfef
commit 1fd5cd8c7e
1 changed file with 692 additions and 641 deletions

File diff suppressed because it is too large.