Performance optimization, and how to do it wrong
Optimization is hard. And sometimes, the compiler makes it even harder.
In addition to SIMD loads and fmadds, I use the optimized loop order and register blocking (via the seq macro) techniques from this paper. The benchmark uses unpadded, unstrided, undilated, and ungrouped convolutions, so I stripped out all padding checks and all stride/dilation calculations. That was faster, but still slow. To add padding, stride, and dilation back without tanking performance again, I decided to use compile-time monomorphization to specialize the common zero-padding and/or unit stride/dilation cases, so those cases skip the extra work.
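As a rough illustration of that monomorphization idea, here is a minimal scalar 1D sketch using Rust const generics. It is not the author's kernel: the names `conv1d`, `conv1d_impl`, and the flags `UNIT_STRIDE`, `UNIT_DILATION`, and `NO_PAD` are hypothetical, and the SIMD loads, fmadds, and register blocking are left out entirely. The point is only that each flag combination becomes its own compiled function, so the compiler can constant-fold the stride/dilation arithmetic and drop the padding branch where the flags allow it.

```rust
/// Hypothetical specialized kernel (illustration only, scalar and 1D).
/// Each `const bool` combination is monomorphized into a separate function.
fn conv1d_impl<const UNIT_STRIDE: bool, const UNIT_DILATION: bool, const NO_PAD: bool>(
    input: &[f32],
    kernel: &[f32],
    stride: usize,
    dilation: usize,
    pad: usize,
) -> Vec<f32> {
    // When a flag is set, the runtime value is replaced by a constant,
    // so the multiplications and bounds checks below can be folded away.
    let stride = if UNIT_STRIDE { 1 } else { stride };
    let dilation = if UNIT_DILATION { 1 } else { dilation };
    let pad = if NO_PAD { 0 } else { pad };
    assert!(stride >= 1 && dilation >= 1, "stride and dilation must be >= 1");

    if kernel.is_empty() {
        return Vec::new();
    }
    let span = dilation * (kernel.len() - 1) + 1; // input extent covered by the kernel
    let padded = input.len() + 2 * pad;
    if padded < span {
        return Vec::new();
    }
    let out_len = (padded - span) / stride + 1;

    let mut out = vec![0.0f32; out_len];
    for (o, out_v) in out.iter_mut().enumerate() {
        let mut acc = 0.0f32;
        for (k, &w) in kernel.iter().enumerate() {
            let pos = o * stride + k * dilation; // index into the (virtually) padded input
            if NO_PAD {
                // No padding: every position is in bounds, no check needed.
                acc += input[pos] * w;
            } else if pos >= pad && pos - pad < input.len() {
                // Padded positions contribute zero and are simply skipped.
                acc += input[pos - pad] * w;
            }
        }
        *out_v = acc;
    }
    out
}

/// Hypothetical runtime entry point: picks the specialized instance once,
/// outside the hot loops.
pub fn conv1d(input: &[f32], kernel: &[f32], stride: usize, dilation: usize, pad: usize) -> Vec<f32> {
    match (stride == 1, dilation == 1, pad == 0) {
        (true, true, true) => conv1d_impl::<true, true, true>(input, kernel, stride, dilation, pad),
        (true, true, false) => conv1d_impl::<true, true, false>(input, kernel, stride, dilation, pad),
        (true, false, true) => conv1d_impl::<true, false, true>(input, kernel, stride, dilation, pad),
        (true, false, false) => conv1d_impl::<true, false, false>(input, kernel, stride, dilation, pad),
        (false, true, true) => conv1d_impl::<false, true, true>(input, kernel, stride, dilation, pad),
        (false, true, false) => conv1d_impl::<false, true, false>(input, kernel, stride, dilation, pad),
        (false, false, true) => conv1d_impl::<false, false, true>(input, kernel, stride, dilation, pad),
        (false, false, false) => conv1d_impl::<false, false, false>(input, kernel, stride, dilation, pad),
    }
}
```

Calling `conv1d(&input, &kernel, 1, 1, 0)` takes the fully specialized path, while any other combination falls back to an instance that keeps only the checks it needs. The cost of this design is code size: three flags already produce eight copies of the kernel.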