Performance optimization, and how to do it wrong

Optimization is hard. And sometimes, the compiler makes it even harder.

In addition to SIMD loads and fused multiply-adds (fmadds), I use the optimized loop order and register blocking (via the seq macro) techniques from this paper. The benchmark uses unpadded, unstrided, undilated and ungrouped convolutions, so I first stripped all padding checks and all stride/dilation calculations. That was faster, but still slow. To add padding, stride and dilation back without tanking performance again, I decided to use compile-time monomorphization, so that the common zero-padding and unit stride/dilation cases compile to specialized code that skips those checks and calculations entirely.
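To make the monomorphization idea concrete, here is a minimal sketch using Rust const generics. It is not the article's actual kernel: it is a scalar, single-channel 1D convolution with no SIMD or register blocking, and the names `conv1d` and `conv1d_dyn` are hypothetical. It only illustrates how padding, stride and dilation can become compile-time constants, so that each combination gets its own specialized copy of the loop.

```rust
/// Scalar 1D convolution (cross-correlation) over one channel.
/// Hypothetical illustration only: PAD, STRIDE and DILATION are const
/// generics, so the compiler emits a separate copy of this function per
/// combination and can fold away the checks that do not apply.
fn conv1d<const PAD: usize, const STRIDE: usize, const DILATION: usize>(
    input: &[f32],
    kernel: &[f32],
) -> Vec<f32> {
    assert!(STRIDE > 0 && DILATION > 0, "stride and dilation must be nonzero");
    if kernel.is_empty() {
        return Vec::new();
    }
    // Number of input positions covered by the dilated kernel.
    let span = DILATION * (kernel.len() - 1) + 1;
    let padded_len = input.len() + 2 * PAD;
    if padded_len < span {
        return Vec::new();
    }
    let out_len = (padded_len - span) / STRIDE + 1;
    let mut out = vec![0.0f32; out_len];
    for (o, out_val) in out.iter_mut().enumerate() {
        let mut acc = 0.0f32;
        for (k, &w) in kernel.iter().enumerate() {
            // Position in padded coordinates. With STRIDE == 1 and
            // DILATION == 1 the multiplications are by a known constant 1.
            let pos = o * STRIDE + k * DILATION;
            if PAD == 0 {
                // Constant condition: the bounds test below does not
                // exist in the PAD == 0 instantiation.
                acc += w * input[pos];
            } else if pos >= PAD && pos - PAD < input.len() {
                acc += w * input[pos - PAD];
            }
            // Otherwise the tap falls in the zero padding and adds nothing.
        }
        *out_val = acc;
    }
    out
}

/// Runtime dispatch onto a few pre-instantiated variants.
/// Returns None for combinations this sketch does not instantiate.
fn conv1d_dyn(
    input: &[f32],
    kernel: &[f32],
    pad: usize,
    stride: usize,
    dilation: usize,
) -> Option<Vec<f32>> {
    match (pad, stride, dilation) {
        (0, 1, 1) => Some(conv1d::<0, 1, 1>(input, kernel)),
        (1, 1, 1) => Some(conv1d::<1, 1, 1>(input, kernel)),
        (0, 2, 1) => Some(conv1d::<0, 2, 1>(input, kernel)),
        (0, 1, 2) => Some(conv1d::<0, 1, 2>(input, kernel)),
        _ => None,
    }
}
```

A caller that only knows the parameters at runtime would go through `conv1d_dyn(&x, &k, 0, 1, 1)`, which selects the specialized copy once, outside the hot loop. The trade-off is code size: every instantiated combination is a separate function in the binary, which is why a sketch like this lists only a handful of common cases.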
