Coding Neon Kernels for the Cortex-A53
Some weeks ago, I presented my work-in-progress high-performance SDR runtime, qsdr, at FOSDEM. I showed a hand-written NEON assembly implementation of a kernel that computes \(y[n] = ax[n] + b\), which I used as the basic math block for benchmarks on a Kria KV260 board (which has a quad-core ARM Cortex-A53 at 1.33 GHz).
Computing each output with an fmul followed directly by the fadd that depends on it is very bad for performance, because the fadd needs to stall for three cycles to wait for the result of the fmul to be available, but my main point in writing it that way is to explain how the math will be calculated. LLVM has failed to understand that four NEON registers instead of two are necessary to properly hide the fmul and fadd result latency (interestingly, llvm-mca predicts a stall of 4 cycles rather than 2).
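To make the data flow concrete, here is a minimal C sketch of the idea of keeping four independent vectors in flight, so that every fadd is separated from the fmul it depends on by three unrelated instructions. It is not the qsdr kernel: the function name `scale_add_f32`, the requirement that `n` be a multiple of 16, and the scalar fallback for non-ARM builds are all assumptions made for illustration. The compiler is also free to reorder the intrinsics and to allocate registers differently, so this only expresses the dependency structure, not a guaranteed instruction schedule.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not part of qsdr): computes y[i] = a * x[i] + b.
 * Assumption: n is a multiple of 16, so the sketch needs no tail loop. */
void scale_add_f32(float *y, const float *x, float a, float b, size_t n)
{
    assert(n % 16 == 0);
#if defined(__ARM_NEON)
    const float32x4_t va = vdupq_n_f32(a); /* broadcast a to all lanes */
    const float32x4_t vb = vdupq_n_f32(b); /* broadcast b to all lanes */

    for (size_t i = 0; i < n; i += 16) {
        /* Four independent vectors (16 floats) per iteration. */
        float32x4_t v0 = vld1q_f32(x + i);
        float32x4_t v1 = vld1q_f32(x + i + 4);
        float32x4_t v2 = vld1q_f32(x + i + 8);
        float32x4_t v3 = vld1q_f32(x + i + 12);

        /* All multiplies first: none depends on another. */
        v0 = vmulq_f32(v0, va);
        v1 = vmulq_f32(v1, va);
        v2 = vmulq_f32(v2, va);
        v3 = vmulq_f32(v3, va);

        /* Each add is written three instructions after its multiply. */
        v0 = vaddq_f32(v0, vb);
        v1 = vaddq_f32(v1, vb);
        v2 = vaddq_f32(v2, vb);
        v3 = vaddq_f32(v3, vb);

        vst1q_f32(y + i, v0);
        vst1q_f32(y + i + 4, v1);
        vst1q_f32(y + i + 8, v2);
        vst1q_f32(y + i + 12, v3);
    }
#else
    /* Scalar fallback so the sketch also builds on non-ARM hosts. */
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + b;
    }
#endif
}
```

Calling `scale_add_f32(y, x, 2.0f, 1.0f, 32)` with `x[i] = i` fills `y[i]` with `2*i + 1`. Whether the generated code really keeps the multiplies and adds apart has to be checked in the disassembly, which is exactly where the compiler output discussed here falls short.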