Load-Store Conflicts
meshoptimizer implements several geometry compression algorithms that are designed to take advantage of redundancies common in mesh data and to decompress quickly, targeting many gigabytes per second of decoding throughput. One of them, the index decoder, has recently shown significant and unexpected variance in performance across multiple compilers and compiler releases; on closer investigation, the differences can mostly be attributed to a single microarchitectural detail that is not often talked about. So I thought it would be interesting to write about it.
Needless to say, this was extremely expensive: it was common to see code that spent most of its time in load-hit-store (LHS) stalls on innocuous instruction sequences, like repeatedly incrementing the size field stored inside a structure.
To spare you a few hours of bisection to find the offending gcc commit and compare the loop code alongside performance metrics, let’s just immediately look at the way gcc-15 compiles the FIFO access now:
This isn’t the first time I’ve encountered store-to-load forwarding issues on x86_64 CPUs; however, I’m more used to these happening as a result of code that explicitly tries to load or store mismatched element sizes.
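For readers who have not run into the mismatched-size case, here is a minimal sketch of what it can look like. This is a hypothetical example written for illustration, not code from meshoptimizer, and the function name `narrow_stores_wide_load` is made up. It writes four individual bytes and then reads the same memory back as one 32-bit value; when both the stores and the load actually reach memory, the wide load spans several narrower pending stores, which is the kind of pattern that store-to-load forwarding typically cannot serve.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical illustration, not taken from meshoptimizer.
// Four 1-byte stores are followed by one 4-byte load of the same bytes.
// Note: an optimizing compiler is free to keep all of this in registers,
// in which case no stores or loads are emitted at all; the pattern only
// matters when the accesses really go through memory.
uint32_t narrow_stores_wide_load(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    uint8_t buf[4];
    buf[0] = a; // 1-byte store
    buf[1] = b; // 1-byte store
    buf[2] = c; // 1-byte store
    buf[3] = d; // 1-byte store

    uint32_t word;
    std::memcpy(&word, buf, sizeof(word)); // 4-byte load covering all four stores
    return word;
}
```

The returned value holds the four bytes in memory order, so its numeric value depends on the endianness of the machine. Whether a given compiler emits this as real stores and a load depends on the optimization level and the surrounding code, so the generated assembly is worth checking before drawing conclusions from a benchmark.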