Beating the L1 cache with value speculation (2021)
If we have a heuristic to guess some value cheaply, we can remove a data dependency in a tight loop using the branch predictor. This allows the CPU to run more instructions in parallel, increasing performance.
In this post I explain the machinery involved, including a primer on branch prediction and CPU caches, so that anybody with a passing knowledge of C and of how code is executed on CPUs should be able to follow. Executing many instructions at once is so important that dedicated hardware, the branch predictor, is present in all modern CPUs to make an educated guess about which way execution will go at every conditional jump. The code relies on the fact that node can't be NULL after we increment it if it is equal to next, which avoids an additional test and takes only 5 instructions per element in the happy path (from loop_body to je loop_body).
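To make the idea concrete, here is a minimal C sketch of value speculation on a linked-list sum. It assumes the nodes are usually laid out next to each other in memory, so that node + 1 is a cheap guess for node->next. The names Node, link_contiguous, sum_plain and sum_speculative are illustrative and not taken from the post. An optimizing compiler may well notice that both branches yield the same pointer and remove the guess entirely, so this shows the shape of the technique rather than a guaranteed speedup, and the 5-instruction count above refers to the post's assembly, not to whatever a compiler emits for this code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative node type: a value plus a pointer to the next node. */
typedef struct Node {
    uint64_t value;
    struct Node *next;
} Node;

/* Helper (hypothetical): chain an array of n nodes in memory order. */
static void link_contiguous(Node *nodes, size_t n) {
    assert(n > 0);
    for (size_t i = 0; i + 1 < n; i++) {
        nodes[i].next = &nodes[i + 1];
    }
    nodes[n - 1].next = NULL;
}

/* Baseline: each iteration must wait for node->next to be loaded
   before it can start working on the following node. */
static uint64_t sum_plain(const Node *node) {
    uint64_t total = 0;
    while (node) {
        total += node->value;
        node = node->next;
    }
    return total;
}

/* Speculative version: guess that the next node is adjacent in memory.
   When the guess is right, the new value of node comes from an increment
   rather than from the load, and the load is only used by the comparison,
   which the branch predictor lets the CPU run past. */
static uint64_t sum_speculative(const Node *node) {
    uint64_t total = 0;
    while (node) {
        total += node->value;
        const Node *next = node->next;
        node++;               /* the guess: next node is adjacent */
        if (node != next) {   /* wrong guess: fall back to the real pointer */
            node = next;
        }
        /* If node == next here, node was just incremented from a valid
           pointer, so it cannot be NULL on the happy path. */
    }
    return total;
}
```

Both functions return the same sum for any well-formed list; only the dependency between loop iterations differs. Keeping the guess alive in optimized builds generally needs some compiler-specific measure, which is outside the scope of this sketch.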