Static search trees: faster than binary search
Table of Contents

1 Introduction
  1.1 Problem statement
  1.2 Recommended reading
  1.3 Binary search and Eytzinger layout
  1.4 Hugepages
  1.5 A note on benchmarking
  1.6 Cache lines
  1.7 S-trees and B-trees
2 Optimizing find
  2.1 Linear
  2.2 Auto-vectorization
  2.3 Trailing zeros
  2.4 Popcount
  2.5 Manual SIMD
3 Optimizing the search
  3.1 Batching
  3.2 Prefetching
  3.3 Pointer arithmetic
    3.3.1 Up-front splat
    3.3.2 Byte-based pointers
    3.3.3 The final version
  3.4 Skip prefetch
  3.5 Interleave
4 Optimizing the tree layout
  4.1 Left-tree
  4.2 Memory layouts
  4.3 Node size \(B=15\)
    4.3.1 Data structure size
  4.4 Summary
5 Prefix partitioning
  5.1 Full layout
  5.2 Compact subtrees
  5.3 The best of both: compact first level
  5.4 Overlapping trees
  5.5 Human data
  5.6 Prefix map
  5.7 Summary
6 Multi-threaded comparison
7 Conclusion
  7.1 Future work
    7.1.1 Branchy search
    7.1.2 Interpolation search
    7.1.3 Packing data smaller
    7.1.4 Returning indices in original data
    7.1.5 Range queries

In this post, we will implement a static search tree (S+ tree) for high-throughput searching of sorted data, as introduced on Algorithmica. We'll mostly take the code presented there as a starting point, and optimize it to its limits. For a large part, I'm simply taking the 'future work' ideas of that post and implementing them. And then there will be a bunch of looking at assembly code to shave off all the instructions we can. Lastly, there will be one big addition to optimize throughput: batching.
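To make the problem concrete, here is a minimal Rust sketch of the baseline we are trying to beat: a lower-bound binary search over sorted u32 values, plus a trivially batched wrapper. This is not the post's actual code; the names `lower_bound` and `batch_lower_bound` are mine, and I'm assuming the query semantics are "smallest element at least the query value", with `u32::MAX` as the not-found sentinel.

```rust
/// Return the smallest value in `data` (sorted ascending) that is >= `q`,
/// or u32::MAX if no such value exists. (Sentinel choice is an assumption.)
fn lower_bound(data: &[u32], q: u32) -> u32 {
    // Classic binary search over the index range [0, data.len()).
    let mut lo = 0;
    let mut hi = data.len();
    while lo < hi {
        let mid = lo + (hi - lo) / 2;
        if data[mid] < q {
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    data.get(lo).copied().unwrap_or(u32::MAX)
}

/// Throughput-oriented variant: answer a whole batch of queries in one call.
/// Here it just loops; the point of batching is to overlap the memory
/// latency of many independent searches (prefetching, interleaving, SIMD).
fn batch_lower_bound(data: &[u32], queries: &[u32]) -> Vec<u32> {
    queries.iter().map(|&q| lower_bound(data, q)).collect()
}

fn main() {
    let data = vec![3, 7, 7, 10, 42, 100];
    assert_eq!(lower_bound(&data, 8), 10);
    assert_eq!(batch_lower_bound(&data, &[0, 7, 101]), vec![3, 7, u32::MAX]);
}
```

The interesting question throughout the post is how much faster than this we can go when the data is static, laid out cache-friendly, and queries arrive in bulk.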
Figure 28: Building the overlapping trees for k-mers of the human genome takes much more space; even using only 16 parts regularly requires up to 50% overhead, making this data structure not quite practical.

Although memory usage is now similar to the unpartitioned version, queries for large inputs are slightly slower than with the previous layouts, due to the additional index required. On the other hand, the layers saved would mostly be the quick lookups near the root of the tree, and introducing branches into the code could cause quite a bit of delay due to mispredictions.