Phrase matching in Marginalia Search
Marginalia Search now properly supports phrase matching. This not only permits a more robust implementation of quoted search queries, but also helps promote results where the search terms occur in the document in exactly the same order as they do in the query. This is a write-up about implementing this change. It is going to be a relatively long post, as it represents about 4 months of work. I’m also happy and grateful to announce that the NLnet people reached out after the run of the grant was over, asked me if I had more work in the pipeline, and agreed to fund this change as well!
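To make "the search terms occur in the document in exactly the same order" concrete, here is a minimal sketch of phrase matching over per-term position lists. It is not Marginalia's implementation: the class name `PhraseMatchSketch`, the method `containsPhrase`, and the idea of passing one sorted `int[]` of word positions per query term are all assumptions made for illustration.

```java
import java.util.Arrays;

// Hypothetical illustration only; not code from Marginalia Search.
final class PhraseMatchSketch {

    /**
     * positions[i] holds the sorted word positions of the i-th query term
     * within a single document. The phrase matches if some position p of the
     * first term is followed by term i at position p + i for every i.
     */
    static boolean containsPhrase(int[][] positions) {
        if (positions.length == 0) {
            return false;
        }
        for (int start : positions[0]) {
            boolean matches = true;
            for (int i = 1; i < positions.length && matches; i++) {
                // Each list is sorted, so a binary search finds the expected slot.
                matches = Arrays.binarySearch(positions[i], start + i) >= 0;
            }
            if (matches) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        int[] quick = {1, 10};
        int[] brown = {2, 7};
        int[] fox = {3};
        System.out.println(containsPhrase(new int[][] {quick, brown, fox}));
        System.out.println(containsPhrase(new int[][] {fox, brown, quick}));
    }
}
```

The same terms in a different order do not match, which is what allows an index with positions to tell a quoted phrase apart from a document that merely contains all the words.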
It would be logistically impossible to store all word n-grams present in every document due to combinatorial explosion, so limiting heuristics were needed to identify which ones were likely to be important.
There are even faster vectorized implementations of varints (e.g. Lemire et al.’s Stream VByte), but I’ve been unable to get them to perform well in Java, largely due to inadequate access to the PSHUFB instruction.
The PR consisted of over 200 commits, with a delta of nearly 20,000 lines of code changed. It grew that large because the change breaks binary compatibility for the index, so it could not be merged until everything was done.
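For readers unfamiliar with the varints mentioned above, the following is a minimal sketch of the plain byte-at-a-time variant, applied to delta-encoded positions. It is a generic base-128 scheme (7 payload bits per byte, high bit set when more bytes follow) and is not claimed to be the exact encoding Marginalia uses; the class name `VarintSketch` and its methods are hypothetical. Stream VByte differs by separating the length information from the data bytes so that decoding can be vectorized.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration only; not code from Marginalia Search.
final class VarintSketch {

    /** Encodes ascending, non-negative positions as varint-coded deltas. */
    static byte[] encode(int[] positions) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int previous = 0;
        for (int position : positions) {
            int delta = position - previous;
            previous = position;
            // Emit 7 bits at a time, low bits first; 0x80 marks "more to come".
            while ((delta & ~0x7F) != 0) {
                out.write((delta & 0x7F) | 0x80);
                delta >>>= 7;
            }
            out.write(delta);
        }
        return out.toByteArray();
    }

    /** Decodes {@code count} positions written by {@link #encode}. */
    static int[] decode(byte[] data, int count) {
        int[] positions = new int[count];
        int offset = 0;
        int previous = 0;
        for (int i = 0; i < count; i++) {
            int delta = 0;
            int shift = 0;
            byte b;
            do {
                b = data[offset++];
                delta |= (b & 0x7F) << shift;
                shift += 7;
            } while ((b & 0x80) != 0);
            previous += delta;   // undo the delta coding
            positions[i] = previous;
        }
        return positions;
    }

    public static void main(String[] args) {
        int[] positions = {3, 130, 131, 20000};
        byte[] encoded = encode(positions);
        System.out.println(encoded.length + " bytes for " + positions.length + " positions");
        System.out.println(java.util.Arrays.toString(decode(encoded, positions.length)));
    }
}
```

Small gaps between consecutive positions fit in a single byte, which is why delta coding and varints are commonly paired for position lists. The data-dependent branch in the decode loop is the part that vectorized schemes try to avoid.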