Get the latest tech news
Regular expression search with suffix arrays (2015)
Back in January of 2012, Russ Cox posted an excellent blog post detailing how Google Code Search had worked, using a trigram index. By that point, I’d already implemented early versions of my own livegrep source-code search engine, using a different indexing approach that I developed independently, with input from a few friends. This post is my long-overdue writeup of how it works. Suffix Arrays A suffix array is a data structure used for full-text search and other applications, primarily these days in the field of bioinformatics.
(This is something of a simplification; livegrep actually uses a number of suffix arrays, instead of a single giant one, and we compress the input by deduplicating identical lines, which makes the file content map more complicated. The interesting feature of this approach is that applying a filename restriction doesn’t actually limit the search space that much — livegrep still walks the entire suffix array, and still considers each resulting match (although it can do a faster check against the content map, instead of calling into RE2). Notably, livegrep actually compresses the input significantly, which reduces the memory needed and speeds up the searches, at the cost of additional complexity building the index and interpreting results.
Or read this on Hacker News