Counting Words at SIMD Speed
Rewriting a word counting program five times until it's 494x faster.
So a call like re.finditer(pattern, data) spends nearly all of its time inside C, scanning contiguous memory with pointer arithmetic and table lookups.

This doesn't result in six vector compare-equal (cmeq) instructions because the compiler is able to work out that, while we need exact-match checks for ' ' and '\n', there are more efficient steps for the following groups:

While it looks like more instructions are touching the data, the key thing is that we're replacing an expensive sequence (another equality compare + mask merge + constant load) with a very cheap transform plus a single comparison.
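To make the "cheap transform plus a single comparison" idea concrete, here is a minimal scalar sketch in Python. It is not the article's code: the pattern `rb"\S+"`, the function names, and the choice to fold '\t' through '\r' into one contiguous range are all assumptions made for illustration, and the compiler's actual grouping may differ. The sketch only shows why a subtract followed by one unsigned compare can stand in for several equality checks.

```python
import re

# Assumed pattern (not taken from the article): a word is a maximal run
# of non-whitespace bytes. For bytes patterns, \s is [ \t\n\r\f\v].
WORD = re.compile(rb"\S+")


def count_words_regex(data: bytes) -> int:
    """Count words by letting the C regex engine scan the buffer."""
    return sum(1 for _ in WORD.finditer(data))


def is_space_six_compares(c: int) -> bool:
    """Naive check: one equality compare per whitespace byte."""
    return c in (0x20, 0x09, 0x0A, 0x0B, 0x0C, 0x0D)


def is_space_range_trick(c: int) -> bool:
    """Hypothetical grouping: '\\t'..'\\r' are contiguous (0x09..0x0D),
    so subtract 9 with 8-bit wraparound and do one unsigned compare."""
    return c == 0x20 or ((c - 0x09) & 0xFF) <= 4


def count_words_scalar(data: bytes) -> int:
    """Byte-at-a-time word count: a word starts at each transition
    from whitespace to non-whitespace."""
    count = 0
    in_word = False
    for c in data:
        space = is_space_range_trick(c)
        if not space and not in_word:
            count += 1
        in_word = not space
    return count


if __name__ == "__main__":
    sample = b"hello  world\nfoo\tbar"
    print(count_words_regex(sample), count_words_scalar(sample))
```

The 8-bit mask mimics what a vector lane does on wraparound: bytes below 0x09 become large unsigned values and fail the `<= 4` test, so a single comparison covers the whole range.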