Histogramming Bytes with Positional Popcount
A while ago, after some back and forth on Twitter/X with @corsix, I dropped an implementation of byte histogramming without explaini...
The amount of "heavy vertical summation" (assuming that it is a heavy operation) can be reduced by first putting the rows through some "vertical" carry-save adders. These work out to a few cheap bitwise operations, and are especially cheap on AVX512 / AVX10 thanks to ternlog. This technique is also discussed in "Efficient Computation of Positional Population Counts Using SIMD Instructions", and that paper and its associated code show various ways to do this without using GF2P8AFFINEQB. The pospopcnt as I described it in the previous paragraph only produces 64 counts, so it's suitable for making a histogram of data in the range from 0 up to (but not including) 64.
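To make the carry-save idea concrete, here is a minimal scalar sketch in Python. It is not the SIMD implementation discussed above. It assumes that each input value below 64 is turned into a 64-bit row with a single bit set at the position of the value, and that values of 64 and above are simply skipped. The names `csa`, `add_weighted`, `pospopcnt64` and `histogram_below_64` are made up for this illustration. Every four rows are compressed with three carry-save adders, so the expensive per-position summation only runs once per four rows, on a row whose bits are each worth 4.

```python
def csa(a, b, c):
    """Carry-save adder: a per-bit full adder over three rows.

    Returns (sum, carry) such that a + b + c == sum + 2 * carry
    holds independently at every bit position.
    """
    partial = a ^ b
    return partial ^ c, (a & b) | (partial & c)


def add_weighted(counts, row, weight):
    """The "heavy vertical summation": add weight to counts[j] for each set bit j."""
    j = 0
    while row:
        if row & 1:
            counts[j] += weight
        row >>= 1
        j += 1


def pospopcnt64(rows):
    """Positional popcount of 64-bit rows; returns 64 per-position counts."""
    rows = list(rows)
    rows += [0] * (-len(rows) % 4)  # pad with zero rows to a multiple of 4
    counts = [0] * 64
    ones = twos = 0  # bit-sliced partial sums, bits worth 1 and 2
    for i in range(0, len(rows), 4):
        ones, twos_a = csa(ones, rows[i], rows[i + 1])
        ones, twos_b = csa(ones, rows[i + 2], rows[i + 3])
        twos, fours = csa(twos, twos_a, twos_b)
        add_weighted(counts, fours, 4)  # one heavy step per 4 rows
    # Flush what is still held in the bit-sliced partial sums.
    add_weighted(counts, twos, 2)
    add_weighted(counts, ones, 1)
    return counts


def histogram_below_64(data):
    """Histogram of the values below 64 (assumption: larger values are ignored)."""
    return pospopcnt64(1 << v for v in data if v < 64)
```

Calling `histogram_below_64(bytes([1, 1, 5]))` gives a list of 64 counts with 2 at index 1 and 1 at index 5. Adding more carry-save levels (eights, sixteens, and so on) would reduce the number of heavy steps further, at the cost of a few more bitwise operations per row.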