Experiments with Byte Matrix Multiplication
Various implementations of byte matrix multiplication - serge-sans-paille/i8mm
The important part is that the instruction sums adjacent integers after an element-wise multiplication, which is probably why the Clang compiler does not generate such instructions. This approach is relatively fast, but the intermediate sum of two int16_t products is computed with saturation by _mm_maddubs_epi16, so data can be lost if the inputs take extreme values. Gemmology inherits from intgemm the notion of prepared matrices, where the data layout change is done in a dedicated routine so that the computation itself can be faster.
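To make the saturation issue concrete, here is a minimal scalar sketch in C of what a single 16-bit lane of _mm_maddubs_epi16 computes: an unsigned byte times a signed byte for each of two adjacent positions, with the two products added under signed 16-bit saturation. The helper names (saturate_i16, maddubs_lane, exact_lane) are hypothetical and exist only for illustration; this is a model of the arithmetic, not code from i8mm, gemmology, or intgemm.

```c
#include <stdint.h>

/* Hypothetical helper: clamp a 32-bit value to the int16_t range. */
static int16_t saturate_i16(int32_t v) {
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Scalar model of one 16-bit output lane of _mm_maddubs_epi16:
 * a0, a1 are unsigned bytes, b0, b1 are signed bytes. The two products
 * are added and the sum is saturated to int16_t. */
static int16_t maddubs_lane(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1) {
    int32_t sum = (int32_t)a0 * b0 + (int32_t)a1 * b1;
    return saturate_i16(sum);
}

/* Same computation kept in 32 bits, so nothing is lost. */
static int32_t exact_lane(uint8_t a0, uint8_t a1, int8_t b0, int8_t b1) {
    return (int32_t)a0 * b0 + (int32_t)a1 * b1;
}
```

With the extreme inputs 255 and 127 in both positions, the exact sum is 255 * 127 + 255 * 127 = 64770, which does not fit in an int16_t, so the saturating version returns 32767 instead. For small inputs the two functions agree.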