Bit-permuting 16 u32s at once with AVX-512
The basic trick to apply the same bit-permutation to each of the u32s is to view them as a matrix of 16 rows by 32 columns, transpose it ...
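To make the matrix view concrete, here is a minimal scalar sketch in Python of the transpose-shuffle-transpose idea. It models the data flow only and is not the AVX-512 implementation; the function names and the convention that `perm[i]` names the source bit for output bit `i` are assumptions made for this illustration.

```python
def transpose_bits(words, width):
    """Transpose a bit matrix whose rows are `words`, each `width` bits wide.

    Returns `width` integers; bit r of result[c] equals bit c of words[r].
    """
    out = [0] * width
    for r, w in enumerate(words):
        for c in range(width):
            out[c] |= ((w >> c) & 1) << r
    return out


def permute_bits_each(words, perm):
    """Apply one bit-permutation to every 32-bit word in `words`.

    Assumed convention: output bit i is taken from input bit perm[i].
    """
    # 16 rows x 32 columns -> 32 rows of 16 bits: row c holds bit c of every word.
    rows = transpose_bits(words, 32)
    # Permuting bits inside each word is now a permutation of whole 16-bit rows.
    shuffled = [rows[perm[i]] for i in range(32)]
    # Transpose back to 16 words of 32 bits.
    return transpose_bits(shuffled, 16)
```

For example, `permute_bits_each(words, [31 - i for i in range(32)])` bit-reverses every word in a list of 16 u32 values. The step in the middle is the part that a permutation of 16-bit elements can do in the vector version.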
Taking the solutions from that AVX512 BPC permute solver and making them the bread of the transpose-shuffle-transpose sandwich is a valid solution, but there is room for improvement: the sandwich ends up with 3 back-to-back permutes in the middle, which can perhaps be merged. Keeping the most-significant bit in place means the first "transpose" no longer needs a shuffle at the end. In exchange, instead of shuffling each pair of adjacent bytes the same way (which a permutation of 16-bit elements accomplishes for free), we need to shuffle the low and high halves of a 64-byte vector the same way. That requires some pre-processing of the index vector: duplicate a 32-byte vector into the top and bottom of a 64-byte vector, and add 32 to every byte in the top half. The second not-quite-transpose in the new sandwich, corresponding to a bit-order of 3,4,5,6,7,0,1,2,8, starts with a simple permute: byte-reverse every u64.
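The index-vector pre-processing and the byte-reverse permute can be sketched with a scalar model in Python. The sketch assumes a full-width byte permute with the semantics `out[i] = data[idx[i] & 63]`, which is how the AVX-512 VBMI byte permute (`vpermb`) behaves. The helper names are hypothetical and chosen only for this illustration.

```python
def permute_bytes(data, idx):
    """Scalar model of a 64-byte permute: out[i] = data[idx[i] & 63]."""
    return [data[j & 63] for j in idx]


def widen_index(idx32):
    """Turn a 32-byte index vector into a 64-byte one that shuffles
    the low and high halves of a 64-byte vector the same way."""
    assert len(idx32) == 32 and all(0 <= i < 32 for i in idx32)
    low = list(idx32)                 # bottom half: indices unchanged
    high = [i + 32 for i in idx32]    # top half: same pattern, offset by 32
    return low + high


def byte_reverse_u64_index():
    """Index vector that reverses the 8 bytes inside every u64 lane."""
    # XOR with 7 flips the byte position within each aligned group of 8.
    return [i ^ 7 for i in range(64)]
```

With `data = list(range(64))`, applying `permute_bytes(data, widen_index(idx32))` reorders bytes 0..31 according to `idx32` and bytes 32..63 in the same pattern. Adding 32 is what keeps the top half reading from the top half instead of the bottom.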