Basic Facts about GPUs
Making sure I don’t forget what I read.
For a single, complex operation with high potential arithmetic intensity (such as matrix multiplication), the strategy is tiling: threads within a block cooperate to load large tiles of the input matrices into the SM’s fast, on-chip Shared Memory, so that each value fetched from global memory is reused many times and the kernel can become compute-bound. A compute-bound kernel can still be slow if its FLOPs are inefficient (e.g., scalar FP32 math instead of specialized hardware such as Tensor Cores) or if the GPU runs below its peak clock speed because of power limits.
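The tiling idea can be sketched as a small CUDA kernel. This is a minimal illustration under stated assumptions, not a tuned implementation: the names `tiled_matmul`, `tiled_matmul_max_error`, and the tile width `TILE = 16` are hypothetical choices made here, the matrices are square and row-major, and the math is plain scalar FP32 (so it does not use Tensor Cores). Each block stages one tile of A and one tile of B in Shared Memory, and every staged value is then read `TILE` times from on-chip memory instead of from global memory.

```cuda
#include <cmath>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // illustrative tile width (an assumption, not a tuned value)

// Computes C = A * B for square n x n row-major matrices.
// Launch with TILE x TILE threads per block.
__global__ void tiled_matmul(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];  // tile of A staged in on-chip Shared Memory
    __shared__ float Bs[TILE][TILE];  // tile of B staged in on-chip Shared Memory

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;
        // Cooperative load: each thread fetches one element of each tile,
        // padding with zeros where the tile hangs over the matrix edge.
        As[threadIdx.y][threadIdx.x] = (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // both tiles must be fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k) {
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();  // finish reading before the next iteration overwrites the tiles
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: runs the kernel and returns the largest absolute
// difference from a CPU reference, or INFINITY if any CUDA call fails.
float tiled_matmul_max_error(int n) {
    size_t count = (size_t)n * n, bytes = count * sizeof(float);
    std::vector<float> hA(count), hB(count), hC(count);
    for (size_t i = 0; i < count; ++i) {
        hA[i] = (float)(i % 7) * 0.5f;
        hB[i] = (float)(i % 5) - 2.0f;
    }

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    bool ok = cudaMalloc(&dA, bytes) == cudaSuccess &&
              cudaMalloc(&dB, bytes) == cudaSuccess &&
              cudaMalloc(&dC, bytes) == cudaSuccess &&
              cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess &&
              cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice) == cudaSuccess;
    if (ok) {
        dim3 block(TILE, TILE);
        dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
        tiled_matmul<<<grid, block>>>(dA, dB, dC, n);
        ok = cudaGetLastError() == cudaSuccess &&
             cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost) == cudaSuccess;
    }
    cudaFree(dA); cudaFree(dB); cudaFree(dC);  // cudaFree(nullptr) is a no-op
    if (!ok) return INFINITY;

    // Naive CPU reference for comparison.
    float max_err = 0.0f;
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < n; ++k) ref += hA[(size_t)i * n + k] * hB[(size_t)k * n + j];
            max_err = std::fmax(max_err, std::fabs(ref - hC[(size_t)i * n + j]));
        }
    return max_err;
}
```

To try it, call `tiled_matmul_max_error(n)` from a `main` function and compile with `nvcc`. With a tile width of `TILE`, each input element is fetched from global memory roughly `n / TILE` times rather than `n` times, which is where the gain in arithmetic intensity comes from. A larger tile increases reuse further but consumes more Shared Memory per block.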