Memory and ILP handling in 2D convolutions
transpose fast like sanic
The discrete 2D convolution is defined as

\[(f * g)(x, y) = \sum_{i=0}^{K-1} \sum_{j=0}^{K-1} f(x + i,\, y + j)\, g(i, j)\]

where \(f \in \mathbb{R}^{N \times M}\) is the input image and \(g \in \mathbb{R}^{K \times K}\) is the filter: an array of floats that serves as the trainable parameters of the network, commonly initialized from a random distribution and trained using stochastic gradient descent. While the algorithm is agnostic to its inputs, we'll consider a batch of 512 images from the MNIST dataset, stored linearly in memory as single-precision floats, each with height and width \(M = N = 28\) and one luminosity channel.

Exploiting instruction-level parallelism here means keeping the CPU's wide SIMD execution units fed. For example, the Floating Point Unit on the Zen 2 microarchitecture manages four 256-bit-wide execution pipes that can handle x87, MMX, SSE, and AVX instructions.
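To make the data layout concrete, here is a minimal sketch of the direct convolution in C. The scalar loop nest, the `conv2d_batch` name, and the "valid" output size of \((N - K + 1) \times (M - K + 1)\) are assumptions for illustration, not an optimized implementation:

```c
#include <stddef.h>

#define H 28 /* image height, N */
#define W 28 /* image width,  M */

/* Direct 2D convolution over a batch of H x W single-channel images
   stored contiguously. `out` holds (H-K+1) x (W-K+1) floats per image
   ("valid" convolution, no padding). */
static void conv2d_batch(const float *in, const float *filt, float *out,
                         size_t batch, size_t K)
{
    size_t oh = H - K + 1, ow = W - K + 1;
    for (size_t b = 0; b < batch; ++b) {
        const float *img = in + b * H * W;
        float *dst = out + b * oh * ow;
        for (size_t y = 0; y < oh; ++y) {
            for (size_t x = 0; x < ow; ++x) {
                float acc = 0.0f;
                /* Accumulate the K x K window against the filter. */
                for (size_t i = 0; i < K; ++i)
                    for (size_t j = 0; j < K; ++j)
                        acc += img[(y + i) * W + (x + j)] * filt[i * K + j];
                dst[y * ow + x] = acc;
            }
        }
    }
}
```

The two innermost loops are where the 256-bit pipes earn their keep: each can process eight single-precision floats per instruction, provided the loads are laid out so the hardware can actually issue them back to back.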