Show HN: I built a toy TPU that can do inference and training on the XOR problem
An attempt to understand and build a TPU—by complete novices.
Because everything in our design was staggered, we realized that we could very elegantly assert a start signal for a single clock cycle at the first accumulator and have it propagate to neighbouring modules exactly when they needed to be turned on. While we were working on this, we realized we could make one last small optimization for the activation derivative module: since we only use the $\mathbf{H}_2$ values once (for computing $\frac{\partial \mathbf{H}_2}{\partial \mathbf{Z}_2}$), we created a tiny cache within the VPU instead of storing them in the unified buffer (UB).

We can also confirm that every single bit of the instruction is needed, and we ensured that the instruction set could not be made any smaller without compromising the speed and efficiency of the TPU.
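The staggered start signal described above behaves like a shift register. The Python sketch below is a hypothetical software model, not the project's hardware: it assumes each module adds exactly one register stage (one clock cycle of delay) before passing the signal to its neighbour, and the names `simulate_start_pulse` and `first_active_cycle` are illustrative.

```python
from typing import List, Optional


def simulate_start_pulse(num_modules: int, num_cycles: int) -> List[List[int]]:
    """Simulate a one-cycle start pulse travelling down a chain of modules.

    Returns a table indexed as trace[cycle][module], holding the value of the
    start wire seen by each module on each clock cycle.
    """
    if num_modules < 1:
        raise ValueError("need at least one module")
    # Registered copy of the start signal held by each module (one flip-flop each).
    start_reg = [0] * num_modules
    trace: List[List[int]] = []
    for cycle in range(num_cycles):
        # The pulse is asserted externally for a single cycle, at module 0 only.
        external = 1 if cycle == 0 else 0
        # Module k sees whatever module k-1 registered on the previous cycle.
        inputs = [external] + start_reg[:-1]
        trace.append(inputs)
        # Clock edge: every module latches its input.
        start_reg = inputs
    return trace


def first_active_cycle(trace: List[List[int]], module: int) -> Optional[int]:
    """Return the cycle on which `module` first sees the start pulse, if ever."""
    for cycle, row in enumerate(trace):
        if row[module]:
            return cycle
    return None


if __name__ == "__main__":
    for cycle, row in enumerate(simulate_start_pulse(num_modules=4, num_cycles=6)):
        print(cycle, row)
```

In this model, module k turns on exactly k cycles after module 0 and sees the pulse for exactly one cycle. If the data reaching each module is skewed by the same one cycle per module, the start signal lines up with it without per-module counters or extra control wiring.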
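The $\mathbf{H}_2$ cache in the VPU can also be sketched in Python. This is a minimal model under stated assumptions, not the project's VPU: it assumes a leaky-ReLU activation with a hypothetical slope `ALPHA = 0.01` (the text does not name the activation), and `ToyVPU` with its methods are illustrative names. Under that assumption, leaky ReLU preserves the sign of its input, so $\frac{\partial \mathbf{H}_2}{\partial \mathbf{Z}_2}$ can be computed elementwise from the cached $\mathbf{H}_2$ alone, without reading $\mathbf{Z}_2$ back from the UB.

```python
from typing import Dict, List, Optional

# Assumed leaky-ReLU negative slope (hypothetical; the activation is not stated).
ALPHA = 0.01


class ToyVPU:
    """Hypothetical vector unit that caches H2 locally instead of in the UB."""

    def __init__(self) -> None:
        # Model of the shared unified buffer; the cache below keeps H2 out of it.
        self.unified_buffer: Dict[str, List[float]] = {}
        # Tiny local cache holding the most recent H2 vector.
        self._h2_cache: Optional[List[float]] = None

    def forward_activation(self, z2: List[float]) -> List[float]:
        """Apply leaky ReLU elementwise and keep a private copy of H2."""
        h2 = [z if z > 0 else ALPHA * z for z in z2]
        self._h2_cache = list(h2)
        return h2

    def activation_derivative(self) -> List[float]:
        """Compute dH2/dZ2 elementwise from the cached H2, then free the cache."""
        if self._h2_cache is None:
            raise RuntimeError("forward_activation must run first")
        h2, self._h2_cache = self._h2_cache, None
        # With ALPHA > 0, leaky ReLU keeps the sign of its input: h > 0 iff z > 0.
        return [1.0 if h > 0 else ALPHA for h in h2]


if __name__ == "__main__":
    vpu = ToyVPU()
    print(vpu.forward_activation([2.0, -1.0, 0.5]))
    print(vpu.activation_derivative())
```

Because the cached values are read exactly once, the cache only has to hold a single vector and can be released as soon as the derivative is produced, which is what keeps it small in this model.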