ε, a Nuisance No More
For a while now I have been advocating for tuning ε in various parts of the modern deep learning stack, and in this post I’ll explain why.
Each time we calculate this, we increase or decrease the damping (ε) term by a multiplicative factor; see Section 6 of the KFAC paper [13] for a more detailed explanation.

One key thing to note is that Batch Norm is often applied channelwise to convolutional layers, so by increasing the ε in the denominator we are effectively temperature scaling, i.e. smoothing out, the differences in magnitude across activation channels.

In the end, I believe all of this evidence shows that optimizer hyperparameter tuning can have dramatic effects on model performance, so it should always be done with care and deliberation (and, more importantly, always be detailed in any resulting papers!). Two short sketches of the mechanisms above follow.
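First, the damping adjustment. Here is a minimal sketch of a Levenberg-Marquardt-style rule in the spirit of the multiplicative update described above: ε is divided by a constant factor when the local quadratic model predicted the actual loss decrease well, and multiplied by it when it did not. The function name, the factor `omega`, and the thresholds are illustrative assumptions, not the exact constants or API from the KFAC paper.

```python
def adjust_damping(eps, reduction_ratio, omega=1.5, low=0.25, high=0.75):
    """Levenberg-Marquardt-style damping update (illustrative sketch).

    `reduction_ratio` compares the actual decrease in loss to the decrease
    predicted by the local quadratic model. A large ratio means the model
    was trustworthy, so we can afford less damping; a small ratio means it
    over-promised, so we damp more.
    """
    if reduction_ratio > high:      # quadratic model predicted the step well
        eps = eps / omega
    elif reduction_ratio < low:     # quadratic model over-promised
        eps = eps * omega
    return eps                      # otherwise leave the damping unchanged


# Example: a poor quadratic fit (ratio 0.1) increases the damping.
eps = adjust_damping(1e-3, reduction_ratio=0.1)   # -> 1.5e-3
```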
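Second, the Batch Norm point. Below is a small NumPy sketch of channelwise normalization for a conv-style activation tensor in the usual (N, C, H, W) layout, with the ε sitting in the denominator. It is not any framework's BatchNorm implementation, and the channel scales and ε values are just illustrative, but it shows the effect: with a tiny ε the per-channel gains 1/√(var + ε) vary wildly and equalize every channel's scale, whereas a large ε caps those gains, so low-magnitude channels are no longer blown up and the denominator behaves like a temperature.

```python
import numpy as np

def batch_norm_channelwise(x, eps):
    """Centre and rescale each channel of x (layout N, C, H, W) by
    sqrt(var + eps); learned scale/shift parameters omitted for clarity."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Three channels with very different raw magnitudes.
x = rng.normal(size=(8, 3, 4, 4)) * np.array([0.01, 1.0, 10.0]).reshape(1, 3, 1, 1)

for eps in (1e-5, 1.0):
    y = batch_norm_channelwise(x, eps)
    print(eps, y.std(axis=(0, 2, 3)))
# eps = 1e-5: the per-channel gains differ by orders of magnitude and every
#   channel comes out with roughly unit scale.
# eps = 1.0: the gain is capped near 1/sqrt(eps), so the tiny channel is no
#   longer amplified ~100x; the large denominator acts like a temperature.
```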