Why Momentum Works (2017)
We often think of optimization with momentum as a ball rolling down a hill. This isn't wrong, but there is much more to the story.
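To make the picture concrete, here is a minimal sketch of one common form of the momentum update, written in Python with NumPy. A velocity z decays by a factor beta and accumulates the current gradient, and the iterate w moves against that velocity with step size alpha. The quadratic objective, the parameter values, and the name momentum_descent are illustrative assumptions, not taken from the article.

    import numpy as np

    def momentum_descent(grad, w0, alpha=0.02, beta=0.9, steps=200):
        # Heavy-ball style momentum: the velocity z accumulates past gradients.
        w = np.asarray(w0, dtype=float)
        z = np.zeros_like(w)
        for _ in range(steps):
            z = beta * z + grad(w)   # old velocity decays by beta, current gradient is added
            w = w - alpha * z        # the iterate moves against the accumulated velocity
        return w

    # Illustrative quadratic f(w) = 0.5 * w^T A w - b^T w, with gradient A w - b
    A = np.array([[3.0, 0.0], [0.0, 1.0]])
    b = np.array([1.0, 1.0])
    print(momentum_descent(lambda w: A @ w - b, w0=[0.0, 0.0]))  # approaches A^{-1} b = [1/3, 1]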
So starting at a simple initial point like 0 (by a gross abuse of language, let's think of this as a prior), we track the iterates until a desired level of complexity is reached. In fact, it has recently been suggested [12] that this noise is a good thing: it acts as an implicit regularizer which, like early stopping, prevents overfitting in the fine-tuning phase of optimization. A small illustrative sketch of this early-stopping picture follows below.

W. Su, S. Boyd, and E. Candes. A differential equation for modeling Nesterov's accelerated gradient method: Theory and insights. Advances in Neural Information Processing Systems.
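As a rough illustration of that early-stopping picture, the sketch below runs plain gradient descent on a synthetic least-squares problem, starting from the zero initial point and keeping the iterate with the lowest held-out error. The problem sizes, step size, and variable names are all assumptions made for the sketch; the number of steps taken plays the role of the regularization parameter.

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic least-squares data, assumed for the sketch: fewer samples than features
    n, d = 40, 100
    X = rng.normal(size=(n, d))
    w_true = np.zeros(d)
    w_true[:5] = 1.0
    y = X @ w_true + 0.5 * rng.normal(size=n)
    X_val = rng.normal(size=(n, d))
    y_val = X_val @ w_true + 0.5 * rng.normal(size=n)

    alpha = 1e-3
    w = np.zeros(d)                          # start at the simple initial point 0 (the "prior")
    best_w, best_err = w.copy(), np.inf
    for k in range(2000):
        w = w - alpha * X.T @ (X @ w - y)    # gradient step on the training loss
        val_err = np.mean((X_val @ w - y_val) ** 2)
        if val_err < best_err:               # keep the iterate with the best held-out error
            best_err, best_w = val_err, w.copy()

    # Early iterates are simple, late iterates fit the training noise;
    # stopping early acts like an explicit regularizer.
    print(best_err)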