Softmax forever, or why I like softmax
[UPDATE: Feb 8 2025] my amazing colleague Max Shen noticed a sign mistake in my derivation of the partial derivative of the log-harmonic function below.

i taught my first full-semester course on <Natural Language Processing with Distributed Representation> in fall 2015 (whoa, a decade ago!). you can find the lecture notes from this course at https://arxiv.org/abs/1511.07916.
in one of the lectures, David Rosenberg, who was teaching machine learning at NYU back then and had absolutely no reason other than kindness to sit in on my course, asked why we use softmax and whether it is the only way to turn unnormalized real values into a categorical distribution. although i could not pull out the answer immediately on the spot (you gotta give my 10-years-younger self a bit of slack; it was my first full course), there are a number of reasons why we like softmax. aren't they beautiful?

during the past few days, a paper was posted on arXiv claiming that a so-called harmonic formulation is better than softmax for training a neural net.
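to make the question concrete, here is a minimal sketch (not from the original lecture) of the usual softmax map from unnormalized real-valued scores to a categorical distribution; the function name and the max-subtraction trick for numerical stability are my own additions.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Map unnormalized real-valued scores to a categorical distribution.

    Subtracting the maximum logit before exponentiating leaves the result
    unchanged but avoids overflow when the scores are large.
    """
    shifted = logits - np.max(logits)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# e.g., three unnormalized scores -> probabilities that sum to 1
print(softmax(np.array([2.0, 1.0, -1.0])))
```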