NOTE: This StatQuest was supported by these awesome people who support StatQuest at the Double BAM level: Z. Rosenberg, S. Shah, J. N., J. Horn, J. Wong, I. Galic, H-S. Ming, D. Greene, D. Schioberg, C. Walker, G. Singh, L. Cisterna, J. Alexander, J. Varghese, K. Manickam, N. Fleming, F. Prado, J. Malone-Lee

Send me the note in pdf

Send me the note on entropy in pdf.

You are one of the greatest teaching geniuses I have ever come across.

Thank you very much Josh!

Hi Josh,

I am dissatisfied with your explanation for surprise being log(1/p) on the basis that it is 0 when p is 1 and infinity when p is 0. There are a zillion functions of p with those limits. Why is log(1/p) preferred over all of the others? Is it just a convention chosen to match the physics definition and because of the convenient mathematical properties of the log? Or is there something more to it?

If you want a more mathematically grounded explanation, I would highly recommend the original manuscript by Shannon: https://people.math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf
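One property that Shannon's paper leans on, and that most of the "zillion" other functions lack, is additivity: for independent events, the surprise of both happening equals the sum of the individual surprises, because log(1/(p·q)) = log(1/p) + log(1/q). Here is a minimal sketch of that check. The coin and die probabilities are made up for illustration, and `alt_surprise` is a hypothetical competitor (1/p - 1) chosen only because it has the same limits at p = 1 and p = 0.

```python
import math

def surprise(p: float) -> float:
    """Surprise of an event with probability p, in bits: log2(1/p)."""
    return math.log2(1.0 / p)

def alt_surprise(p: float) -> float:
    """Hypothetical alternative with the same limits: 0 at p = 1, unbounded as p -> 0."""
    return 1.0 / p - 1.0

# Illustrative assumption: a fair coin and a fair die, flipped/rolled independently.
p_heads = 0.5
p_six = 1.0 / 6.0
p_both = p_heads * p_six  # independence: joint probability is the product

# With log(1/p), the joint surprise matches the sum of the parts.
print("log:", surprise(p_both), "vs", surprise(p_heads) + surprise(p_six))

# With 1/p - 1, the joint value and the sum of the parts disagree.
print("alt:", alt_surprise(p_both), "vs", alt_surprise(p_heads) + alt_surprise(p_six))
```

The first pair of numbers agrees and the second does not, so matching the limits at 0 and 1 is not enough on its own. This is only a numerical illustration of one property, not a proof that the log is the unique choice; the uniqueness argument is in the linked manuscript.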

I believe it might have something to do with how simple the gradient calculation becomes when softmax is paired with the cross-entropy loss.
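Whether or not that is the reason for choosing the log, the simplification itself is easy to check: with a one-hot target, the gradient of the cross-entropy loss with respect to the raw inputs (logits) of softmax is just softmax(z) minus the target. Below is a rough sketch that compares that formula against a finite-difference estimate. The logits and target index are arbitrary example values, and all function names are made up for this illustration.

```python
import math

def softmax(z):
    """Convert a list of logits into probabilities (shifted by max for stability)."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(z, target):
    """Loss for a one-hot target: -log of the probability assigned to the target class."""
    return -math.log(softmax(z)[target])

def analytic_grad(z, target):
    """Claimed closed form: d(loss)/d(z_i) = softmax(z)_i - y_i, with y one-hot."""
    p = softmax(z)
    return [p_i - (1.0 if i == target else 0.0) for i, p_i in enumerate(p)]

def numeric_grad(z, target, eps=1e-6):
    """Central finite-difference estimate of the same gradient."""
    grads = []
    for i in range(len(z)):
        up = z[:i] + [z[i] + eps] + z[i + 1:]
        down = z[:i] + [z[i] - eps] + z[i + 1:]
        grads.append((cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))
    return grads

# Arbitrary example values, not taken from the video.
logits = [2.0, -1.0, 0.5]
target = 0

print("analytic:", analytic_grad(logits, target))
print("numeric: ", numeric_grad(logits, target))
```

The two printed lists should agree to several decimal places. The log in the loss cancels the exponential in softmax, which is why the gradient collapses to a simple difference.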