Entropy

Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

Entropy
Shannon Entropy
Cross Entropy
K-L divergence

Tips

The concept of entropy was first introduced for discrete distributions (called Shannon entropy), which is defined as $$H(X) = E\left[\log\left(\frac{1}{f(X)}\right)\right] = \sum_x f(x) \log\left(\frac{1}{f(x)}\right)$$ where $X$ stands for a discrete random variable (distribution) and $f(x)$ is the probability mass function of $X$. Shannon entropy is non-negative. It is zero if and only if the discrete distribution is degenerate (all mass is concentrated on one point).
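As a quick check of the definition, here is a minimal sketch that computes the Shannon entropy of a discrete distribution given as a list of probabilities. The function name `shannon_entropy` and the example distributions are illustrative choices, not part of the original notes.

```python
import math

def shannon_entropy(probs, base=math.e):
    """Entropy H(X) = sum_x f(x) * log(1 / f(x)) of a discrete distribution."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Outcomes with f(x) = 0 are skipped: p * log(1/p) tends to 0 as p -> 0.
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

# A uniform distribution on 4 points has entropy log(4) nats.
uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])
# A degenerate distribution (all mass on one point) has entropy 0.
degenerate = shannon_entropy([1.0, 0.0, 0.0])
```

Passing `base=2` gives the entropy in bits instead of nats.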
Shannon entropy is proven to be a lower bound on the average number of bits per symbol (with $\log_2(x)$ used instead of $\log(x)$) needed to transfer the output of a source to a destination through a communication channel without data loss.
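The bound can be illustrated with a small, made-up source whose symbol probabilities are powers of 1/2. In that special case a prefix-free code exists whose average length equals the entropy in bits. The symbols, probabilities, and code words below are assumptions chosen for the illustration.

```python
import math

# Hypothetical 4-symbol source and a prefix-free binary code for it.
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = {"a": "0", "b": "10", "c": "110", "d": "111"}

# Entropy in bits per symbol (log base 2).
entropy_bits = sum(p * math.log2(1.0 / p) for p in probs.values())
# Expected number of bits per symbol actually spent by the code.
avg_code_length = sum(probs[s] * len(code[s]) for s in probs)

print(entropy_bits, avg_code_length)
```

For probabilities that are not powers of 1/2, the average length of any lossless symbol code stays above the entropy rather than matching it.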
Shannon entropy has the same mathematical form as the Gibbs entropy in statistical thermodynamics (an area of physics), up to a constant factor.
Entropy is a useful metric for measuring the amount of “information” in features/variables in machine learning. It can be used to filter out non-useful features/variables, e.g., a discrete feature that is almost constant has an entropy close to zero.
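A minimal sketch of such a filter, assuming discrete (categorical) features stored as NumPy arrays. The toy data, the helper name `empirical_entropy`, and the cutoff `threshold` are all hypothetical. Note that this filter ignores the target variable: it only drops features that barely vary.

```python
import numpy as np

def empirical_entropy(values):
    """Shannon entropy (in nats) of the empirical distribution of a 1-D array."""
    _, counts = np.unique(values, return_counts=True)
    freqs = counts / counts.sum()
    return float((freqs * np.log(1.0 / freqs)).sum())

# Hypothetical toy data: three categorical features, eight samples each.
features = {
    "constant": np.array([1, 1, 1, 1, 1, 1, 1, 1]),
    "rare_flag": np.array([0, 0, 0, 0, 0, 0, 0, 1]),
    "balanced": np.array([0, 1, 0, 1, 0, 1, 0, 1]),
}

threshold = 0.1  # assumed cutoff in nats; tune it for the data at hand
entropies = {name: empirical_entropy(col) for name, col in features.items()}
kept = [name for name, h in entropies.items() if h > threshold]
```

Here only the constant feature falls below the cutoff, since it carries no information at all.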
The concept of entropy can be extended to continuous distributions (differential entropy). However, the entropy of a continuous distribution can be negative. As a matter of fact, the entropy of a continuous distribution has a range of $(-\infty, \infty)$. Taking the exponential distribution with the density function $\frac{1}{\mu}e^{-\frac{x}{\mu}}$ as an example, its entropy is $\log(\mu)+1$, which goes to $\infty$ as $\mu$ goes to $\infty$ and to $-\infty$ as $\mu$ goes to 0.
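The closed form $\log(\mu)+1$ for the exponential distribution can be checked numerically. The sketch below approximates $-\int f(x)\log f(x)\,dx$ with a midpoint sum. The function name, the truncation point `50 * mu`, and the grid size are assumptions made for the illustration.

```python
import numpy as np

def exponential_entropy_numeric(mu, n=200_000):
    """Approximate -integral of f*log(f) for f(x) = exp(-x/mu)/mu by a midpoint sum."""
    upper = 50.0 * mu                 # the tail mass beyond this point is negligible
    dx = upper / n
    x = (np.arange(n) + 0.5) * dx     # midpoints of n equal-width cells on [0, upper]
    f = np.exp(-x / mu) / mu
    return float(np.sum(-f * np.log(f)) * dx)

for mu in (0.01, 1.0, 100.0):
    closed_form = np.log(mu) + 1.0
    print(mu, exponential_entropy_numeric(mu), closed_form)
```

For small `mu` both values are negative, which is the behaviour described above.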
Because the entropy of a continuous distribution can be negative and is unbounded, entropy is not a good measure for continuous distributions. Cross-entropy and K-L divergence are more commonly used for both discrete and continuous distributions. The cross-entropy of a distribution $q$ with respect to $p$ is defined as $$H(p, q) = E_p[-\log(q)]$$ and the K-L divergence (also called relative entropy) is defined as $$D_{KL}(p, q) = E_p\left[\log\left(\frac{1}{q}\right) - \log\left(\frac{1}{p}\right)\right] = H(p, q) - H(p).$$ Notice that the K-L divergence is always non-negative.
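For discrete distributions, the two definitions translate directly into code. This is a minimal sketch with NumPy. The function names and the example distributions `p` and `q` are illustrative, and `q` is assumed to be positive wherever `p` is.

```python
import numpy as np

def entropy(p):
    """H(p) = E_p[-log p], skipping outcomes with zero probability."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log(p[nz])).sum())

def cross_entropy(p, q):
    """H(p, q) = E_p[-log q]; assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return float(-(p[nz] * np.log(q[nz])).sum())

def kl_divergence(p, q):
    """D_KL(p, q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence(p, q), kl_divergence(q, p), kl_divergence(p, p))
```

The divergence is positive for `p != q`, zero for `p == q`, and not symmetric in its two arguments.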
In a multi-class classification problem, the following are equivalent.
- minimizing the cross-entropy
- minimizing the K-L divergence
- maximizing the log likelihood of the corresponding multinomial distribution
- minimizing the negative log likelihood (NLL) of the corresponding multinomial distribution
The above conclusion suggests that the cross-entropy loss, K-L loss and the NLL loss are equivalent (the K-L divergence differs from the cross-entropy only by $H(p)$, which does not depend on the model). However, be aware that PyTorch defines the cross-entropy loss differently from the NLL loss. The cross-entropy loss in PyTorch is defined on the raw output of a neural network layer (logits), while the NLL loss is defined on the output of a log softmax layer (log-probabilities). This means that in PyTorch the cross-entropy loss is equivalent to log_softmax + nll_loss.
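The relation between the two PyTorch losses can be checked with a small sketch, assuming `torch` is installed. The random logits and targets are made up. The third quantity is written out by hand: it is the mean negative log-probability of the true class, which is both the cross-entropy against one-hot labels and the NLL of the corresponding multinomial distribution.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw scores: 4 samples, 3 classes
target = torch.tensor([0, 2, 1, 2])   # true class index for each sample

# Cross-entropy applied directly to the raw logits.
ce = F.cross_entropy(logits, target)
# NLL applied to log-probabilities produced by log_softmax.
nll = F.nll_loss(F.log_softmax(logits, dim=1), target)

# The same quantity by hand: mean of -log(probability of the true class).
log_probs = torch.log_softmax(logits, dim=1)
manual = -log_probs[torch.arange(4), target].mean()

print(ce.item(), nll.item(), manual.item())
```

All three values agree up to floating-point error. Feeding raw logits to `F.nll_loss`, or log-probabilities to `F.cross_entropy`, would silently compute something else.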

Misc

Fisher information explanation

Likelihood-based tests: LRT, Wald, score

Expected Fisher information

Observed Fisher information (sum, log, law of large numbers)

References

A Gentle Introduction to Cross-Entropy for Machine Learning
