Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Entropy and Related Concepts¶
- Entropy
- Shannon Entropy
- Cross Entropy
- K-L divergence
Tips¶
The entropy concept was first introduced for discrete distributions (called Shannon entropy), which is defined as $H(X) = -\sum_x f(x) \log f(x)$, where $f$ is the probability mass function. Shannon entropy is non-negative. It is zero if and only if the discrete distribution is degenerate (all mass concentrated on one point).
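As a quick sanity check, the definition above can be computed directly with numpy (a minimal sketch, using natural log, so the result is in nats):

```python
import numpy as np

def shannon_entropy(p):
    """Shannon entropy H(X) = -sum_x f(x) log f(x), in nats."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # by convention, 0 * log 0 = 0
    return float(np.sum(p * np.log(1.0 / p)))

# Uniform over 4 outcomes has maximal entropy log(4);
# a degenerate distribution has entropy 0.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))  # ≈ 1.3863 (= log 4)
print(shannon_entropy([1.0, 0.0, 0.0]))           # 0.0
```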
Shannon entropy is proven to be a lower bound on the number of bits per symbol (when $\log_2$ is used instead of $\ln$) required to transfer identifiable information from a source to a destination through a communication channel without data loss (the source coding theorem).
Shannon entropy is equivalent to entropy in thermodynamics (an area of physics).
Entropy is a good metric for measuring the amount of "information" carried by features/variables in machine learning. It can be used to filter out uninformative features/variables.
The entropy concept can be extended to continuous distributions (differential entropy). However, the entropy of a continuous distribution can be negative. As a matter of fact, the entropy of a continuous distribution has a range of $(-\infty, +\infty)$. Taking the exponential distribution with the density function $f(x) = \lambda e^{-\lambda x}$ as an example, its entropy is $1 - \ln \lambda$, which goes to $-\infty$ as $\lambda$ goes to $+\infty$, and it goes to $+\infty$ as $\lambda$ goes to 0.
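The exponential example can be verified numerically by integrating $-\int f(x) \ln f(x)\, dx$ on a fine grid (a minimal numpy sketch; the grid bounds and resolution are arbitrary choices):

```python
import numpy as np

def expon_entropy_numeric(lam, tail=50.0, n=200_000):
    """Numerically approximate -∫ f ln f dx for f(x) = λ exp(-λx)."""
    # Truncate the integral at x = tail/λ, where the density is negligible.
    x = np.linspace(1e-12, tail / lam, n)
    f = lam * np.exp(-lam * x)
    y = -f * np.log(f)
    # Trapezoidal rule, written out to avoid version-specific numpy APIs.
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

for lam in [0.1, 1.0, 10.0]:
    closed_form = 1.0 - np.log(lam)
    print(lam, closed_form, expon_entropy_numeric(lam))
```

For $\lambda = 10$ the entropy is negative, while for $\lambda = 0.1$ it is positive, illustrating that differential entropy has no fixed sign.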
For this reason, entropy is not a good measure for continuous distributions. Cross-entropy and K-L divergence are more commonly used, for both discrete and continuous distributions. The cross-entropy of a distribution $q$ with respect to $p$ is defined as $$H(p, q) = -\sum_x p(x) \log q(x),$$ and the K-L divergence of $q$ from $p$ is $$D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = H(p, q) - H(p).$$ Notice that the K-L divergence is always non-negative.
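Both quantities and the identity $D_{KL}(p \,\|\, q) = H(p, q) - H(p)$ can be checked numerically (a minimal numpy sketch for discrete distributions; the example vectors are made up):

```python
import numpy as np

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log q(x), in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(-np.sum(p[mask] * np.log(q[mask])))

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p); non-negative, zero iff p == q."""
    p = np.asarray(p, float)
    mask = p > 0
    entropy_p = float(-np.sum(p[mask] * np.log(p[mask])))
    return cross_entropy(p, q) - entropy_p

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence(p, q))  # strictly positive since p != q
print(kl_divergence(p, p))  # exactly 0 when q == p
```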
In a multi-class classification problem, the following are equivalent.
- minimizing the cross-entropy
- minimizing the K-L divergence
- maximizing the log likelihood of the corresponding multinomial distribution
- minimizing the negative log likelihood (NLL) of the corresponding multinomial distribution
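The equivalence above rests on the fact that the average NLL of the observed labels equals the cross-entropy between the empirical label distribution and the model, so minimizing one minimizes the other. A quick numerical check (the model distribution and labels here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
q = np.array([0.6, 0.3, 0.1])           # hypothetical model distribution
labels = rng.integers(0, 3, size=1000)  # simulated observed class labels

# Average negative log likelihood under the categorical/multinomial model.
avg_nll = -np.mean(np.log(q[labels]))

# Empirical distribution of the labels.
p_emp = np.bincount(labels, minlength=3) / len(labels)

# Cross-entropy H(p_emp, q) = -sum_x p_emp(x) log q(x).
cross_ent = -np.sum(p_emp * np.log(q))

print(avg_nll, cross_ent)  # identical up to floating-point error
```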
The above conclusion suggests that the cross-entropy loss, the K-L loss, and the NLL loss are equivalent. However, be aware that PyTorch defines the cross-entropy loss differently from the NLL loss. The cross-entropy loss in PyTorch is defined on the raw output (logits) of a neural network layer, while the NLL loss is defined on the output of a log-softmax layer. This means that in PyTorch the cross-entropy loss is equivalent to log_softmax + nll_loss.
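This relationship can be sketched in plain numpy (a minimal reimplementation mimicking the behavior of PyTorch's `CrossEntropyLoss`, log-softmax, and `NLLLoss`; this is not PyTorch code itself):

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax along the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def nll_loss(log_probs, targets):
    # Mean negative log likelihood of the target classes.
    return float(-log_probs[np.arange(len(targets)), targets].mean())

def cross_entropy_loss(logits, targets):
    # Mirrors PyTorch's convention: defined directly on raw logits,
    # i.e. cross_entropy = nll_loss ∘ log_softmax.
    return nll_loss(log_softmax(logits), targets)

logits = np.array([[2.0, 0.5, -1.0], [0.1, 1.2, 0.3]])
targets = np.array([0, 2])
print(cross_entropy_loss(logits, targets))
print(nll_loss(log_softmax(logits), targets))  # same value
```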
Misc¶
- Fisher information explanation
- likelihood-based tests: LRT, Wald, score
- expected Fisher information
- observed Fisher information (sum, log, law of large numbers)