Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!

  1. Entropy

  2. Shannon Entropy

  3. Cross Entropy

  4. K-L divergence

Tips

  1. The entropy concept was first introduced for discrete distributions (called Shannon entropy), which is defined as $H(X) = E[\log(\frac{1}{f(x)})]$, where $X$ stands for a discrete random variable (distribution) and $f(x)$ is the probability mass function of $X$. Shannon entropy is non-negative. It is zero if and only if the discrete distribution is degenerate (all mass concentrates on one point).

  2. Shannon entropy is proven to be the lower bound on the average number of bits per symbol ($\log_2(x)$ is used instead of $\log(x)$) needed to transfer identifiable information from a source to a destination through a communication channel without data loss.

  3. Shannon entropy has the same mathematical form as entropy in statistical thermodynamics (an area of physics), up to a constant factor.

  4. Entropy is a good metric to measure the magnitude of “information” in features/variables in machine learning. It can be used to filter out non-useful features/variables.

  5. The entropy concept can be extended to continuous distributions. However, the entropy of a continuous distribution can be negative. As a matter of fact, the entropy of a continuous distribution has a range of $(-\infty, \infty)$. Taking the exponential distribution with the density function $\frac{1}{\mu}e^{-\frac{x}{\mu}}$ as an example, its entropy is $\log(\mu)+1$, which goes to $\infty$ as $\mu$ goes to $\infty$ and goes to $-\infty$ as $\mu$ goes to 0.

  6. For the reason in bullet point 5, entropy is not a good measure for continuous distributions. Cross-entropy and K-L divergence are more commonly used for both discrete and continuous distributions. The cross-entropy of a distribution q with respect to p is defined as $H(p, q) = E_p[-\log(q)]$, and the K-L divergence (also called relative entropy) is defined as $D_{KL}(p, q) = E_p[\log(\frac{1}{q}) - \log(\frac{1}{p})] = H(p, q) - H(p)$. Notice that the K-L divergence is always non-negative.

  7. In a multi-class classification problem, the following are equivalent.

    • minimizing the cross-entropy

    • minimizing the K-L divergence

    • maximizing the log likelihood of the corresponding multinomial distribution

    • minimizing the negative log likelihood (NLL) of the corresponding multinomial distribution

    The above conclusion suggests that the cross-entropy loss, K-L loss and the NLL loss are equivalent. However, be aware that PyTorch defines cross-entropy loss to be different from the NLL loss. The cross-entropy loss in PyTorch is defined on the raw output of a neural network layer while the NLL loss is defined on the output of a log softmax layer. This means that in PyTorch the cross-entropy loss is equivalent to log_softmax + nll_loss.
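The definitions and the equivalence above can be checked numerically. The following is a minimal sketch in plain Python, not PyTorch code: the helper names (`entropy`, `cross_entropy`, `kl_divergence`, `log_softmax`, `nll`) and the example numbers are assumptions made for illustration. It uses the natural logarithm, so values are in nats.

```python
import math

def entropy(p):
    """Shannon entropy H(p) = E_p[-log p]; terms with p_i = 0 contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """Cross-entropy H(p, q) = E_p[-log q]."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """K-L divergence D_KL(p, q) = E_p[log(1/q) - log(1/p)]."""
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

def log_softmax(logits):
    """Numerically stable log softmax of a list of raw scores."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_norm for z in logits]

def nll(log_probs, target):
    """Negative log likelihood of the true class index."""
    return -log_probs[target]

# Hypothetical 3-class example: raw network outputs and true class index 1.
logits = [2.0, 0.5, -1.0]
target = 1
one_hot = [0.0, 1.0, 0.0]                       # "true" distribution p
q = [math.exp(v) for v in log_softmax(logits)]  # model distribution q

ce = cross_entropy(one_hot, q)
kl = kl_divergence(one_hot, q)
loss = nll(log_softmax(logits), target)

# For a soft target, the identity D_KL(p, q) = H(p, q) - H(p) still holds.
soft = [0.2, 0.5, 0.3]
gap = kl_divergence(soft, q) - (cross_entropy(soft, q) - entropy(soft))

print(ce, kl, loss, gap)
```

With a one-hot target, $H(p) = 0$, so the cross-entropy, the K-L divergence and the NLL of the log softmax output coincide. In PyTorch, the analogous calls would be `torch.nn.functional.cross_entropy` on the raw outputs versus `torch.nn.functional.log_softmax` followed by `torch.nn.functional.nll_loss`.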

Misc

Fisher information explanation

likelihood-based tests: LRT, Wald, score

expected Fisher information

observed Fisher information (sum, log, law of large numbers)
