Things on this page are fragmentary and immature notes/thoughts of the author. Please read with your own judgement!
Entropy and Related Concepts
 Entropy
 Shannon Entropy
 Cross Entropy
 KL divergence
Tips

The entropy concept was first introduced for discrete distributions (called Shannon entropy), which is defined as
$$H(X) = E\left[\log\left(\frac{1}{f(X)}\right)\right]$$where \(X\) stands for a discrete random variable (distribution) and \(f\) is the probability mass function of \(X\). Shannon entropy is nonnegative. It is zero if and only if the discrete distribution is degenerate (all mass is concentrated on one point).
Shannon entropy (computed with \(\log_2\) instead of the natural \(\log\)) is proven to be the lower bound on the number of bits per symbol needed to transfer identifiable information from a source to a destination through a communication channel without data loss.
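As a quick sanity check, Shannon entropy in bits can be computed directly from the definition. This is a minimal pure-Python sketch (the function name is mine): a fair coin needs exactly 1 bit per symbol, while a degenerate distribution needs 0.

```python
import math

def shannon_entropy(probs, base=2):
    """Shannon entropy of a discrete distribution; bits when base=2."""
    # Terms with p == 0 contribute 0 by convention (lim p*log(1/p) = 0).
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

print(shannon_entropy([0.5, 0.5]))        # 1.0 (fair coin: 1 bit/symbol)
print(shannon_entropy([1.0]))             # 0.0 (degenerate distribution)
print(shannon_entropy([0.25] * 4))        # 2.0 (uniform over 4 symbols)
```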

Shannon entropy is equivalent (up to a constant factor) to entropy in thermodynamics (an area of physics).

Entropy is a good metric to measure the magnitude of "information" in features/variables in machine learning. It can be used to filter out nonuseful features/variables.
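To illustrate the filtering idea, one can compute the empirical entropy of each discrete feature column and drop near-constant columns, which carry little information. A minimal sketch (the function name and data are mine):

```python
import math
from collections import Counter

def feature_entropy(values):
    """Empirical Shannon entropy (bits) of a discrete feature column."""
    n = len(values)
    counts = Counter(values)
    # H = sum over observed levels of (c/n) * log2(n/c)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

informative   = ["a", "b", "c", "a", "b", "c", "a", "b"]
near_constant = ["x", "x", "x", "x", "x", "x", "x", "y"]
# The near-constant feature has much lower entropy and is a candidate to drop.
print(feature_entropy(informative))    # ≈ 1.56 bits
print(feature_entropy(near_constant))  # ≈ 0.54 bits
```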

The entropy concept can be extended to continuous distributions (where it is called differential entropy). However, the entropy of a continuous distribution can be negative. As a matter of fact, the entropy of a continuous distribution has a range of \((-\infty, \infty)\). Taking the exponential distribution with the density function \(\frac{1}{\mu}e^{-\frac{x}{\mu}}\) as an example, its entropy is \(\log(\mu)+1\), which goes to \(\infty\) as \(\mu\) goes to \(\infty\) and goes to \(-\infty\) as \(\mu\) goes to 0.
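The closed form \(\log(\mu)+1\) can be checked numerically by approximating \(-\int f(x)\log f(x)\,dx\). A minimal sketch using the midpoint rule (the function name and grid parameters are mine):

```python
import math

def exp_entropy(mu, n=200000):
    """Numerically approximate the differential entropy -∫ f log f dx
    of the exponential distribution with density f(x) = (1/mu) exp(-x/mu)."""
    upper = 50 * mu                  # tail beyond 50*mu is negligible
    dx = upper / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * dx           # midpoint rule
        f = math.exp(-x / mu) / mu
        total += -f * math.log(f) * dx
    return total

# Closed form is log(mu) + 1: negative for small mu, positive for large mu.
print(exp_entropy(0.1))   # ≈ log(0.1) + 1 ≈ -1.30
print(exp_entropy(10.0))  # ≈ log(10) + 1 ≈ 3.30
```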

For the reason above, entropy is not a good measure for continuous distributions. Cross entropy and KL divergence are more commonly used for both discrete and continuous distributions. The cross entropy of a distribution \(q\) with respect to \(p\) is defined as
$$H(p, q) = E_p\left[\log\left(\frac{1}{q}\right)\right]$$And the KL divergence (also called relative entropy) is defined as$$D_{KL}(p \| q) = E_p\left[\log\left(\frac{1}{q}\right) - \log\left(\frac{1}{p}\right)\right] = H(p, q) - H(p)$$Notice that the KL divergence is always nonnegative.
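For discrete distributions the two definitions above reduce to finite sums, and the identity \(D_{KL}(p \| q) = H(p, q) - H(p)\) can be verified directly. A minimal pure-Python sketch (function names are mine; entropies in nats):

```python
import math

def cross_entropy(p, q):
    """H(p, q) = E_p[log(1/q)] for discrete distributions, in nats."""
    return sum(pi * math.log(1.0 / qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p); nonnegative, and 0 iff p == q."""
    return cross_entropy(p, q) - cross_entropy(p, p)  # H(p, p) = H(p)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
print(kl_divergence(p, q))  # positive, since p != q
print(kl_divergence(p, p))  # 0.0
```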
In a multiclass classification problem, the following are equivalent.
 minimizing the cross entropy
 minimizing the KL divergence
 maximizing the log likelihood of the corresponding multinomial distribution
 minimizing the negative log likelihood (NLL) of the corresponding multinomial distribution
The above conclusion suggests that the cross entropy loss, the KL loss and the NLL loss are equivalent. However, be aware that PyTorch defines the cross entropy loss differently from the NLL loss: the cross entropy loss in PyTorch is defined on the raw output (logits) of a neural network layer, while the NLL loss is defined on the output of a log softmax layer. This means that in PyTorch the cross entropy loss is equivalent to log_softmax + nll_loss.
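The identity can be demonstrated without PyTorch. Below is a pure-Python sketch (function names are mine) of what `log_softmax + nll_loss` computes on raw logits, which is exactly the cross entropy loss for a single example:

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of raw scores."""
    m = max(logits)  # subtract the max before exponentiating for stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def nll_loss(log_probs, target):
    """Negative log likelihood of the target class."""
    return -log_probs[target]

def cross_entropy_loss(logits, target):
    """Cross entropy on raw logits = nll_loss(log_softmax(logits), target)."""
    return nll_loss(log_softmax(logits), target)

logits, target = [2.0, 0.5, -1.0], 0
# Equals -log(softmax(logits)[target]), i.e. -log of the predicted probability.
print(cross_entropy_loss(logits, target))
```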
Misc
Fisher information explanation
likelihood-based tests: LRT, Wald, score
expected Fisher information
observed Fisher information (sum, log, law of large numbers)