Entropy
Suppose you are playing a number-guessing game: you have to guess a natural number between 1 and 100. If I tell you that the number is not 1, you
still have 99 possible choices, so this clue is not very valuable; in other words, the information content
of this clue is low. However, if I tell you that the number is divisible by 11, the number of choices drops to only 9, and thus the information
content of this clue is much higher.
This example illustrates that information content (or self-information) measures the "uncertainty" associated with an
event. The lower the probability of an event \(p(E)\), the higher its information content, \(I(E)\). So, we can define the information content as follows:
\[
I(E) = - \log p(E) \geq 0.
\]
Note: If two events \(A\) and \(B\) are independent, the information content is additive:
\[
\begin{align*}
I(A, B) &= - \log(p(A)\cdot p(B)) \\\\
&= - (\log p(A) + \log p(B)) \\\\
&= I(A) + I(B).
\end{align*}
\]
This is the main reason we use the logarithm: it converts a product of probabilities into a sum of information contents.
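To make the guessing-game example concrete, here is a minimal sketch in Python (assuming a uniform prior over the integers 1 to 100; the function name is just for illustration):

```python
import math

def information_content(p_event: float) -> float:
    """Self-information I(E) = -log2 p(E), measured in bits."""
    return -math.log2(p_event)

# Uniform prior over the integers 1..100.
p_not_one = 99 / 100    # clue: "the number is not 1"
p_div_11 = 9 / 100      # clue: "the number is divisible by 11" (11, 22, ..., 99)

print(information_content(p_not_one))  # ~0.0145 bits: barely informative
print(information_content(p_div_11))   # ~3.47 bits: much more informative
```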
We extend this concept by quantifying the "expected value" of the information content across all possible outcomes of a
discrete random variable \(X\).
The entropy of a discrete random variable \(X\) with distribution \(p\) over
\(n\) states is defined as:
\[
\begin{align*}
\mathbb{H}(X) &= - \sum_{k=1}^n p(X = k) \log_2 p(X = k) \\\\
&= \mathbb{E }_X [- \log_2 p(X)]
\end{align*}
\]
Note: The choice of logarithmic base determines the units of entropy. With base 2, as above, the unit is bits.
We also often use the natural logarithm (base \(e\)), in which case the unit is nats.
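As a quick numerical illustration (a sketch, not part of the definition above), the following computes the entropy in bits of a fair coin and of a biased coin:

```python
import numpy as np

def entropy(p) -> float:
    """Shannon entropy H(X) = -sum_k p_k log2 p_k in bits, with 0 log 0 := 0."""
    p = np.asarray(p, dtype=float)
    nz = p[p > 0]                      # skip zero-probability states
    return float(-np.sum(nz * np.log2(nz)))

print(entropy([0.5, 0.5]))   # 1.0 bit: maximal uncertainty over two states
print(entropy([0.9, 0.1]))   # ~0.469 bits: a biased coin is more predictable
print(entropy([1.0, 0.0]))   # 0.0 bits: no uncertainty at all
```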
Joint Entropy
The joint entropy between two random variables \(X\) and \(Y\) is defined as
\[
\mathbb{H}(X, Y) = - \sum_{x, y} p(x, y) \log p(x, y).
\]
If \(X\) and \(Y\) are independent, \(\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y)\). Otherwise, \(\mathbb{H}(X) + \mathbb{H}(Y)\) is strictly
larger than \(\mathbb{H}(X, Y)\). Also, if \(Y\) is a deterministic function of \(X\), then \(\mathbb{H}(X, Y) = \mathbb{H}(X)\). Thus,
\[
\mathbb{H}(X) + \mathbb{H}(Y) \geq \mathbb{H}(X, Y) \geq \max\{\mathbb{H}(X), \mathbb{H}(Y)\} \geq 0.
\]
These bounds also hold for more than two random variables.
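A small numerical check of these bounds, using a made-up 2x2 joint distribution (the table is purely illustrative):

```python
import numpy as np

def entropy(p) -> float:
    """Entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# Hypothetical joint distribution p(x, y); rows index X, columns index Y.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1)   # marginal of X
p_y = p_xy.sum(axis=0)   # marginal of Y

H_xy, H_x, H_y = entropy(p_xy), entropy(p_x), entropy(p_y)
print(H_x + H_y >= H_xy >= max(H_x, H_y) >= 0)   # True
```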
Conditional Entropy
The conditional entropy of \(Y\) given \(X\) is the uncertainty we have in \(Y\) after
seeing \(X\), averaged over possible values for \(X\):
\[
\mathbb{H}(Y | X) = \mathbb{E }_{p(X)}[ \mathbb{H}(p(Y | X))]
\]
We can show that \(\mathbb{H}(Y | X) = \mathbb{H}(X, Y) - \mathbb{H}(X)\):
\[
\begin{align*}
\mathbb{H}(Y | X) &= \mathbb{E }_{p(X)}[ \mathbb{H}(p(Y | X))] \\\\
&= \sum_{x} p(x) \mathbb{H}(p(Y | X = x)) \\\\
&= - \sum_{x} p(x) \sum_{y} p(y | x) \log p(y | x) \\\\
&= - \sum_{x, y} p(x, y) \log p(y | x) \\\\
&= - \sum_{x, y} p(x, y) \log \frac{p(x,y)}{p(x)}\\\\
&= - \sum_{x, y} p(x, y) \log p(x,y) + \sum_{x, y} p(x, y) \log p(x) \\\\
&= - \sum_{x, y} p(x, y) \log p(x,y) + \sum_{x} p(x) \log p(x) \\\\
&= \mathbb{H}(X, Y) - \mathbb{H}(X) \tag{1}
\end{align*}
\]
Note: If \(Y\) is completely determined by \(X\), then \(\mathbb{H}(Y | X) = 0\).
Also, if \(X\) and \(Y\) are independent of each other, \(\mathbb{H}(Y | X) = \mathbb{H}(Y) \).
By (1), \(\mathbb{H}(X, Y) = \mathbb{H}(X) + \mathbb{H}(Y | X) \). This implies the chain rule of entropy:
\[
\mathbb{H}(X_1, X_2, \cdots, X_n) = \sum_{i =1}^n \mathbb{H}(X_i |X_1, \cdots, X_{i-1}).
\]
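A minimal sketch verifying identity (1), \(\mathbb{H}(Y | X) = \mathbb{H}(X, Y) - \mathbb{H}(X)\), on the same kind of hypothetical joint table:

```python
import numpy as np

def entropy(p) -> float:
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# Hypothetical joint distribution p(x, y); rows index X, columns index Y.
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1)

# Direct definition: H(Y|X) = sum_x p(x) H(Y | X = x).
H_y_given_x = sum(px * entropy(row / px) for px, row in zip(p_x, p_xy))

# Identity (1): H(Y|X) = H(X, Y) - H(X).
print(np.isclose(H_y_given_x, entropy(p_xy) - entropy(p_x)))   # True
```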
Cross Entropy
The cross entropy of a distribution \(q\) relative to a distribution \(p\) is defined as
\[
\mathbb{H}(p, q) = - \sum_{k=1}^n p_k \log q_k.
\]
In machine learning, especially for classification problems and neural networks, we often use the
cross entropy as a loss function. It measures the performance of a model whose output is a probability distribution:
it penalizes the mismatch between the true labels and the predicted probabilities.
Generally, the smaller the cross-entropy loss, the better the predictions.
For example, consider binary classification (\(n = 2\)). The true label \(y \in \{0, 1\}\) is represented by
the distribution \(p = [y, 1-y]\), and the predicted probability \(\hat{y} \in (0, 1)\) by the Bernoulli
distribution \(q = [\hat{y}, 1 - \hat{y}]\). Then we obtain the cross-entropy loss function:
\[
\begin{align*}
\mathcal{L}(y, \hat{y}) &= - [y \log (\hat{y}) + (1 - y) \log (1 - \hat{y})] \\\\
&= - y \log (\hat{y}) - (1 - y) \log (1 - \hat{y})
\end{align*}
\]
Let's say the true label is 1; then the loss function becomes \(- \log (\hat{y})\).
If \(\hat{y}\) is close to 1 (the model is confident and correct), the penalty is small. On the other hand,
if \(\hat{y}\) is close to 0 (the model is confident and wrong), the penalty is large. A higher penalty (loss)
indicates that the model needs to improve, and it forces the optimization algorithm to adjust the parameters to reduce future error.
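Here is the binary cross-entropy loss above as a short sketch (the clipping constant `eps` is a common numerical safeguard against \(\log 0\), not part of the definition):

```python
import numpy as np

def binary_cross_entropy(y: float, y_hat: float, eps: float = 1e-12) -> float:
    """L(y, y_hat) = -[y log(y_hat) + (1 - y) log(1 - y_hat)]."""
    y_hat = np.clip(y_hat, eps, 1 - eps)   # keep the prediction away from 0 and 1
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# True label is 1: confident-and-correct vs. confident-and-wrong predictions.
print(binary_cross_entropy(1, 0.99))   # ~0.01 -> small penalty
print(binary_cross_entropy(1, 0.01))   # ~4.61 -> large penalty
```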
KL Divergence (Relative Entropy, Information Gain)
Ideally, we would define a distance metric to measure how similar two distributions \(p\) and \(q\) are. Instead,
however, we can define a divergence measure \(D(p,q)\), which only requires \(D(p, q) \geq 0\) with
equality if and only if \(p = q\). Here we introduce the Kullback-Leibler (KL) divergence between two
distributions \(p\) and \(q\):
\[
\begin{align*}
D_{\mathbb{KL}}(p \| q) &= \sum_{k=1}^n p_k \log \frac{p_k}{q_k} \\\\
&= \sum_{k=1}^n p_k \log p_k - \sum_{k=1}^n p_k \log q_k \\\\
&= - \mathbb{H}(p) + \mathbb{H}(p, q) \tag{2}
\end{align*}
\]
where \(p\) and \(q\) are defined on the same sample space \(\mathcal{X}\).
For example, \(D_{\mathbb{KL}}(p \| q)\) can measure how much an estimated distribution \(q\) is different from
a true distribution \(p\). If the estimation of \(p\) by \(q\) is "good," the KL divergence will be close to zero.
By the expression (2), the cross entropy can be written as:
\[
\mathbb{H}(p, q) = \mathbb{H}(p) + D_{\mathbb{KL}}(p \| q).
\]
So, if \(p\) is fixed, minimizing the cross entropy is equivalent to minimizing the KL divergence.
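A quick numerical check of expression (2), \(\mathbb{H}(p, q) = \mathbb{H}(p) + D_{\mathbb{KL}}(p \| q)\), with two made-up distributions (natural log, so the values are in nats):

```python
import numpy as np

def cross_entropy(p, q) -> float:
    """H(p, q) = -sum_k p_k log q_k (in nats)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log(q[mask])))

def entropy(p) -> float:
    return cross_entropy(p, p)

def kl_divergence(p, q) -> float:
    """D_KL(p || q) = sum_k p_k log(p_k / q_k)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # estimated distribution

print(np.isclose(cross_entropy(p, q), entropy(p) + kl_divergence(p, q)))   # True
```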
Theorem 1: Non-negativity of KL Divergence
\[
D_{\mathbb{KL}}(p \| q) \geq 0
\]
where \(p\) and \(q\) are discrete distributions defined on the same sample space \(\mathcal{X}\) and with
equality if and only if \(p = q\).
Or, the entropy of \(p\) is less than or equal to its cross entropy with any other distribution \(q\):
\[
\begin{align*}
&\sum_{k=1}^n p_k \log p_k - \sum_{k=1}^n p_k \log q_k \geq 0 \\\\
&\Longrightarrow - \sum_{k=1}^n p_k \log p_k \leq - \sum_{k=1}^n p_k \log q_k.
\end{align*}
\]
This is called Gibbs' inequality.
There are many ways to prove this theorem; here we use the log sum inequality (Theorem 2) to prove Theorem 1.
Theorem 2: Log Sum Inequality
For nonnegative numbers \(a_1, a_2, \cdots, a_n\) and \(b_1, b_2, \cdots, b_n\),
\[
\sum_{k=1}^n a_k \log \frac{a_k}{b_k} \geq \Big(\sum_{k=1}^n a_k \Big) \log \frac{\sum_{k=1}^n a_k}{\sum_{k=1}^n b_k} \tag{3}
\]
with equality if and only if \(\frac{a_k}{b_k}\) is constant.
Proof of Theorem 2:
Assume without loss of generality that \(a_k >0\) and \(b_k >0\). Consider the function \(f(x) = x \log x\). Since
\(f'(x) = \log x + 1\) and \(f''(x) = \frac{1}{x} > 0\), the function is convex for all \(x > 0\).
Let \(\lambda_k = \frac{b_k}{\sum_{j=1}^n b_j}\); then the left-hand side of inequality (3) becomes
\[
\begin{align*}
\sum_{k=1}^n a_k \log \frac{a_k}{b_k}
&= \sum_{k=1}^n b_k \frac{a_k}{b_k} \log \frac{a_k}{b_k} \\\\
&= \sum_{k=1}^n b_k f\Big(\frac{a_k}{b_k}\Big) \\\\
&= \Big(\sum_{k=1}^n b_k \Big) \sum_{k=1}^n \lambda_k f\Big(\frac{a_k}{b_k}\Big)
\end{align*}
\]
Here, by Jensen's inequality (Theorem 3 below),
\[
\begin{align*}
\Big(\sum_{k=1}^n b_k \Big) \sum_{k=1}^n \lambda_k f(\frac{a_k}{b_k})
&\geq \Big(\sum_{k=1}^n b_k \Big) f\Big(\sum_{k=1}^n \lambda_k \frac{a_k}{b_k} \Big) \\\\
&= \Big(\sum_{k=1}^n b_k \Big) f\Big( \frac{\sum_{k=1}^n a_k}{\sum_{k=1}^n b_k} \Big) \\\\
&= \Big(\sum_{k=1}^n a_k \Big) \log \frac{\sum_{k=1}^n a_k}{\sum_{k=1}^n b_k}
\end{align*}
\]
Therefore,
\[
\sum_{k=1}^n a_k \log \frac{a_k}{b_k} \geq \Big(\sum_{k=1}^n a_k \Big) \log \frac{\sum_{k=1}^n a_k}{\sum_{k=1}^n b_k}.
\]
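A quick numerical sanity check of inequality (3), with arbitrary positive numbers (natural log):

```python
import numpy as np

a = np.array([2.0, 3.0, 5.0])
b = np.array([1.0, 4.0, 2.0])

lhs = np.sum(a * np.log(a / b))
rhs = np.sum(a) * np.log(np.sum(a) / np.sum(b))
print(lhs >= rhs)   # True

# Equality holds when a_k / b_k is constant, e.g. a = 2 * b:
a2 = 2 * b
print(np.isclose(np.sum(a2 * np.log(a2 / b)),
                 np.sum(a2) * np.log(np.sum(a2) / np.sum(b))))   # True
```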
Theorem 3: Jensen's inequality
If \(\phi\) is convex on an open interval \(I\), and \(X\) is a random variable whose support is
contained in \(I\) and whose expected value is finite, then
\[
\phi (\mathbb{E }[X]) \leq \mathbb{E }[\phi (X)].
\]
Note: The support of a discrete random variable \(X\) is the set of points in the space of \(X\) that have
positive probability.
Note: In the above proof, we used the following alternative finite form:
\[
f\Big(\sum_{k=1}^n \lambda_k x_k \Big) \leq \sum_{k=1}^n \lambda_k f(x_k)
\]
where \(\lambda_k \geq 0\) and \(\sum_{k=1}^n \lambda_k = 1\).
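A tiny numerical check of this finite form, with the convex function \(f(x) = x \log x\) used in the proof above (the weights and points are arbitrary):

```python
import numpy as np

def f(x):
    """The convex function f(x) = x log x from the proof of Theorem 2."""
    return x * np.log(x)

x = np.array([0.5, 2.0, 3.0])        # points in (0, infinity)
lam = np.array([0.2, 0.3, 0.5])      # nonnegative weights summing to 1

print(f(np.sum(lam * x)) <= np.sum(lam * f(x)))   # True
```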
Finally, we can easily prove the non-negativity of the KL divergence using the log sum inequality.
Proof of Theorem 1:
Suppose \(\sum_{k=1}^n p_k = \sum_{k=1}^n q_k = 1\) and \(p_k, \, q_k \geq 0\).
By the definition of the KL divergence,
\[
D_{\mathbb{KL}}(p \| q) = \sum_{k=1}^n p_k \log \frac{p_k}{q_k}.
\]
Using the log sum inequality with \(a_k = p_k\) and \(b_k = q_k\),
\[
\sum_{k=1}^n p_k \log \frac{p_k}{q_k} \geq \Big(\sum_{k=1}^n p_k \Big) \log \frac{\sum_{k=1}^n p_k}{\sum_{k=1}^n q_k}.
\]
Since \(\sum_{k=1}^n p_k = \sum_{k=1}^n q_k = 1\),
\[
\sum_{k=1}^n p_k \log \frac{p_k}{q_k} \geq \Big(\sum_{k=1}^n p_k \Big) \log 1 = 0
\]
with equality if and only if \(\forall k, \, p_k = q_k\).
Again, the KL divergence is NOT a metric: it is not symmetric and does not satisfy the triangle inequality.
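A short numerical illustration of the asymmetry (the two distributions are made up):

```python
import numpy as np

def kl_divergence(p, q) -> float:
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

print(kl_divergence(p, q))   # ~0.085 nats
print(kl_divergence(q, p))   # ~0.092 nats -- different, so D_KL is not symmetric
```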
Mutual Information (MI)
To measure the mutual dependency of two random variables, we measure how different their joint distribution is from the product of their marginals.
The mutual information between two random variables \(X\) and \(Y\) is defined as:
\[
\begin{align*}
\mathbb{I }(X ; Y) &= D_{\mathbb{KL}}(p(x,y) \| p(x)p(y)) \\\\
&= \sum_{y \in Y} \sum_{x \in X} p(x, y) \log \frac{p(x,y)}{p(x)p(y)} \geq 0.
\end{align*}
\]
Or, using entropies:
\[
\begin{align*}
\mathbb{I }(X ; Y) &= \sum_{y \in Y} \sum_{x \in X} p(x, y) \log p(x,y)
- \sum_{y \in Y} \sum_{x \in X} p(x, y) \log p(x) - \sum_{y \in Y} \sum_{x \in X} p(x, y) \log p(y) \\\\
&= \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y)
\end{align*}
\]
Using the chain rule for entropy, \(\mathbb{H}(X, Y) = \mathbb{H}(Y| X) + \mathbb{H}(X) \),
we obtain the following expressions:
\[
\begin{align*}
\mathbb{I }(X ; Y) &= \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y) \\\\
&= \mathbb{H}(Y) - \mathbb{H}(Y | X) \\\\
&= \mathbb{H}(X) - \mathbb{H}(X | Y).
\end{align*}
\]
Applying the chain rule to both arguments once more, we also have:
\[
\begin{align*}
\mathbb{I }(X ; Y) &= \mathbb{H}(X) - \mathbb{H}(X | Y) \\\\
&= \mathbb{H}(X, Y) - \mathbb{H}(X | Y) - \mathbb{H}(Y | X).
\end{align*}
\]
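Finally, a minimal sketch (reusing the hypothetical 2x2 joint table from earlier) that checks the KL-divergence definition of mutual information against the entropy identity \(\mathbb{I}(X ; Y) = \mathbb{H}(X) + \mathbb{H}(Y) - \mathbb{H}(X, Y)\):

```python
import numpy as np

def entropy(p) -> float:
    p = np.asarray(p, dtype=float).ravel()
    nz = p[p > 0]
    return float(-np.sum(nz * np.log2(nz)))

# Hypothetical joint distribution p(x, y).
p_xy = np.array([[0.3, 0.1],
                 [0.2, 0.4]])
p_x = p_xy.sum(axis=1)
p_y = p_xy.sum(axis=0)

# Definition: I(X;Y) = D_KL( p(x,y) || p(x) p(y) ).
mi_kl = float(np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y))))

# Entropy identity: I(X;Y) = H(X) + H(Y) - H(X, Y).
mi_ent = entropy(p_x) + entropy(p_y) - entropy(p_xy)

print(np.isclose(mi_kl, mi_ent))   # True
```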