Fisher Information Matrix
The Fisher information matrix is related to the curvature of log
likelihood functions. In frequentist statistics, it characterizes the sampling distribution of the MLE.
First, we define a score function as the gradient of the log likelihood with respect
to the parameter vector \(\theta\):
\[
s(\theta) = \nabla_{\theta} \log p(x | \theta).
\]
The Fisher information matrix (FIM) is defined to be the covariance of the
score function:
\[
F(\theta) = \mathbb{E }_{x \sim p(x|\theta)} [s(\theta)s(\theta)^T].
\]
Theorem 1:
If \( \log p(x |\theta)\) is twice differentiable, and under certain regularity conditions, the
the Fisher information matrix equals the expected Hessian of the negative log likelihood (NLL):
\[
F(\theta) = - \mathbb{E }_{x \sim \theta} [\nabla_{\theta}^2 \log p(x |\theta)].
\]
Proof:
First, the expected value of the score function \(s(\theta)\) is zero. In scalar case, assuming that \(p(x | \theta)\) is differentiable and
the bounds of the integral do not depend on \(\theta\), we have
\[
\begin{align*}
&\int p(x | \theta) dx = 1 \\\\
&\Longrightarrow \frac{\partial}{\partial \theta} \int p(x | \theta) dx = 0 \\\\
&\Longrightarrow \int \Big[\frac{\partial}{\partial \theta} \log p(x | \theta)\Big] p(x | \theta) dx = 0 \tag{1} \\\\
&\Longrightarrow \mathbb{E }[s(\theta)] = 0
\end{align*}
\]
where \(\frac{\partial}{\partial \theta} \log p(x | \theta) = s(\theta)\).
Taking derivatives of Equation (1), by the product rule, we obtain
\[
\begin{align*}
0 &= \frac{\partial}{\partial \theta} \int \Big[\frac{\partial}{\partial \theta} \log p(x | \theta)\Big] p(x | \theta) dx \\\\
&= \int \Big[\frac{\partial^2}{\partial \theta^2} \log p(x | \theta)\Big] p(x | \theta) dx
+ \int \Big[\frac{\partial}{\partial \theta} \log p(x | \theta)\Big] \frac{\partial}{\partial \theta} p(x | \theta) dx \\\\
&= \int \Big[\frac{\partial^2}{\partial \theta^2} \log p(x | \theta)\Big] p(x | \theta) dx
+ \int \Big[\frac{\partial}{\partial \theta} \log p(x | \theta)\Big]^2 p(x | \theta) dx. \\\\
\end{align*}
\]
Therefore,
\[
- \mathbb{E }_{x \sim \theta } \Big[\frac{\partial^2}{\partial \theta^2} \log p(x | \theta)\Big]
= \mathbb{E }_{x \sim \theta } \Big[ \Big(\frac{\partial}{\partial \theta} \log p(x | \theta)\Big)^2 \Big].
\]
FIM for the Exponential Family
Consider an exponential family distribution with natural parameter vector \(\eta \in \mathbb{R}^K\):
\[
p(x | \eta) = h(x) \exp \{\eta^\top \mathcal{T}(x) - A(\eta)\}.
\]
Remember that the gradient of the log partition function \(A(\eta)\) is the
expected sufficient statistics \(\mathcal{T}(\eta)\), which is called the moment parameters \(m\):
\[
\nabla_{\eta} A(\eta) = \mathbb{E }[\mathcal{T}(\eta)] = m.
\]
Also, the gradient of the log likelihood is the sufficient statistics minus their expected value:
\[
\begin{align*}
&\log p(x | \eta) = \log h(x) + \eta^\top \mathcal{T}(x) - A(\eta) \\\\
&\Longrightarrow \nabla_{\eta} \log p(x | \eta) = \mathcal{T}(x) - \mathbb{E }[\mathcal{T}(x)] = \mathcal{T}(x) - m.
\end{align*}
\]
Therefore, the Hessian of the log partition function is the same as the FIM, which is the same as the
covariance of the sufficient statistics:
\[
\begin{align*}
F(\eta) &= - \mathbb{E }_{p(x|\eta)} \Big[\nabla_{\eta}^2 (\eta^\top \mathcal{T}(x) - A(\eta))\Big] \\\\
&= \nabla_{\eta}^2 A(\eta) \\\\
&= \text{Cov }[\mathcal{T}(x)].
\end{align*}
\]
So, the FIM is indeed the second cumulant of the sufficient statistics.
Note: Sometimes, we need FIM with respect to the moment parameters \(m\):
\[
m = \nabla_{\eta} A(\eta) \Longrightarrow \frac {dm}{d\eta} = \nabla_{\eta}^2 A(\eta) = F(\eta)
\]
Here, \(F(\eta)\) is the Jacobian matrix and thus:
\[
F(m) = \frac{d\eta}{dm} = \Big(\frac{dm}{d\eta}\Big)^{-1} = F(\eta)^{-1}.
\]
Natural Gradient Descent
The FIM can be used in optimization.
The natural gradient descent (NGD) is the second order optimization method for the parameters
of (conditional) probability distributions \(p_{\theta}(y | x)\).
(Check: Gradient Descent for basic ideas of optimization problems.)
In NGD, we measure the difference between two probability distributions by the KL divergence.
For any inputs \(x \in \mathbb{R}^n\), we can approximate the KL divergence in terms of the FIM by the
second order Taylor series expansion:
\[
\begin{align*}
D_{\mathbb{KL}}(p_{\theta}(y | x) \, \| \, p_{\theta + \delta} (y|x))
&\approx -\delta^\top \mathbb{E }_{p_{\theta}(y | x)} [\nabla \log p_{\theta}(y | x) ] - \frac{1}{2}\delta^\top \mathbb{E }_{p_{\theta}(y | x)}[\nabla^2 \log p_{\theta}(y | x) ] \delta \\\\
&= 0 - \frac{1}{2}\delta^\top F_x(\theta) \delta \\\\
&= \frac{1}{2}\delta^\top F_x \delta
\end{align*}
\]
where \(\delta\) represents the change in the parameters.
We compute average KL divergence between updated distribution and previous one using
\[
\frac{1}{2}\delta^\top F \delta
\]
where \(F\) is the averaged FIM:
\[
F(\theta) = \mathbb{E }_{p\mathcal{D}(x)}[F_x(\theta)].
\]
In NGD, we use the inverse FIM as a preconditioning matrix and update parameters:
\[
\theta_{t+1} = \theta_{t} - \eta_{t} F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t)
\]
Here, we define the natural gradient:
\[
\widetilde{\nabla} \mathcal{L}(\theta_t) = F(\theta_t)^{-1} \nabla \mathcal{L}(\theta_t) = F^{-1}g_t
\]
Note: \(F\) is always positive definite and is relatively easier to compute and approximate compared to the Hessian matrix.
Jeffreys Prior
In Bayesian statistics, the FIM is used to derive Jeffreys prior, which is a widely used uninformative prior.
It allows the posterior to be driven primarily by the data itself. Given a prior \(p_{\theta}(\theta)\) and a transformation \(\phi = f(\theta)\),
We seek a prior that is invariant under reparameterization. This ensures that inference remains consistent regardless of the choice
of parameterization. The prior should transform as:
\[
p_{\phi}(\phi) = p_{\theta}(\theta) \Big| \frac{d\theta}{d\phi} \Big|
\]
or in multiple dimensions,
\[
p_{\phi}(\phi) = p_{\theta}(\theta) | \det J |
\]
where \(J\) is the Jacobian matrix with entries \(J_{ij} = \frac{\partial\theta_i}{\partial\phi_j}\).
The Jeffreys prior is defined as
\[
p(\theta) \propto \sqrt{F(\theta)}
\]
where \(F\) is the Fisher information.
Or, in multiple dimensions, it has the form
\[
p(\theta) \propto \sqrt{\det F(\theta)}
\]
where \(F\) is the Fisher information matrix.
In 1d case, suppose \(p_{\theta}(\theta) \propto \sqrt{F(\theta)}\). We can derive a prior for \(\phi\) in terms of \(\theta\) as follows:
\[
\begin{align*}
p_{\phi}(\phi) &= p_{\theta}(\theta) \Big| \frac{d\theta}{d\phi} \Big| \\\\
&\propto \sqrt{F(\theta)\Big(\frac{d\theta}{d\phi}\Big)^2} \\\\
&= \sqrt{\mathbb{E }\Big[\Big(\frac{d \log p(x|\theta)}{d\theta}\Big )^2\Big] \Big(\frac{d\theta}{d\phi}\Big)^2}\\\\
&= \sqrt{\mathbb{E }\Big[\Big(\frac{d \log p(x|\theta)}{d\theta}\frac{d\theta}{d\phi}\Big)^2\Big] }\\\\
&= \sqrt{\mathbb{E }\Big[\Big(\frac{d \log p(x|\phi)}{d\phi}\Big)^2\Big] }\\\\
&= \sqrt{F(\phi)}
\end{align*}
\]
So, the Jeffreys prior is invariant to reparameterizations.
Note: The KL divergence is indeed invariant to reparameterizations.
Example:
Consider the Binomial distribution:
\[
X \sim \text{Bin }(n, \theta), \, 0 \leq \theta \leq 1
\]
\[
p(x | \theta) = \binom{n}{x} \theta^x (1-\theta)^{n-x}
\]
Ignoring \(\binom{n}{x}\), its log likelihood is given by
\[
l(\theta | x) \propto x\log \theta + (n-x)\log(1-\theta).
\]
The Fisher information is given by
\[
\begin{align*}
F(\theta) &= -\mathbb{E }_{x \sim \theta}\left[\frac{d^2 l}{d \theta^2}\right] \\\\
&= -\mathbb{E }_{x \sim \theta}\left[-\frac{x}{\theta^2}-\frac{n-x}{(1-\theta)^2}\right] \\\\
&= \frac{n\theta}{\theta^2} + \frac{(1-\theta)n}{(1-\theta)^2} \\\\
&= \frac{n}{\theta(1-\theta)} \\\\
&\propto \theta^{-1}(1-\theta)^{-1}.
\end{align*}
\]
Thus, the Jeffreys prior for the parameter \(\theta\) is given by
\[
\begin{align*}
p_{\theta} (\theta) = \sqrt{F(\theta)} &\propto \theta^{-\frac{1}{2}} (1 - \theta)^{-\frac{1}{2}} \\\\
\end{align*}
\]
Now, consider the parameterization by \(\phi = \frac{\theta}{1 - \theta}\).
Solving this expression for \(\theta\), we get
\[
\theta = \frac{\phi}{\phi+1}.
\]
Then we have
\[
\begin{align*}
p(x|\phi) &\propto \left(\frac{\phi}{\phi+1}\right)^x \left(1 - \frac{\phi}{\phi + 1}\right)^{n-x} \\\\
&= \phi^x (\phi +1)^{-x} (\phi +1)^{-n +x} \\\\
&= \phi^x (\phi + 1)^{-n}
\end{align*}
\]
and the log likelihood is given by
\[
l(\phi | x) = x\log \phi -n\log(\phi + 1).
\]
The Fisher information is given by
\[
\begin{align*}
F(\phi) &= -\mathbb{E }_{x \sim \phi}\left[\frac{d^2 l}{d \phi^2}\right] \\\\
&= -\mathbb{E }_{x \sim \phi}\left[-\frac{x}{\phi^2}+\frac{n}{(\phi + 1)^2}\right] \\\\
&= \frac{n\phi}{\phi + 1}\frac{x}{\phi^2} - \frac{n}{(\phi + 1)^2} \\\\
&= \frac{n(\phi+1)-n\phi}{\phi(\phi+1)^2} \\\\
&= \frac{n}{\phi(\phi+1)^2} \\\\
&\propto \phi^{-1} (\phi+1)^{-2}
\end{align*}
\]
Thus, the Jeffreys prior for the reparameterized variable \(\phi\) is given by:
\[
p_{\phi}(\phi) = \sqrt{F(\phi)} \propto \phi^{-\frac{1}{2}}(\phi +1)^{-1}.
\]
Note: In 1d, the Jeffreys prior is the same as the reference prior, which maximizes
the expected KL divergence between posterior and prior. In other words, it maximizes the information
provided by the "data" relative to the prior. For multidimensional parameters, they are not the same.
Also, finding a reference prior is equivalent to finding the prior that maximizes mutual information
between \(\theta\) and \(\mathcal{D}\).
\[
\begin{align*}
p^*(\theta) &= \arg \max_{p(\theta)} \mathbb{E }_{\mathcal{D}}[D_{\mathbb{KL}}(p(\theta | \mathcal{D}) \,\|\, p(\theta) )] \\\\
&= \arg \max_{p(\theta)}\, \mathbb{I }(\theta ; \mathcal{D})
\end{align*}
\]
For example, in continuous case,
\[
\begin{align*}
\mathbb{I }(\theta \, ; \mathcal{D}) &= \int_{\mathcal{D}} p(\mathcal{D}) D_{\mathbb{KL}}(p(\theta | \mathcal{D}) \,\|\, p(\theta) ) d\mathcal{D} \\\\
&= \int p(\mathcal{D}) \left( \int p(\theta | \mathcal{D}) \log \frac{p(\theta | \mathcal{D})}{p(\theta)}d\theta \right) d\mathcal{D} \\\\
&= \int \int p(\theta|\mathcal{D})p(\mathcal{D}) \log \frac{p(\theta|\mathcal{D})}{p(\theta)} d\theta d\mathcal{D} \\\\
&= \int \int p(\theta, \mathcal{D}) \log \frac{p(\theta, \mathcal{D})}{p(\theta)p(\mathcal{D})}d\theta d\mathcal{D}
\end{align*}
\]
where \(p(\mathcal{D})\) is the marginal likelihood:
\[
p(\mathcal{D}) = \int p(\mathcal{D} | \theta)p(\theta)d\theta
\]
and \(p(\theta, \mathcal{D})\) represents the joint probability distribution between \(\theta\) and \(\mathcal{D}\):
\[
p(\theta, \mathcal{D}) = p(\theta|\mathcal{D})p(\mathcal{D}).
\]
Note: Remember, the mutual information itself is invariant under reparameterization.