Multivariate Normal Distribution
In machine learning, the most important joint probability distribution for continuous random variables is the
multivariate normal distribution (MVN).
The multivariate normal distribution of an \(n\)-dimensional random vector \(\boldsymbol{x} \in \mathbb{R}^n \) is denoted as
\[
\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)
\]
where \(\boldsymbol{\mu} = \mathbb{E} [\boldsymbol{x}] \in \mathbb{R}^n\) is the mean vector, and
\(\Sigma = \text{Cov }[\boldsymbol{x}] \in \mathbb{R}^{n \times n}\) is the covariance matrix.
The p.d.f. is given by
\[
f(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp \Big[-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}) \Big]. \tag{1}
\]
The expression inside the exponential (ignoring the factor of \(-\frac{1}{2}\)) is the squared Mahalanobis distance between
the data vector \(\boldsymbol{x}\) and the mean vector \(\boldsymbol{\mu}\), given by
\[
d_{\Sigma} (\boldsymbol{x}, \boldsymbol{\mu})^2 = (\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}).
\]
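As a quick numerical check, here is a minimal Python sketch (assuming NumPy and SciPy are available; the mean, covariance, and data point are made-up values) that computes the squared Mahalanobis distance directly from the definition and compares it with scipy.spatial.distance.mahalanobis:
```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Illustrative mean vector, covariance matrix, and data point.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([2.0, 0.5])

# Squared Mahalanobis distance computed directly from the definition.
diff = x - mu
d2 = diff @ np.linalg.inv(Sigma) @ diff

# SciPy returns the (unsquared) distance and expects the inverse covariance.
d2_scipy = mahalanobis(x, mu, np.linalg.inv(Sigma)) ** 2

print(d2, d2_scipy)   # the two values agree up to floating-point error
```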
If \(\boldsymbol{x} \in \mathbb{R}^2\), the MVN is known as the bivariate normal distribution.
In this case,
\[
\begin{align*}
\Sigma &= \begin{bmatrix}
\text{Var } (X_1) & \text{Cov }[X_1, X_2] \\
\text{Cov }[X_2, X_1] & \text{Var } (X_2)
\end{bmatrix} \\\\
&= \begin{bmatrix}
\sigma_1^2 & \rho \sigma_1 \sigma_2 \\
\rho \sigma_1 \sigma_2 & \sigma_2^2
\end{bmatrix} \\\\
\end{align*}
\]
where \(\rho\) is the correlation coefficient, defined by
\[
\rho = \text{Corr }[X_1, X_2] = \frac{\text{Cov }[X_1, X_2]}{\sqrt{\text{Var }(X_1)\text{Var }(X_2)}}.
\]
Then
\[
\begin{align*}
\det (\Sigma) &= \sigma_1^2 \sigma_2^2 - \rho^2 \sigma_1^2 \sigma_2^2 \\\\
&= \sigma_1^2 \sigma_2^2 (1 - \rho^2)
\end{align*}
\]
and
\[
\begin{align*}
\Sigma^{-1} &= \frac{1}{\det (\Sigma )}
\begin{bmatrix}
\sigma_2^2 & -\rho \sigma_1 \sigma_2 \\
-\rho \sigma_1 \sigma_2 & \sigma_1^2
\end{bmatrix} \\\\
&= \frac{1}{1 - \rho^2}
\begin{bmatrix}
\frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\
\frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 }
\end{bmatrix}
\end{align*}
\]
Note that in Expression (1), \((\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu})\) is a quadratic form. So,
\[
\begin{align*}
(\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu})
&= \frac{1}{1 - \rho^2} \begin{bmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{bmatrix}
\begin{bmatrix}
\frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\
\frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 }
\end{bmatrix}
\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \\\\
&= \frac{1}{1 - \rho^2}\Big[\frac{1}{\sigma_1^2 }(x_1 - \mu_1)^2
-\frac{2\rho} {\sigma_1 \sigma_2}(x_1 - \mu_1)(x_2 - \mu_2)
+\frac{1}{\sigma_2^2 }(x_2 - \mu_2)^2 \Big].
\end{align*}
\]
Therefore, we obtain the p.d.f. for the bivariate normal distribution:
\[
f(\boldsymbol{x}) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}}
\exp\Big\{-\frac{1}{2(1 - \rho^2)}
\Big[\Big(\frac{x_1 - \mu_1}{\sigma_1}\Big)^2
-2\rho \Big(\frac{x_1 - \mu_1} {\sigma_1}\Big) \Big(\frac{x_2 - \mu_2} {\sigma_2}\Big)
+\Big(\frac{x_2 - \mu_2}{\sigma_2}\Big)^2
\Big]
\Big\}.
\]
When \(\rho = -1\) or \(1\), the covariance matrix is singular, this p.d.f. is undefined, and the distribution is said to be degenerate.
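As a sanity check on the bivariate density above, here is a minimal Python sketch (assuming NumPy and SciPy; the parameter values are made up for illustration) that evaluates the formula directly and compares it with scipy.stats.multivariate_normal:
```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters for the bivariate case.
mu1, mu2 = 0.0, 1.0
s1, s2, rho = 1.5, 0.8, 0.6

mu = np.array([mu1, mu2])
Sigma = np.array([[s1**2,         rho * s1 * s2],
                  [rho * s1 * s2, s2**2        ]])

def bvn_pdf(x1, x2):
    """Bivariate normal density written out as in the formula above."""
    q = ((x1 - mu1) / s1) ** 2 \
        - 2 * rho * ((x1 - mu1) / s1) * ((x2 - mu2) / s2) \
        + ((x2 - mu2) / s2) ** 2
    norm = 2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2)
    return np.exp(-q / (2 * (1 - rho**2))) / norm

x = np.array([0.7, 0.2])
print(bvn_pdf(*x))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match
```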
Cholesky Decomposition
In the previous section, we introduced the multivariate Gaussian distribution
\( \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \), where the covariance matrix
\( \boldsymbol{\Sigma} \) determines the shape and orientation of the distribution.
To efficiently generate samples from such a distribution, we often need to
transform uncorrelated standard normal samples into correlated ones that follow
\( \boldsymbol{\Sigma} \).
The Cholesky decomposition provides a convenient and numerically stable way
to perform this transformation.
Cholesky Decomposition:
Let \( \boldsymbol{A} \in \mathbb{R}^{d \times d} \) be a
symmetric positive definite matrix.
Then there exists a unique lower triangular matrix \( \boldsymbol{L} \) with positive diagonal
entries such that
\[
\boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top.
\]
(Equivalently, \( \boldsymbol{A} = \boldsymbol{R}^\top \boldsymbol{R} \) where
\( \boldsymbol{R} = \boldsymbol{L}^\top \) is upper triangular.)
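A minimal Python sketch of this factorization (assuming NumPy; the matrix is an arbitrary positive definite example) uses np.linalg.cholesky, which returns the lower triangular factor:
```python
import numpy as np

# A small symmetric positive definite matrix (made up for illustration).
A = np.array([[4.0, 2.0, 0.6],
              [2.0, 5.0, 1.0],
              [0.6, 1.0, 3.0]])

# np.linalg.cholesky returns the lower-triangular factor L with A = L L^T.
L = np.linalg.cholesky(A)

print(np.allclose(L @ L.T, A))      # True: the factorization reproduces A
print(np.allclose(L, np.tril(L)))   # True: L is lower triangular
print(np.all(np.diag(L) > 0))       # True: positive diagonal entries
```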
Suppose we want to sample \( \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \). Apply the
Cholesky decomposition to the covariance matrix:
\[
\boldsymbol{\Sigma} = \boldsymbol{L}\boldsymbol{L}^\top.
\]
If \( \boldsymbol{x} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) \), which can be obtained by sampling
\( d \) independent standard normal variables, then
\[
\boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu}
\]
has the desired covariance:
\[
\begin{align*}
\text{Cov}[\boldsymbol{y}]
&= \boldsymbol{L}\,\text{Cov}[\boldsymbol{x}]\,\boldsymbol{L}^\top \\
&= \boldsymbol{L}\boldsymbol{I}\boldsymbol{L}^\top \\
&= \boldsymbol{\Sigma}.
\end{align*}
\]
Proof:
By linearity of expectation,
\[
\mathbb{E}[\boldsymbol{y}]
= \mathbb{E}[\boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu}]
= \boldsymbol{L}\mathbb{E}[\boldsymbol{x}] + \boldsymbol{\mu}
= \boldsymbol{\mu}.
\]
The covariance of a random vector is given by
\[
\text{Cov}[\boldsymbol{y}]
= \mathbb{E}\!\left[(\boldsymbol{y} - \mathbb{E}[\boldsymbol{y}])
(\boldsymbol{y} - \mathbb{E}[\boldsymbol{y}])^\top \right].
\]
Substituting \( \boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu} \) and using
\( \mathbb{E}[\boldsymbol{x}] = \boldsymbol{0} \),
\[
\begin{align*}
\text{Cov}[\boldsymbol{y}]
&= \mathbb{E}\!\left[ (\boldsymbol{L}\boldsymbol{x})
(\boldsymbol{L}\boldsymbol{x})^\top \right] \\\\
&= \boldsymbol{L}\,\mathbb{E}\!\left[\boldsymbol{x}\boldsymbol{x}^\top\right]\boldsymbol{L}^\top \\\\
&= \boldsymbol{L}\,\text{Cov}[\boldsymbol{x}]\,\boldsymbol{L}^\top \\\\
&= \boldsymbol{L}\boldsymbol{I}\boldsymbol{L}^\top \\\\
&= \boldsymbol{L}\boldsymbol{L}^\top \\\\
&= \boldsymbol{\Sigma}.
\end{align*}
\]
Since an affine transformation of a Gaussian vector is again Gaussian, we conclude
\( \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{L}\boldsymbol{L}^\top) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \).
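The sampling recipe can be checked numerically. The sketch below (Python with NumPy; the mean, covariance, and sample size are illustrative choices) draws standard normal vectors, applies \( \boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu} \), and confirms that the empirical mean and covariance are close to \( \boldsymbol{\mu} \) and \( \boldsymbol{\Sigma} \):
```python
import numpy as np

rng = np.random.default_rng(0)

# Target mean and covariance (illustrative values; Sigma is positive definite).
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.8, 0.2],
                  [0.8, 1.5, 0.4],
                  [0.2, 0.4, 1.0]])

# Transform i.i.d. standard normal draws x ~ N(0, I) into y = L x + mu.
L = np.linalg.cholesky(Sigma)
n_samples = 200_000
x = rng.standard_normal((n_samples, 3))
y = x @ L.T + mu                       # each row is one sample of y

# The empirical mean and covariance should be close to mu and Sigma.
print(y.mean(axis=0))
print(np.cov(y, rowvar=False))
```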
Dirichlet Distribution
The Dirichlet distribution is a multivariate generalization of the beta distribution.
It has support over the \((K - 1)\)-dimensional probability simplex, defined by
\[
S_K = \Bigl\{(x_1, x_2, \dots, x_K) \in \mathbb{R}^K: x_k \ge 0,\ \sum_{k=1}^K x_k = 1 \Bigr\}.
\]
A random vector \(\boldsymbol{x} \in \mathbb{R}^K\) is said to have a Dirichlet distribution with parameters
\(\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \dots, \alpha_K)\) (with each \(\alpha_k > 0\)) if its probability
density function is given by
\[
f(x_1, \dots, x_K; \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1}, \quad (x_1, \dots, x_K) \in S_K,
\]
or, more compactly,
\[
\operatorname{Dir}(\boldsymbol{x} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1}\, \mathbb{I}(\boldsymbol{x} \in S_K),
\]
where the multivariate beta function \(B(\boldsymbol{\alpha})\) is defined as
\[
B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma\Bigl(\sum_{k=1}^K \alpha_k\Bigr)}.
\]
Moments:
Let \(\alpha_0 = \sum_{k=1}^K \alpha_k\). Then for each \(k\),
- Mean:
\[
\mathbb{E}[x_k] = \frac{\alpha_k}{\alpha_0}.
\]
- Variance:
\[
\operatorname{Var}[x_k] = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0+1)}.
\]
Note: Often we use a symmetric Dirichlet prior of the form \(\alpha_k = \frac{\alpha}{K}\). Then
\[
\mathbb{E}[x_k] = \frac{1}{K}, \quad \operatorname{Var}[x_k] = \frac{K-1}{K^2 (\alpha +1)}.
\]
We can see that increasing \(\alpha\) increases the precision (decreases the variance) of the distribution.
- Covariance: For \(i \neq j\),
\[
\operatorname{Cov}[x_i, x_j] = \frac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0+1)}.
\]
The parameters \(\alpha_k\) can be thought of as "pseudocounts" or prior observations of each category. When all
\(\alpha_k\) are equal (i.e., \(\boldsymbol{\alpha} = \alpha\, \mathbf{1}\)), the distribution is said to be symmetric,
and when \(\alpha = 1\) it is the uniform distribution over the simplex. This symmetry makes the Dirichlet distribution a
natural prior in Bayesian models where no category is favored a priori.
The Dirichlet distribution is widely used as a conjugate prior for the parameters
of a multinomial distribution in Bayesian statistics, as well as in machine learning models
such as latent Dirichlet allocation (LDA) for topic modeling.
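To make the density and moment formulas concrete, here is a minimal Python sketch (assuming NumPy and SciPy; the concentration parameters are arbitrary illustrative values) that checks the mean and variance formulas by Monte Carlo and evaluates the density with scipy.stats.dirichlet:
```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 5.0])   # illustrative concentration parameters
alpha0 = alpha.sum()

# Closed-form mean and variance from the formulas above.
mean_formula = alpha / alpha0
var_formula = alpha * (alpha0 - alpha) / (alpha0**2 * (alpha0 + 1))

# Monte Carlo check using samples drawn from the probability simplex.
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=200_000)
print(mean_formula, samples.mean(axis=0))   # means should agree closely
print(var_formula, samples.var(axis=0))     # variances should agree closely

# The density can be evaluated at a point of the simplex with SciPy.
x = np.array([0.2, 0.3, 0.5])
print(dirichlet(alpha).pdf(x))
```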
Wishart Distribution
The Wishart distribution is a fundamental multivariate distribution that generalizes the gamma distribution to
positive definite matrices; it is a distribution over \(D \times D\) symmetric positive definite matrices \(\boldsymbol{\Sigma}\).
The p.d.f. of the Wishart distribution is defined as follows:
\[
\operatorname{Wi}(\boldsymbol{\Sigma} | \boldsymbol{S}, \nu)
= \frac{1}{Z} |\boldsymbol{\Sigma}| ^{(\nu-D-1)/2} \exp \Bigl( - \frac{1}{2} \text{ tr } (\boldsymbol{S}^{-1} \boldsymbol{\Sigma})\Bigr)
\]
where
\[
Z = 2^{\nu D/2}\, | \boldsymbol{S} |^{\nu / 2}\,\Gamma_D \Bigl(\frac{\nu}{2}\Bigr).
\]
Here, \(\Gamma_D(\cdot)\) denotes the multivariate gamma function. The parameters are:
- \(\nu\): the degrees of freedom (which must satisfy \(\nu > D - 1\) for the density to be well-defined),
- \(\boldsymbol{S}\): the scale matrix (a \(D \times D\) positive definite matrix).
Note: the mean is \(\mathbb{E}[\boldsymbol{\Sigma}] = \nu \boldsymbol{S}\).
If \(D=1\), it reduces to the gamma distribution:
\[
\operatorname{Wi}(\lambda \mid s, \nu) = \operatorname{Gamma}\Bigl(\lambda \,\Big|\, \text{shape} = \frac{\nu}{2},\ \text{rate} = \frac{1}{2s}\Bigr).
\]
If we set \(s=1\), this further reduces to the chi-squared distribution with \(\nu\) degrees of freedom.
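This one-dimensional reduction can be verified numerically. The sketch below (Python with NumPy and SciPy; \(\nu\) and \(s\) are arbitrary illustrative values) implements the \(D = 1\) Wishart density from the formula above and compares it with the corresponding gamma density (SciPy parameterizes the gamma by shape and scale, so rate \(\frac{1}{2s}\) becomes scale \(2s\)) and, for \(s = 1\), with the chi-squared density:
```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import gamma, chi2

# One-dimensional Wishart density, written out from the formula above (D = 1).
def wishart_1d_pdf(lam, s, nu):
    log_z = (nu / 2) * np.log(2) + (nu / 2) * np.log(s) + gammaln(nu / 2)
    return np.exp((nu - 2) / 2 * np.log(lam) - lam / (2 * s) - log_z)

nu, s = 5.0, 1.5                          # illustrative values
lam = np.linspace(0.1, 10.0, 5)

# Gamma with shape nu/2 and rate 1/(2s); SciPy uses scale = 1/rate = 2s.
print(wishart_1d_pdf(lam, s, nu))
print(gamma(a=nu / 2, scale=2 * s).pdf(lam))          # should match

# With s = 1 the density coincides with the chi-squared density (nu dof).
print(np.allclose(wishart_1d_pdf(lam, 1.0, nu), chi2(df=nu).pdf(lam)))
```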
Applications:
- Covariance Matrix Estimation:
In multivariate statistics, the Wishart distribution is used to model the uncertainty in covariance matrix estimates.
Given \(N\) i.i.d. samples \(\boldsymbol{x}_i \in \mathbb{R}^D\) from \(\mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})\), the scatter matrix (the unnormalized sample covariance)
\[
\boldsymbol{S} = \sum_{i=1}^N \boldsymbol{x}_i {\boldsymbol{x}_i}^T
\]
follows a Wishart distribution:
\[
\boldsymbol{S} \sim \operatorname{Wi}( \boldsymbol{\Sigma}, N).
\]
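As a quick Monte Carlo check of this fact, the sketch below (Python with NumPy; the covariance, sample size, and number of replications are illustrative) averages many simulated scatter matrices and compares the result with the Wishart mean \(N \boldsymbol{\Sigma}\):
```python
import numpy as np

rng = np.random.default_rng(0)

Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])   # illustrative true covariance (D = 2)
N = 8                            # number of samples per scatter matrix
reps = 50_000                    # Monte Carlo replications

# Draw N samples from N(0, Sigma) in each replication and form the
# scatter matrix sum_i x_i x_i^T, which follows Wi(Sigma, N).
X = rng.multivariate_normal(np.zeros(2), Sigma, size=(reps, N))
scatter = np.einsum('rni,rnj->rij', X, X)

# The Wishart mean is N * Sigma, so the Monte Carlo average should be close.
print(scatter.mean(axis=0))
print(N * Sigma)
```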
- Bayesian Inference:
It serves as a conjugate prior for the covariance matrix in multivariate normal models, enabling closed-form
updates of the posterior distribution. We often use the inverse Wishart distribution:
For \(\nu > D -1\) and \(\boldsymbol{S} \succ 0\),
\[
\operatorname{IW}(\boldsymbol{\Sigma} | \boldsymbol{S}^{-1}, \nu)
= \frac{1}{Z_{IW}} |\boldsymbol{\Sigma}| ^{-(\nu+D+1)/2} \exp \Bigl( - \frac{1}{2} \text{ tr } (\boldsymbol{S} \boldsymbol{\Sigma}^{-1})\Bigr)
\]
where
\[
Z_{IW} = 2^{\nu D/2}\, | \boldsymbol{S} |^{-\nu / 2}\,\Gamma_D \Bigl(\frac{\nu}{2}\Bigr).
\]
That is, we place a prior of the form
\[
\boldsymbol{\Sigma} \sim \operatorname{IW}( \boldsymbol{S}, \nu)
\]
on the covariance matrix, and the posterior after observing Gaussian data is again inverse Wishart.
(Recall that if \(\lambda \sim \text{Gamma }(a, b)\), then \(\frac{1}{\lambda} \sim \operatorname{IG}(a,b)\).
Similarly, the IW is the multivariate generalization of the inverse gamma.)
Note: the first application concerns the sampling properties of an estimator (frequentist), while the second concerns prior beliefs and posterior
inference (Bayesian). In practice, both deal with uncertainty in covariance matrices, and the same distribution arises in
both roles, just with different interpretations.
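To illustrate the conjugacy described in the Bayesian application, here is a minimal Python sketch of the closed-form update in the simplest setting of a known (zero) mean, using the standard result that an \(\operatorname{IW}(\boldsymbol{S}, \nu)\) prior combined with \(N\) observations \(\boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})\) yields an \(\operatorname{IW}\bigl(\boldsymbol{S} + \sum_i \boldsymbol{x}_i \boldsymbol{x}_i^T,\ \nu + N\bigr)\) posterior; the prior parameters and synthetic data are hypothetical:
```python
import numpy as np

def iw_posterior(S_prior, nu_prior, X):
    """Conjugate update for the covariance of a zero-mean Gaussian.

    Prior:      Sigma ~ IW(S_prior, nu_prior)
    Likelihood: x_i ~ N(0, Sigma), i.i.d. rows of X
    Posterior:  Sigma ~ IW(S_prior + sum_i x_i x_i^T, nu_prior + N)
    """
    N = X.shape[0]
    scatter = X.T @ X                 # sum of outer products x_i x_i^T
    return S_prior + scatter, nu_prior + N

# Illustrative prior and synthetic data (hypothetical values).
rng = np.random.default_rng(0)
S0, nu0 = np.eye(2), 4.0
true_Sigma = np.array([[1.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(2), true_Sigma, size=100)

S_post, nu_post = iw_posterior(S0, nu0, X)
# The mean of IW(S, nu) in D dimensions is S / (nu - D - 1), so the
# posterior mean of Sigma should be close to the true covariance.
print(S_post / (nu_post - 2 - 1))
```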