Multivariate Normal Distribution
In machine learning, the most important joint probability distribution for continuous random variables is the
multivariate normal distribution (MVN).
The multivariate normal distribution of an \(n\)-dimensional random vector \(\boldsymbol{x} \in \mathbb{R}^n\) is denoted as
\[
\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)
\]
where \(\boldsymbol{\mu} = \mathbb{E} [\boldsymbol{x}] \in \mathbb{R}^n\) is the mean vector, and
\(\Sigma = \text{Cov }[\boldsymbol{x}] \in \mathbb{R}^{n \times n}\) is the covariance matrix.
The p.d.f. is given by
\[
f(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp \Big[-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}) \Big]. \tag{1}
\]
The expression inside the exponential (ignoring the factor of \(-\frac{1}{2}\)) is the squared Mahalanobis distance between
the data vector \(\boldsymbol{x}\) and the mean vector \(\boldsymbol{\mu}\), given by
\[
d_{\Sigma} (\boldsymbol{x}, \boldsymbol{\mu})^2 = (\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}).
\]
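As a quick numerical sanity check, the short Python sketch below evaluates the squared Mahalanobis distance and the density in Equation (1) directly, and compares the result with SciPy's multivariate_normal; the particular \(\boldsymbol{\mu}\), \(\Sigma\), and \(\boldsymbol{x}\) are arbitrary example values.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Example parameters (arbitrary illustrative values).
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 0.5]])
x = np.array([0.8, -1.5, 0.0])

# Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu).
diff = x - mu
d2 = diff @ np.linalg.solve(Sigma, diff)

# Density from Equation (1).
n = len(mu)
pdf_manual = np.exp(-0.5 * d2) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

# Reference value computed by SciPy; the two densities should agree.
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
print(d2, pdf_manual, pdf_scipy)
```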
If \(\boldsymbol{x} \in \mathbb{R}^2\), the MVN is known as the bivariate normal distribution.
In this case,
\[
\begin{align*}
\Sigma &= \begin{bmatrix}
\text{Var } (X_1) & \text{Cov }[X_1, X_2] \\
\text{Cov }[X_2, X_1] & \text{Var } (X_2)
\end{bmatrix} \\\\
&= \begin{bmatrix}
\sigma_1^2 & \rho \sigma_1 \sigma_2 \\
\rho \sigma_1 \sigma_2 & \sigma_2^2
\end{bmatrix}
\end{align*}
\]
where \(\rho\) is the correlation coefficient defined by
\[
\rho = \text{Corr }[X_1, X_2] = \frac{\text{Cov }[X_1, X_2]}{\sqrt{\text{Var }(X_1)\text{Var }(X_2)}}.
\]
Then
\[
\begin{align*}
\det (\Sigma) &= \sigma_1^2 \sigma_2^2 - \rho^2 \sigma_1^2 \sigma_2^2 \\\\
&= \sigma_1^2 \sigma_2^2 (1 - \rho^2)
\end{align*}
\]
and
\[
\begin{align*}
\Sigma^{-1} &= \frac{1}{\det (\Sigma )}
\begin{bmatrix}
\sigma_2^2 & -\rho \sigma_1 \sigma_2 \\
-\rho \sigma_1 \sigma_2 & \sigma_1^2
\end{bmatrix} \\\\
&= \frac{1}{1 - \rho^2}
\begin{bmatrix}
\frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\
\frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 }
\end{bmatrix}.
\end{align*}
\]
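The closed-form determinant and inverse derived above are easy to verify numerically against NumPy; the sketch below assumes arbitrary example values for \(\sigma_1\), \(\sigma_2\), and \(\rho\).

```python
import numpy as np

# Arbitrary example parameters.
sigma1, sigma2, rho = 1.5, 0.8, 0.6

Sigma = np.array([[sigma1**2, rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2]])

# Closed-form determinant: sigma_1^2 * sigma_2^2 * (1 - rho^2).
det_formula = sigma1**2 * sigma2**2 * (1 - rho**2)

# Closed-form inverse from the derivation above.
Sigma_inv_formula = (1 / (1 - rho**2)) * np.array(
    [[1 / sigma1**2, -rho / (sigma1 * sigma2)],
     [-rho / (sigma1 * sigma2), 1 / sigma2**2]])

print(np.isclose(det_formula, np.linalg.det(Sigma)))         # True
print(np.allclose(Sigma_inv_formula, np.linalg.inv(Sigma)))  # True
```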
Note that in Equation (1), \((\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu})\) is a quadratic form. In the bivariate case,
\[
\begin{align*}
(\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu})
&= \frac{1}{1 - \rho^2} \begin{bmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{bmatrix}
\begin{bmatrix}
\frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\
\frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 }
\end{bmatrix}
\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \\\\
&= \frac{1}{1 - \rho^2}\Big[\frac{1}{\sigma_1^2 }(x_1 - \mu_1)^2
-\frac{2\rho} {\sigma_1 \sigma_2}(x_1 - \mu_1)(x_2 - \mu_2)
+\frac{1}{\sigma_2^2 }(x_2 - \mu_2)^2 \Big].
\end{align*}
\]
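The same kind of check confirms that the expanded scalar expression agrees with the matrix form of the quadratic form; again, the mean and data point below are arbitrary example values.

```python
import numpy as np

# Same example parameters as above, plus an arbitrary mean and data point.
sigma1, sigma2, rho = 1.5, 0.8, 0.6
mu1, mu2 = 0.5, -1.0
x1, x2 = 1.2, -0.3

Sigma = np.array([[sigma1**2, rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2]])
diff = np.array([x1 - mu1, x2 - mu2])

# Matrix form of the quadratic form.
quad_matrix = diff @ np.linalg.inv(Sigma) @ diff

# Expanded scalar form from the derivation above.
quad_expanded = (1 / (1 - rho**2)) * (
    (x1 - mu1)**2 / sigma1**2
    - 2 * rho * (x1 - mu1) * (x2 - mu2) / (sigma1 * sigma2)
    + (x2 - mu2)**2 / sigma2**2)

print(np.isclose(quad_matrix, quad_expanded))  # True
```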
Therefore, we obtain the p.d.f. of the bivariate normal distribution:
\[
f(\boldsymbol{x}) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{(1 - \rho^2)}}
\exp\Big\{-\frac{1}{2(1 - \rho^2)}
\Big[\Big(\frac{x_1 - \mu_1}{\sigma_1}\Big)^2
-2\rho \Big(\frac{x_1 - \mu_1} {\sigma_1}\Big) \Big(\frac{x_2 - \mu_2} {\sigma_2}\Big)
+\Big(\frac{x_2 - \mu_2}{\sigma_2}\Big)^2
\Big]
\Big\}.
\]
When \(\rho = \pm 1\), the covariance matrix \(\Sigma\) is singular, so this p.d.f. is undefined and the distribution is said to be degenerate.
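As a final check, the explicit bivariate p.d.f. can be compared with the general MVN density computed by SciPy; the parameter values below are arbitrary examples, and letting \(\rho\) approach \(\pm 1\) drives \(\det(\Sigma)\) to zero, which is exactly the degenerate case just described.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Arbitrary example parameters (rho must stay strictly between -1 and 1).
sigma1, sigma2, rho = 1.5, 0.8, 0.6
mu1, mu2 = 0.5, -1.0
x1, x2 = 1.2, -0.3

# Explicit bivariate density from the formula above.
z = ((x1 - mu1) / sigma1)**2 \
    - 2 * rho * ((x1 - mu1) / sigma1) * ((x2 - mu2) / sigma2) \
    + ((x2 - mu2) / sigma2)**2
pdf_bivariate = np.exp(-z / (2 * (1 - rho**2))) / (
    2 * np.pi * sigma1 * sigma2 * np.sqrt(1 - rho**2))

# Reference: general MVN density evaluated by SciPy.
Sigma = np.array([[sigma1**2, rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2**2]])
pdf_scipy = multivariate_normal(mean=[mu1, mu2], cov=Sigma).pdf([x1, x2])

print(np.isclose(pdf_bivariate, pdf_scipy))  # True
# As rho -> +/-1, det(Sigma) -> 0, Sigma^{-1} does not exist, and the
# density is undefined: the degenerate case described above.
```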