Multivariate Normal Distribution
In machine learning, the most widely used joint probability distribution for continuous random variables is the
multivariate normal distribution (MVN).
The multivariate normal distribution of an \(n\)-dimensional random vector \(\boldsymbol{x} \in \mathbb{R}^n \) is denoted as
\[
\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)
\]
where \(\boldsymbol{\mu} = \mathbb{E} [\boldsymbol{x}] \in \mathbb{R}^n\) is the mean vector, and
\(\Sigma = \text{Cov }[\boldsymbol{x}] \in \mathbb{R}^{n \times n}\) is the covariance matrix.
The p.d.f. is given by
\[
f(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp \Big[-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}) \Big]. \tag{1}
\]
The expression inside the exponential (ignoring the factor of \(-\frac{1}{2}\)) is the squared Mahalanobis distance between
the data vector \(\boldsymbol{x}\) and the mean vector \(\boldsymbol{\mu}\), given by
\[
d_{\Sigma} (\boldsymbol{x}, \boldsymbol{\mu})^2 = (\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}).
\]
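The two formulas above can be checked numerically. The following is a minimal sketch using NumPy; the function names `mahalanobis_sq` and `mvn_pdf` and the example values are illustrative assumptions, not part of the original text. It solves a linear system rather than forming \(\Sigma^{-1}\) explicitly, which is a common numerical choice but not the only one.

```python
import numpy as np

def mahalanobis_sq(x, mu, sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = np.asarray(x, dtype=float) - np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # Solve Sigma z = diff instead of computing the inverse explicitly.
    return float(diff @ np.linalg.solve(sigma, diff))

def mvn_pdf(x, mu, sigma):
    """Density of Equation (1) at a single point x (Sigma must be nonsingular)."""
    sigma = np.asarray(sigma, dtype=float)
    n = sigma.shape[0]
    norm_const = np.sqrt((2.0 * np.pi) ** n * np.linalg.det(sigma))
    return float(np.exp(-0.5 * mahalanobis_sq(x, mu, sigma)) / norm_const)

# Hypothetical example values chosen so the results are easy to verify by hand.
mu = np.array([0.0, 0.0])
sigma = np.diag([4.0, 1.0])

d2 = mahalanobis_sq([2.0, 0.0], mu, sigma)       # 2^2 / 4 = 1
density_at_mean = mvn_pdf(mu, mu, np.eye(2))     # 1 / (2 * pi) when Sigma = I, n = 2
```

With a diagonal \(\Sigma\), the squared distance reduces to a sum of squared deviations, each divided by the corresponding variance, which is why `d2` equals 1 here.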
If \(\boldsymbol{x} \in \mathbb{R}^2\), the MVN is known as the bivariate normal distribution.
In this case,
\[
\begin{align*}
\Sigma &= \begin{bmatrix}
\text{Var } (X_1) & \text{Cov }[X_1, X_2] \\
\text{Cov }[X_2, X_1] & \text{Var } (X_2)
\end{bmatrix} \\\\
&= \begin{bmatrix}
\sigma_1^2 & \rho \sigma_1 \sigma_2 \\
\rho \sigma_1 \sigma_2 & \sigma_2^2
\end{bmatrix},
\end{align*}
\]
where \(\rho\) is the correlation coefficient defined by
\[
\rho = \text{Corr }[X_1, X_2] = \frac{\text{Cov }[X_1, X_2]}{\sqrt{\text{Var }(X_1)\text{Var }(X_2)}}.
\]
Then
\[
\begin{align*}
\det (\Sigma) &= \sigma_1^2 \sigma_2^2 - \rho^2 \sigma_1^2 \sigma_2^2 \\\\
&= \sigma_1^2 \sigma_2^2 (1 - \rho^2)
\end{align*}
\]
and
\[
\begin{align*}
\Sigma^{-1} &= \frac{1}{\det (\Sigma )}
\begin{bmatrix}
\sigma_2^2 & -\rho \sigma_1 \sigma_2 \\
-\rho \sigma_1 \sigma_2 & \sigma_1^2
\end{bmatrix} \\\\
&= \frac{1}{1 - \rho^2}
\begin{bmatrix}
\frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\
\frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 }
\end{bmatrix}.
\end{align*}
\]
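As a sanity check, the closed-form determinant and inverse can be compared against NumPy's general-purpose routines. This is only a sketch; the values of `s1`, `s2`, and `rho` are arbitrary assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical parameter values: sigma_1, sigma_2, and the correlation rho.
s1, s2, rho = 2.0, 0.5, 0.3

# Bivariate covariance matrix built from the parameterization above.
sigma = np.array([
    [s1**2,         rho * s1 * s2],
    [rho * s1 * s2, s2**2        ],
])

# Closed-form determinant: sigma_1^2 sigma_2^2 (1 - rho^2).
det_closed = s1**2 * s2**2 * (1.0 - rho**2)

# Closed-form inverse with the 1 / (1 - rho^2) prefactor.
inv_closed = (1.0 / (1.0 - rho**2)) * np.array([
    [1.0 / s1**2,      -rho / (s1 * s2)],
    [-rho / (s1 * s2),  1.0 / s2**2    ],
])

det_ok = bool(np.isclose(det_closed, np.linalg.det(sigma)))
inv_ok = bool(np.allclose(inv_closed, np.linalg.inv(sigma)))
```

Both flags should evaluate to `True` for any valid choice of positive standard deviations and \(|\rho| < 1\).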
Note that in Equation (1), \((\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu})\) is a quadratic form. In the bivariate case, it expands to
\[
\begin{align*}
(\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu})
&= \frac{1}{1 - \rho^2} \begin{bmatrix} X_1 - \mu_1 & X_2 - \mu_2 \end{bmatrix}
\begin{bmatrix}
\frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\
\frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 }
\end{bmatrix}
\begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{bmatrix} \\\\
&= \frac{1}{1 - \rho^2}\Big[\frac{1}{\sigma_1^2 }(X_1 - \mu_1)^2
-\frac{2\rho} {\sigma_1 \sigma_2}(X_1 - \mu_1)(X_2 - \mu_2)
+\frac{1}{\sigma_2^2 }(X_2 - \mu_2)^2 \Big].
\end{align*}
\]
Substituting the determinant and the quadratic form into Equation (1), we obtain the p.d.f. of the bivariate normal distribution:
\[
f(\boldsymbol{x}) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{(1 - \rho^2)}}
\exp\Big\{-\frac{1}{2(1 - \rho^2)}
\Big[\Big(\frac{X_1 - \mu_1}{\sigma_1}\Big)^2
-2\rho \Big(\frac{X_1 - \mu_1} {\sigma_1}\Big) \Big(\frac{X_2 - \mu_2} {\sigma_2}\Big)
+\Big(\frac{X_2 - \mu_2}{\sigma_2}\Big)^2
\Big]
\Big\}.
\]
When \(\rho = \pm 1\), we have \(\det(\Sigma) = 0\), so \(\Sigma\) is singular and this p.d.f. is undefined. In that case the distribution is said to be degenerate.
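The bivariate formula can be compared with the general expression in Equation (1) at a test point. The sketch below does this with NumPy; the function names `bivariate_pdf` and `general_pdf` and all parameter values are assumptions made for illustration. It also evaluates the closed-form determinant at \(\rho = 1\) to show where the density breaks down.

```python
import numpy as np

def bivariate_pdf(x1, x2, mu1, mu2, s1, s2, rho):
    """Closed-form bivariate normal density; requires |rho| < 1."""
    z1 = (x1 - mu1) / s1
    z2 = (x2 - mu2) / s2
    quad = (z1**2 - 2.0 * rho * z1 * z2 + z2**2) / (1.0 - rho**2)
    return np.exp(-0.5 * quad) / (2.0 * np.pi * s1 * s2 * np.sqrt(1.0 - rho**2))

def general_pdf(x, mu, sigma):
    """Density of Equation (1) for a nonsingular covariance matrix."""
    diff = x - mu
    quad = diff @ np.linalg.solve(sigma, diff)
    n = mu.shape[0]
    return np.exp(-0.5 * quad) / np.sqrt((2.0 * np.pi) ** n * np.linalg.det(sigma))

# Hypothetical parameters and test point.
mu1, mu2, s1, s2, rho = 1.0, -1.0, 2.0, 0.5, 0.3
mu = np.array([mu1, mu2])
sigma = np.array([[s1**2, rho * s1 * s2],
                  [rho * s1 * s2, s2**2]])
x = np.array([1.5, -0.8])

forms_agree = bool(np.isclose(
    bivariate_pdf(x[0], x[1], mu1, mu2, s1, s2, rho),
    general_pdf(x, mu, sigma),
))

# Degenerate case: with rho = 1 the closed-form determinant vanishes,
# so the normalizing constant would require division by zero.
rho_deg = 1.0
det_degenerate = s1**2 * s2**2 * (1.0 - rho_deg**2)
```

Here `forms_agree` should be `True`, and `det_degenerate` is exactly zero, so `bivariate_pdf` must not be called with \(\rho = \pm 1\).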