Multivariate Normal Distribution
In machine learning, the most important joint probability distribution for continuous random variables is the
multivariate normal distribution (MVN).
The multivariate normal distribution of an \(n\)-dimensional random vector \(\boldsymbol{x} \in \mathbb{R}^n \) is denoted as
\[
\boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma)
\]
where \(\boldsymbol{\mu} = \mathbb{E} [\boldsymbol{x}] \in \mathbb{R}^n\) is the mean vector, and
\(\Sigma = \text{Cov }[\boldsymbol{x}] \in \mathbb{R}^{n \times n}\) is the covariance matrix.
The p.d.f. is given by
\[
f(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp \Big[-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}) \Big]. \tag{1}
\]
The expression inside the exponential (ignoring the factor of \(-\frac{1}{2}\)) is the squared Mahalanobis distance between
the data vector \(\boldsymbol{x}\) and the mean vector \(\boldsymbol{\mu}\), given by
\[
d_{\Sigma} (\boldsymbol{x}, \boldsymbol{\mu})^2 = (\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}).
\]
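As a quick numerical check, here is a minimal Python sketch (assuming NumPy and SciPy are available; the mean, covariance, and data point are made-up values) that computes the squared Mahalanobis distance directly from the definition and compares it with scipy.spatial.distance.mahalanobis:
```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Illustrative mean vector, covariance matrix, and data point.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
x = np.array([2.0, 0.5])

# Squared Mahalanobis distance computed directly from the definition.
diff = x - mu
d2 = diff @ np.linalg.inv(Sigma) @ diff

# SciPy returns the (unsquared) distance and expects the inverse covariance.
d2_scipy = mahalanobis(x, mu, np.linalg.inv(Sigma)) ** 2

print(d2, d2_scipy)   # the two values agree up to floating-point error
```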
If \(\boldsymbol{x} \in \mathbb{R}^2\), the MVN is known as the bivariate normal distribution.
In this case,
\[
\begin{align*}
\Sigma &= \begin{bmatrix}
\text{Var } (X_1) & \text{Cov }[X_1, X_2] \\
\text{Cov }[X_2, X_1] & \text{Var } (X_2)
\end{bmatrix} \\\\
&= \begin{bmatrix}
\sigma_1^2 & \rho \sigma_1 \sigma_2 \\
\rho \sigma_1 \sigma_2 & \sigma_2^2
\end{bmatrix} \\\\
\end{align*}
\]
where \(\rho\) is the correlation coefficient, defined by
\[
\rho = \text{Corr }[X_1, X_2] = \frac{\text{Cov }[X_1, X_2]}{\sqrt{\text{Var }(X_1)\text{Var }(X_2)}}.
\]
Then
\[
\begin{align*}
\det (\Sigma) &= \sigma_1^2 \sigma_2^2 - \rho^2 \sigma_1^2 \sigma_2^2 \\\\
&= \sigma_1^2 \sigma_2^2 (1 - \rho^2)
\end{align*}
\]
and
\[
\begin{align*}
\Sigma^{-1} &= \frac{1}{\det (\Sigma )}
\begin{bmatrix}
\sigma_2^2 & -\rho \sigma_1 \sigma_2 \\
-\rho \sigma_1 \sigma_2 & \sigma_1^2
\end{bmatrix} \\\\
&= \frac{1}{1 - \rho^2}
\begin{bmatrix}
\frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\
\frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 }
\end{bmatrix}
\end{align*}
\]
Note that in Expression (1), \((\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu})\) is a quadratic form. So,
\[
\begin{align*}
(\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu})
&= \frac{1}{1 - \rho^2} \begin{bmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{bmatrix}
\begin{bmatrix}
\frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\
\frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 }
\end{bmatrix}
\begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \\\\
&= \frac{1}{1 - \rho^2}\Big[\frac{1}{\sigma_1^2 }(x_1 - \mu_1)^2
-\frac{2\rho} {\sigma_1 \sigma_2}(x_1 - \mu_1)(x_2 - \mu_2)
+\frac{1}{\sigma_2^2 }(x_2 - \mu_2)^2 \Big].
\end{align*}
\]
Therefore, we obtain the p.d.f. for the bivariate normal distribution:
\[
f(\boldsymbol{x}) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}}
\exp\Big\{-\frac{1}{2(1 - \rho^2)}
\Big[\Big(\frac{x_1 - \mu_1}{\sigma_1}\Big)^2
-2\rho \Big(\frac{x_1 - \mu_1} {\sigma_1}\Big) \Big(\frac{x_2 - \mu_2} {\sigma_2}\Big)
+\Big(\frac{x_2 - \mu_2}{\sigma_2}\Big)^2
\Big]
\Big\}.
\]
When \(\rho = -1\) or \(1\), the covariance matrix is singular, this p.d.f. is undefined, and the distribution is said to be degenerate.
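As a sanity check on the bivariate density above, here is a minimal Python sketch (assuming NumPy and SciPy; the parameter values are made up for illustration) that evaluates the formula directly and compares it with scipy.stats.multivariate_normal:
```python
import numpy as np
from scipy.stats import multivariate_normal

# Illustrative parameters for the bivariate case.
mu1, mu2 = 0.0, 1.0
s1, s2, rho = 1.5, 0.8, 0.6

mu = np.array([mu1, mu2])
Sigma = np.array([[s1**2,         rho * s1 * s2],
                  [rho * s1 * s2, s2**2        ]])

def bvn_pdf(x1, x2):
    """Bivariate normal density written out as in the formula above."""
    q = ((x1 - mu1) / s1) ** 2 \
        - 2 * rho * ((x1 - mu1) / s1) * ((x2 - mu2) / s2) \
        + ((x2 - mu2) / s2) ** 2
    norm = 2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2)
    return np.exp(-q / (2 * (1 - rho**2))) / norm

x = np.array([0.7, 0.2])
print(bvn_pdf(*x))
print(multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match
```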
Cholesky Decomposition
In the previous section, we introduced the multivariate Gaussian distribution
\( \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \), where the covariance matrix
\( \boldsymbol{\Sigma} \) determines the shape and orientation of the distribution.
To efficiently generate samples from such a distribution, we often need to
transform uncorrelated standard normal samples into correlated ones that follow
\( \boldsymbol{\Sigma} \).
The Cholesky decomposition provides a convenient and numerically stable way
to perform this transformation.
Cholesky Decomposition:
Let \( \boldsymbol{A} \in \mathbb{R}^{d \times d} \) be a
symmetric positive definite matrix.
Then there exists a unique lower triangular matrix \( \boldsymbol{L} \) with positive diagonal
entries such that
\[
\boldsymbol{A} = \boldsymbol{L}\boldsymbol{L}^\top.
\]
(Equivalently, \( \boldsymbol{A} = \boldsymbol{R}^\top \boldsymbol{R} \) where
\( \boldsymbol{R} = \boldsymbol{L}^\top \) is upper triangular.)
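A minimal Python sketch of this factorization (assuming NumPy; the matrix is an arbitrary positive definite example) uses np.linalg.cholesky, which returns the lower triangular factor:
```python
import numpy as np

# A small symmetric positive definite matrix (made up for illustration).
A = np.array([[4.0, 2.0, 0.6],
              [2.0, 5.0, 1.0],
              [0.6, 1.0, 3.0]])

# np.linalg.cholesky returns the lower-triangular factor L with A = L L^T.
L = np.linalg.cholesky(A)

print(np.allclose(L @ L.T, A))      # True: the factorization reproduces A
print(np.allclose(L, np.tril(L)))   # True: L is lower triangular
print(np.all(np.diag(L) > 0))       # True: positive diagonal entries
```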
Suppose we want to sample \( \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \). Apply the
Cholesky decomposition to the covariance matrix:
\[
\boldsymbol{\Sigma} = \boldsymbol{L}\boldsymbol{L}^\top.
\]
If \( \boldsymbol{x} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{I}) \), which can be obtained by sampling
\( d \) independent standard normal variables, then
\[
\boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu}
\]
has the desired covariance:
\[
\begin{align*}
\text{Cov}[\boldsymbol{y}]
&= \boldsymbol{L}\,\text{Cov}[\boldsymbol{x}]\,\boldsymbol{L}^\top \\
&= \boldsymbol{L}\boldsymbol{I}\boldsymbol{L}^\top \\
&= \boldsymbol{\Sigma}.
\end{align*}
\]
Proof:
By linearity of expectation,
\[
\mathbb{E}[\boldsymbol{y}]
= \mathbb{E}[\boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu}]
= \boldsymbol{L}\mathbb{E}[\boldsymbol{x}] + \boldsymbol{\mu}
= \boldsymbol{\mu}.
\]
The covariance of a random vector is given by
\[
\text{Cov}[\boldsymbol{y}]
= \mathbb{E}\!\left[(\boldsymbol{y} - \mathbb{E}[\boldsymbol{y}])
(\boldsymbol{y} - \mathbb{E}[\boldsymbol{y}])^\top \right].
\]
Substituting \( \boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu} \) and using
\( \mathbb{E}[\boldsymbol{x}] = \boldsymbol{0} \),
\[
\begin{align*}
\text{Cov}[\boldsymbol{y}]
&= \mathbb{E}\!\left[ (\boldsymbol{L}\boldsymbol{x})
(\boldsymbol{L}\boldsymbol{x})^\top \right] \\\\
&= \boldsymbol{L}\,\mathbb{E}\!\left[\boldsymbol{x}\boldsymbol{x}^\top\right]\boldsymbol{L}^\top \\\\
&= \boldsymbol{L}\,\text{Cov}[\boldsymbol{x}]\,\boldsymbol{L}^\top \\\\
&= \boldsymbol{L}\boldsymbol{I}\boldsymbol{L}^\top \\\\
&= \boldsymbol{L}\boldsymbol{L}^\top \\\\
&= \boldsymbol{\Sigma}.
\end{align*}
\]
Since an affine transformation of a Gaussian vector is again Gaussian, we conclude
\( \boldsymbol{y} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{L}\boldsymbol{L}^\top) = \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}) \).
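The sampling recipe can be checked numerically. The sketch below (Python with NumPy; the mean, covariance, and sample size are illustrative choices) draws standard normal vectors, applies \( \boldsymbol{y} = \boldsymbol{L}\boldsymbol{x} + \boldsymbol{\mu} \), and confirms that the empirical mean and covariance are close to \( \boldsymbol{\mu} \) and \( \boldsymbol{\Sigma} \):
```python
import numpy as np

rng = np.random.default_rng(0)

# Target mean and covariance (illustrative values; Sigma is positive definite).
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.8, 0.2],
                  [0.8, 1.5, 0.4],
                  [0.2, 0.4, 1.0]])

# Transform i.i.d. standard normal draws x ~ N(0, I) into y = L x + mu.
L = np.linalg.cholesky(Sigma)
n_samples = 200_000
x = rng.standard_normal((n_samples, 3))
y = x @ L.T + mu                       # each row is one sample of y

# The empirical mean and covariance should be close to mu and Sigma.
print(y.mean(axis=0))
print(np.cov(y, rowvar=False))
```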
Dirichlet Distribution
The Dirichlet distribution is a multivariate generalization of the beta distribution.
It has support over the \((K - 1)\)-dimensional probability simplex, defined by
\[
S_K = \Bigl\{(x_1, x_2, \dots, x_K) \in \mathbb{R}^K: x_k \ge 0,\ \sum_{k=1}^K x_k = 1 \Bigr\}.
\]
A random vector \(\boldsymbol{x} \in \mathbb{R}^K\) is said to have a Dirichlet distribution with parameters
\(\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \dots, \alpha_K)\) (with each \(\alpha_k > 0\)) if its probability
density function is given by
\[
f(x_1, \dots, x_K; \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1}, \quad (x_1, \dots, x_K) \in S_K,
\]
or, more compactly,
\[
\operatorname{Dir}(\boldsymbol{x} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1}\, \mathbb{I}(\boldsymbol{x} \in S_K),
\]
where the multivariate beta function \(B(\boldsymbol{\alpha})\) is defined as
\[
B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma\Bigl(\sum_{k=1}^K \alpha_k\Bigr)}.
\]
Moments:
Let \(\alpha_0 = \sum_{k=1}^K \alpha_k\). Then for each \(k\),
- Mean:
\[
\mathbb{E}[x_k] = \frac{\alpha_k}{\alpha_0}.
\]
- Variance:
\[
\operatorname{Var}[x_k] = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0+1)}.
\]
Note: Often we use a symmetric Dirichlet prior of the form \(\alpha_k = \frac{\alpha}{K}\). Then
\[
\mathbb{E}[x_k] = \frac{1}{K}, \quad \operatorname{Var}[x_k] = \frac{K-1}{K^2 (\alpha +1)}.
\]
We can see that increasing \(\alpha\) increases the precision (decreases the variance) of the distribution.
- Covariance: For \(i \neq j\),
\[
\operatorname{Cov}[x_i, x_j] = \frac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0+1)}.
\]
The parameters \(\alpha_k\) can be thought of as "pseudocounts" or prior observations of each category. When all
\(\alpha_k\) are equal (i.e., \(\boldsymbol{\alpha} = \alpha\, \mathbf{1}\)), the distribution is said to be symmetric,
and when \(\alpha = 1\) it is the uniform distribution over the simplex. This symmetry makes the Dirichlet distribution a
natural prior in Bayesian models where no category is favored a priori.
The Dirichlet distribution is widely used as a conjugate prior for the parameters
of a multinomial distribution in Bayesian statistics, as well as in machine learning models
such as latent Dirichlet allocation (LDA) for topic modeling.
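To make the density and moment formulas concrete, here is a minimal Python sketch (assuming NumPy and SciPy; the concentration parameters are arbitrary illustrative values) that checks the mean and variance formulas by Monte Carlo and evaluates the density with scipy.stats.dirichlet:
```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 5.0])   # illustrative concentration parameters
alpha0 = alpha.sum()

# Closed-form mean and variance from the formulas above.
mean_formula = alpha / alpha0
var_formula = alpha * (alpha0 - alpha) / (alpha0**2 * (alpha0 + 1))

# Monte Carlo check using samples drawn from the probability simplex.
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha, size=200_000)
print(mean_formula, samples.mean(axis=0))   # means should agree closely
print(var_formula, samples.var(axis=0))     # variances should agree closely

# The density can be evaluated at a point of the simplex with SciPy.
x = np.array([0.2, 0.3, 0.5])
print(dirichlet(alpha).pdf(x))
```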
Wishart Distribution
The Wishart distribution is a fundamental multivariate distribution that generalizes the gamma distribution to
positive definite matrices; it is a distribution over \(D \times D\) symmetric positive definite matrices \(\boldsymbol{\Sigma}\).
The p.d.f. of the Wishart distribution is defined as follows:
\[
\operatorname{Wi}(\boldsymbol{\Sigma} | \boldsymbol{S}, \nu)
= \frac{1}{Z} |\boldsymbol{\Sigma}| ^{(\nu-D-1)/2} \exp \Bigl( - \frac{1}{2} \text{ tr } (\boldsymbol{S}^{-1} \boldsymbol{\Sigma})\Bigr)
\]
where
\[
Z = 2^{\nu D/2}\, | \boldsymbol{S} |^{\nu / 2}\,\Gamma_D \Bigl(\frac{\nu}{2}\Bigr).
\]
Here, \(\Gamma_D(\cdot)\) denotes the multivariate gamma function. The parameters are:
- \(\nu\): the degrees of freedom (which must satisfy \(\nu > D - 1\) for the density to be well-defined),
- \(\boldsymbol{S}\): the scale matrix (a \(D \times D\) positive definite matrix).
Note: the mean is \(\mathbb{E}[\boldsymbol{\Sigma}] = \nu \boldsymbol{S}\).
If \(D=1\), it reduces to the gamma distribution:
\[
\operatorname{Wi}(\lambda \mid s, \nu) = \operatorname{Gamma}\Bigl(\lambda \,\Big|\, \text{shape} = \frac{\nu}{2},\ \text{rate} = \frac{1}{2s}\Bigr).
\]
If we set \(s=1\), this further reduces to the chi-squared distribution with \(\nu\) degrees of freedom.
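This one-dimensional reduction can be verified numerically. The sketch below (Python with NumPy and SciPy; \(\nu\) and \(s\) are arbitrary illustrative values) implements the \(D = 1\) Wishart density from the formula above and compares it with the corresponding gamma density (SciPy parameterizes the gamma by shape and scale, so rate \(\frac{1}{2s}\) becomes scale \(2s\)) and, for \(s = 1\), with the chi-squared density:
```python
import numpy as np
from scipy.special import gammaln
from scipy.stats import gamma, chi2

# One-dimensional Wishart density, written out from the formula above (D = 1).
def wishart_1d_pdf(lam, s, nu):
    log_z = (nu / 2) * np.log(2) + (nu / 2) * np.log(s) + gammaln(nu / 2)
    return np.exp((nu - 2) / 2 * np.log(lam) - lam / (2 * s) - log_z)

nu, s = 5.0, 1.5                          # illustrative values
lam = np.linspace(0.1, 10.0, 5)

# Gamma with shape nu/2 and rate 1/(2s); SciPy uses scale = 1/rate = 2s.
print(wishart_1d_pdf(lam, s, nu))
print(gamma(a=nu / 2, scale=2 * s).pdf(lam))          # should match

# With s = 1 the density coincides with the chi-squared density (nu dof).
print(np.allclose(wishart_1d_pdf(lam, 1.0, nu), chi2(df=nu).pdf(lam)))
```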
Applications:
- Covariance Matrix Estimation:
In multivariate statistics, the Wishart distribution is used to model the uncertainty in covariance matrix estimates.
Given \(N\) i.i.d. samples \(\boldsymbol{x}_i \in \mathbb{R}^D\) from \(\mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})\), the scatter matrix (the unnormalized sample covariance)
\[
\boldsymbol{S} = \sum_{i=1}^N \boldsymbol{x}_i {\boldsymbol{x}_i}^T
\]
follows a Wishart distribution:
\[
\boldsymbol{S} \sim \operatorname{Wi}( \boldsymbol{\Sigma}, N).
\]
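As a quick Monte Carlo check of this fact, the sketch below (Python with NumPy; the covariance, sample size, and number of replications are illustrative) averages many simulated scatter matrices and compares the result with the Wishart mean \(N \boldsymbol{\Sigma}\):
```python
import numpy as np

rng = np.random.default_rng(0)

Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])   # illustrative true covariance (D = 2)
N = 8                            # number of samples per scatter matrix
reps = 50_000                    # Monte Carlo replications

# Draw N samples from N(0, Sigma) in each replication and form the
# scatter matrix sum_i x_i x_i^T, which follows Wi(Sigma, N).
X = rng.multivariate_normal(np.zeros(2), Sigma, size=(reps, N))
scatter = np.einsum('rni,rnj->rij', X, X)

# The Wishart mean is N * Sigma, so the Monte Carlo average should be close.
print(scatter.mean(axis=0))
print(N * Sigma)
```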
- Bayesian Inference:
It serves as a conjugate prior for the covariance matrix in multivariate normal models, enabling closed-form
updates of the posterior distribution. We often use the inverse Wishart distribution:
For \(\nu > D -1\) and \(\boldsymbol{S} \succ 0\),
\[
\operatorname{IW}(\boldsymbol{\Sigma} | \boldsymbol{S}^{-1}, \nu)
= \frac{1}{Z_{IW}} |\boldsymbol{\Sigma}| ^{-(\nu+D+1)/2} \exp \Bigl( - \frac{1}{2} \text{ tr } (\boldsymbol{S} \boldsymbol{\Sigma}^{-1})\Bigr)
\]
where
\[
Z_{IW} = 2^{\nu D/2}\, | \boldsymbol{S} |^{-\nu / 2}\,\Gamma_D \Bigl(\frac{\nu}{2}\Bigr).
\]
That is, we place a prior of the form
\[
\boldsymbol{\Sigma} \sim \operatorname{IW}( \boldsymbol{S}, \nu)
\]
on the covariance matrix, and the posterior after observing Gaussian data is again inverse Wishart.
(Recall that if \(\lambda \sim \text{Gamma }(a, b)\), then \(\frac{1}{\lambda} \sim \operatorname{IG}(a,b)\).
Similarly, the IW is the multivariate generalization of the inverse gamma.)
Note: the first application concerns the sampling properties of an estimator (frequentist), while the second concerns prior beliefs and posterior
inference (Bayesian). In practice, both deal with uncertainty in covariance matrices, and the same distribution arises in
both roles, just with different interpretations.
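To illustrate the conjugacy described in the Bayesian application, here is a minimal Python sketch of the closed-form update in the simplest setting of a known (zero) mean, using the standard result that an \(\operatorname{IW}(\boldsymbol{S}, \nu)\) prior combined with \(N\) observations \(\boldsymbol{x}_i \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma})\) yields an \(\operatorname{IW}\bigl(\boldsymbol{S} + \sum_i \boldsymbol{x}_i \boldsymbol{x}_i^T,\ \nu + N\bigr)\) posterior; the prior parameters and synthetic data are hypothetical:
```python
import numpy as np

def iw_posterior(S_prior, nu_prior, X):
    """Conjugate update for the covariance of a zero-mean Gaussian.

    Prior:      Sigma ~ IW(S_prior, nu_prior)
    Likelihood: x_i ~ N(0, Sigma), i.i.d. rows of X
    Posterior:  Sigma ~ IW(S_prior + sum_i x_i x_i^T, nu_prior + N)
    """
    N = X.shape[0]
    scatter = X.T @ X                 # sum of outer products x_i x_i^T
    return S_prior + scatter, nu_prior + N

# Illustrative prior and synthetic data (hypothetical values).
rng = np.random.default_rng(0)
S0, nu0 = np.eye(2), 4.0
true_Sigma = np.array([[1.0, 0.5],
                       [0.5, 1.0]])
X = rng.multivariate_normal(np.zeros(2), true_Sigma, size=100)

S_post, nu_post = iw_posterior(S0, nu0, X)
# The mean of IW(S, nu) in D dimensions is S / (nu - D - 1), so the
# posterior mean of Sigma should be close to the true covariance.
print(S_post / (nu_post - 2 - 1))
```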