Multivariate Distributions


Multivariate Normal Distribution

In machine learning, the most important joint probability distribution for continuous random variables is the multivariate normal distribution (MVN). The multivariate normal distribution of an \(n\)-dimensional random vector \(\boldsymbol{x} \in \mathbb{R}^n\) is denoted \[ \boldsymbol{x} \sim \mathcal{N}(\boldsymbol{\mu}, \Sigma) \] where \(\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{x}] \in \mathbb{R}^n\) is the mean vector and \(\Sigma = \text{Cov}[\boldsymbol{x}] \in \mathbb{R}^{n \times n}\) is the covariance matrix.
The p.d.f. is given by \[ f(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma)}} \exp \Big[-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}) \Big]. \tag{1} \] The expression inside the exponential (ignoring the factor of \(-\frac{1}{2}\)) is the squared Mahalanobis distance between the data vector \(\boldsymbol{x}\) and the mean vector \(\boldsymbol{\mu}\): \[ d_{\Sigma} (\boldsymbol{x}, \boldsymbol{\mu})^2 = (\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}). \]
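As a sanity check on Eq. (1), the sketch below (assuming NumPy and SciPy are available, with arbitrary example values for \(\boldsymbol{\mu}\) and \(\Sigma\)) computes the squared Mahalanobis distance and the density by hand and compares against scipy.stats.multivariate_normal:

```python
# Minimal numerical check of Eq. (1); mu, Sigma, and x are illustrative.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([1.0, -2.0, 0.5])                     # example mean vector
Sigma = np.array([[2.0, 0.3, 0.1],
                  [0.3, 1.0, 0.2],
                  [0.1, 0.2, 0.5]])                 # example covariance matrix
x = np.array([0.5, -1.0, 1.0])

# Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu).
diff = x - mu
d2 = diff @ np.linalg.solve(Sigma, diff)            # solve() avoids an explicit inverse

# Density from Eq. (1).
n = len(mu)
pdf_manual = np.exp(-0.5 * d2) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

# Cross-check against SciPy's implementation.
pdf_scipy = multivariate_normal(mean=mu, cov=Sigma).pdf(x)
assert np.isclose(pdf_manual, pdf_scipy)
```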
If \(\boldsymbol{x} \in \mathbb{R}^2\), the MVN is known as the bivariate normal distribution. In this case, \[ \begin{align*} \Sigma &= \begin{bmatrix} \text{Var}(X_1) & \text{Cov}[X_1, X_2] \\ \text{Cov}[X_2, X_1] & \text{Var}(X_2) \end{bmatrix} \\\\ &= \begin{bmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{bmatrix} \end{align*} \] where \(\rho\) is the correlation coefficient defined by \[ \rho = \text{Corr}[X_1, X_2] = \frac{\text{Cov}[X_1, X_2]}{\sqrt{\text{Var}(X_1)\,\text{Var}(X_2)}}. \] Then \[ \begin{align*} \det (\Sigma) &= \sigma_1^2 \sigma_2^2 - \rho^2 \sigma_1^2 \sigma_2^2 \\\\ &= \sigma_1^2 \sigma_2^2 (1 - \rho^2) \end{align*} \] and \[ \begin{align*} \Sigma^{-1} &= \frac{1}{\det (\Sigma )} \begin{bmatrix} \sigma_2^2 & -\rho \sigma_1 \sigma_2 \\ -\rho \sigma_1 \sigma_2 & \sigma_1^2 \end{bmatrix} \\\\ &= \frac{1}{1 - \rho^2} \begin{bmatrix} \frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\ \frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 } \end{bmatrix}. \end{align*} \] Note that in Expression (1), \((\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu})\) is a quadratic form. Writing \(\boldsymbol{x} = (x_1, x_2)\), \[ \begin{align*} (\boldsymbol{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\boldsymbol{x} -\boldsymbol{\mu}) &= \frac{1}{1 - \rho^2} \begin{bmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{bmatrix} \begin{bmatrix} \frac{1}{\sigma_1^2 } & \frac{-\rho} {\sigma_1 \sigma_2} \\ \frac{-\rho} {\sigma_1 \sigma_2} & \frac{1}{\sigma_2^2 } \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} \\\\ &= \frac{1}{1 - \rho^2}\Big[\frac{1}{\sigma_1^2 }(x_1 - \mu_1)^2 -\frac{2\rho} {\sigma_1 \sigma_2}(x_1 - \mu_1)(x_2 - \mu_2) +\frac{1}{\sigma_2^2 }(x_2 - \mu_2)^2 \Big]. \end{align*} \] Therefore, we obtain the p.d.f. of the bivariate normal distribution: \[ f(\boldsymbol{x}) = \frac{1}{2\pi \sigma_1 \sigma_2 \sqrt{1 - \rho^2}} \exp\Big\{-\frac{1}{2(1 - \rho^2)} \Big[\Big(\frac{x_1 - \mu_1}{\sigma_1}\Big)^2 -2\rho \Big(\frac{x_1 - \mu_1} {\sigma_1}\Big) \Big(\frac{x_2 - \mu_2} {\sigma_2}\Big) +\Big(\frac{x_2 - \mu_2}{\sigma_2}\Big)^2 \Big] \Big\}. \] When \(\rho = \pm 1\), \(\Sigma\) is singular, this p.d.f. is undefined, and the distribution is said to be degenerate.
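The closed-form bivariate density can likewise be checked against the general formula; the parameter values below are illustrative:

```python
# Check that the closed-form bivariate density matches Eq. (1).
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2 = 0.0, 1.0
s1, s2, rho = 1.5, 0.8, -0.6
Sigma = np.array([[s1**2, rho * s1 * s2],
                  [rho * s1 * s2, s2**2]])

x1, x2 = 0.7, 0.2
z1, z2 = (x1 - mu1) / s1, (x2 - mu2) / s2

# Closed-form bivariate density derived above.
quad = (z1**2 - 2 * rho * z1 * z2 + z2**2) / (1 - rho**2)
pdf_bivariate = np.exp(-0.5 * quad) / (2 * np.pi * s1 * s2 * np.sqrt(1 - rho**2))

# General MVN density for comparison.
pdf_general = multivariate_normal(mean=[mu1, mu2], cov=Sigma).pdf([x1, x2])
assert np.isclose(pdf_bivariate, pdf_general)
```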

Dirichlet Distribution

The Dirichlet distribution is a multivariate generalization of the beta distribution. It has support over the \((K - 1)\)-dimensional probability simplex, defined by \[ S_K = \Bigl\{(x_1, x_2, \dots, x_K) \in \mathbb{R}^K: x_k \ge 0,\ \sum_{k=1}^K x_k = 1 \Bigr\}. \] A random vector \(\boldsymbol{x} \in \mathbb{R}^K\) is said to have a Dirichlet distribution with parameters \(\boldsymbol{\alpha} = (\alpha_1, \alpha_2, \dots, \alpha_K)\) (with each \(\alpha_k > 0\)) if its probability density function is given by \[ f(x_1, \dots, x_K; \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1}, \quad (x_1, \dots, x_K) \in S_K, \] or, equivalently, \[ \operatorname{Dir}(\boldsymbol{x} \mid \boldsymbol{\alpha}) = \frac{1}{B(\boldsymbol{\alpha})} \prod_{k=1}^K x_k^{\alpha_k - 1}\, \mathbb{I}(\boldsymbol{x} \in S_K), \] where the multivariate beta function \(B(\boldsymbol{\alpha})\) is defined as \[ B(\boldsymbol{\alpha}) = \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma\Bigl(\sum_{k=1}^K \alpha_k\Bigr)}. \]
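As a concrete check, the sketch below (assuming SciPy, with an arbitrary example \(\boldsymbol{\alpha}\)) evaluates \(\log B(\boldsymbol{\alpha})\) and the log-density by hand and compares against scipy.stats.dirichlet:

```python
# Evaluate the Dirichlet log-density and its normalizer B(alpha) directly.
import numpy as np
from scipy.stats import dirichlet
from scipy.special import gammaln

alpha = np.array([2.0, 3.0, 5.0])      # example parameter vector
x = np.array([0.2, 0.3, 0.5])          # a point on the 2-simplex (sums to 1)

# log B(alpha) = sum_k log Gamma(alpha_k) - log Gamma(sum_k alpha_k)
log_B = gammaln(alpha).sum() - gammaln(alpha.sum())
log_pdf_manual = np.sum((alpha - 1) * np.log(x)) - log_B

assert np.isclose(log_pdf_manual, dirichlet(alpha).logpdf(x))
```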

Moments: Let \(\alpha_0 = \sum_{k=1}^K \alpha_k\). Then for each \(k\),
  • Mean: \[ \mathbb{E}[x_k] = \frac{\alpha_k}{\alpha_0}. \]
  • Variance: \[ \operatorname{Var}[x_k] = \frac{\alpha_k (\alpha_0 - \alpha_k)}{\alpha_0^2 (\alpha_0+1)}. \]
  • Note: Often we use a symmetric Dirichlet prior of the form \(\alpha_k = \frac{\alpha}{K}\). Then \[ \mathbb{E}[x_k] = \frac{1}{K}, \quad \operatorname{Var}[x_k] = \frac{K-1}{K^2 (\alpha +1)}. \] We can see that increasing \(\alpha\) increases the precision (decreases the variance) of the distribution.
  • Covariance: For \(i \neq j\), \[ \operatorname{Cov}[x_i, x_j] = \frac{-\alpha_i \alpha_j}{\alpha_0^2 (\alpha_0+1)}. \] (These moment formulas are verified numerically in the sketch below.)
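As a quick numerical check of the formulas above, assuming NumPy and an arbitrary example \(\boldsymbol{\alpha}\):

```python
# Monte Carlo check of the Dirichlet mean, variance, and covariance formulas.
import numpy as np

rng = np.random.default_rng(0)
alpha = np.array([2.0, 3.0, 5.0])
a0 = alpha.sum()
samples = rng.dirichlet(alpha, size=500_000)        # shape (500000, 3)

print(samples.mean(axis=0), alpha / a0)             # E[x_k] = alpha_k / a0
print(samples.var(axis=0),
      alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))    # Var[x_k]
print(np.cov(samples[:, 0], samples[:, 1])[0, 1],
      -alpha[0] * alpha[1] / (a0**2 * (a0 + 1)))    # Cov[x_i, x_j], i != j
```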
The parameters \(\alpha_k\) can be thought of as "pseudocounts," or prior observations of each category. When all \(\alpha_k\) are equal (i.e., \(\boldsymbol{\alpha} = \alpha\, \mathbf{1}\)), the distribution is called symmetric; in the special case \(\alpha = 1\), it is the uniform distribution over the simplex. This symmetry makes the Dirichlet distribution a natural prior in Bayesian models where no category is favored a priori.

The Dirichlet distribution is widely used as a conjugate prior for the parameters of a multinomial distribution in Bayesian statistics, as well as in machine learning models such as latent Dirichlet allocation (LDA) for topic modeling.
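To make the conjugacy concrete: with a \(\operatorname{Dir}(\boldsymbol{\alpha})\) prior on the category probabilities and observed multinomial counts \(\boldsymbol{n}\), the posterior is \(\operatorname{Dir}(\boldsymbol{\alpha} + \boldsymbol{n})\). A minimal sketch, assuming NumPy and hypothetical observed counts:

```python
# Dirichlet-multinomial conjugacy: posterior parameters are prior + counts.
import numpy as np

alpha_prior = np.array([1.0, 1.0, 1.0])      # symmetric, uniform prior (alpha = 1)
counts = np.array([12, 7, 31])               # hypothetical observed category counts

alpha_post = alpha_prior + counts            # conjugate posterior parameters
posterior_mean = alpha_post / alpha_post.sum()
print(posterior_mean)                        # smoothed estimate of category probabilities
```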

Wishart Distribution

The Wishart distribution is a fundamental multivariate distribution that generalizes the gamma distribution to positive definite matrices. The p.d.f. of the Wishart distribution is defined as follows: \[ \operatorname{Wi}(\boldsymbol{\Sigma} \mid \boldsymbol{S}, \nu) = \frac{1}{Z} |\boldsymbol{\Sigma}| ^{(\nu-D-1)/2} \exp \Bigl( - \frac{1}{2} \operatorname{tr} (\boldsymbol{S}^{-1} \boldsymbol{\Sigma})\Bigr) \] where \[ Z = 2^{\nu D/2}\, | \boldsymbol{S} |^{\nu / 2}\,\Gamma_D \Bigl(\frac{\nu}{2}\Bigr). \] Here, \(\Gamma_D(\cdot)\) denotes the multivariate gamma function. The parameters are the degrees of freedom \(\nu > D - 1\) and the \(D \times D\) positive definite scale matrix \(\boldsymbol{S}\), where \(D\) is the dimension of \(\boldsymbol{\Sigma}\).

Note: the mean is \(\mathbb{E}[\boldsymbol{\Sigma}] = \nu \boldsymbol{S}\).
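A quick simulation check of this mean formula, assuming SciPy's scipy.stats.wishart (whose df and scale parameters correspond to \(\nu\) and \(\boldsymbol{S}\) here); the values of \(\nu\) and \(\boldsymbol{S}\) are arbitrary examples:

```python
# Check E[Sigma] = nu * S by simulation (nu must exceed D - 1).
import numpy as np
from scipy.stats import wishart

nu = 7
S = np.array([[1.0, 0.3],
              [0.3, 2.0]])
rng = np.random.default_rng(0)
draws = wishart(df=nu, scale=S).rvs(size=200_000, random_state=rng)

print(draws.mean(axis=0))   # approximately nu * S
print(nu * S)
```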

If \(D=1\), it reduces to the gamma distribution: \[ \operatorname{Wi}(\lambda \mid s, \nu) = \operatorname{Gamma}\Bigl(\lambda \,\Big|\, \text{shape} = \frac{\nu}{2},\ \text{rate} = \frac{1}{2s}\Bigr). \] If we set \(s=1\), this further reduces to the chi-squared distribution \(\chi^2_\nu\).
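The reduction can be checked numerically, assuming SciPy (where rate \(= \frac{1}{2s}\) corresponds to scale \(= 2s\) in scipy.stats.gamma's parameterization); \(\nu\) and \(s\) below are example values:

```python
# Verify the D = 1 reduction: Wi(lambda | s, nu) == Gamma(shape = nu/2, rate = 1/(2s)).
import numpy as np
from scipy.stats import wishart, gamma

nu, s = 5, 1.5
lam = np.linspace(0.1, 10, 50)

pdf_wishart = wishart(df=nu, scale=s).pdf(lam)
pdf_gamma = gamma(a=nu / 2, scale=2 * s).pdf(lam)   # scale = 1/rate = 2s
assert np.allclose(pdf_wishart, pdf_gamma)
```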

Applications: The Wishart distribution plays two main roles. In frequentist statistics, it is the sampling distribution of the scatter matrix \(\sum_{i=1}^N \boldsymbol{x}_i \boldsymbol{x}_i^T\) of i.i.d. zero-mean MVN samples, and hence underlies inference about the sample covariance estimator. In Bayesian statistics, it serves as a conjugate prior for the precision (inverse covariance) matrix of an MVN. One role is about sampling properties of the estimator (frequentist), and the other is about prior beliefs and posterior inference (Bayesian). In practice, both deal with uncertainty in covariance matrices, and the same distribution arises in both roles, just with different interpretations.
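For the Bayesian role, here is a minimal sketch of the conjugate update under the parameterization above, applied to the precision matrix \(\boldsymbol{\Lambda} = \Sigma^{-1}\) of an MVN; the zero-mean likelihood is an assumption made here to keep the update short: \[ \boldsymbol{\Lambda} \sim \operatorname{Wi}(\boldsymbol{S}_0, \nu_0), \quad \boldsymbol{x}_1, \dots, \boldsymbol{x}_N \mid \boldsymbol{\Lambda} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Lambda}^{-1}) \;\Longrightarrow\; \boldsymbol{\Lambda} \mid \boldsymbol{x}_{1:N} \sim \operatorname{Wi}\Bigl(\bigl(\boldsymbol{S}_0^{-1} + \textstyle\sum_{i=1}^N \boldsymbol{x}_i \boldsymbol{x}_i^T\bigr)^{-1},\; \nu_0 + N\Bigr). \] The prior term \(\boldsymbol{S}_0^{-1}\) simply accumulates the observed scatter, mirroring the pseudocount interpretation of the Dirichlet parameters.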