The Exponential Family
All members of the exponential family have a conjugate prior. Before diving deeper into Bayesian
statistics, we discuss this important family of distributions.
The exponential family is a family of probability distributions parameterized by
natural parameters (or canonical parameters) \(\eta \in \mathbb{R}^K\) with
support over \(\mathcal{X}^D \subseteq \mathbb{R}^D\) such that
\[
\begin{align*}
p(x | \eta ) &= \frac{1}{Z(\eta)} h(x) \exp\{\eta^T \mathcal{T}(x)\}\\\\
&= h(x) \exp\{\eta^T \mathcal{T}(x) - A(\eta)\}
\end{align*}
\]
where
- \(h(x)\) is the base measure, a scaling constant that is often simply 1.
- \(\mathcal{T}(x) \in \mathbb{R}^K\) is the vector of sufficient statistics.
- \(Z(\eta)\) is the normalization constant (or partition function), and \(A(\eta) = \log Z(\eta)\) is the log partition function.
Each member of the exponential family is defined by its particular choice of \(h(x)\) and \(\mathcal{T}(x)\).
Note: the log partition function is convex over the convex set \(\Omega = \{\eta \in \mathbb{R}^K : A(\eta) < \infty\}\).
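To make the general definition concrete, here is a minimal Python sketch (the helper `expfam_pdf` and its arguments are illustrative names, not fixed notation) that evaluates an exponential-family density from its components, using the unit-variance Gaussian, which reappears below as a natural exponential family, as a sanity check:

```python
import numpy as np

def expfam_pdf(x, eta, h, T, A):
    """Evaluate p(x | eta) = h(x) * exp(eta^T T(x) - A(eta))."""
    return h(x) * np.exp(np.dot(eta, T(x)) - A(eta))

# Sanity check with the unit-variance Gaussian N(x | mu, 1), for which
# eta = mu, T(x) = x, A(eta) = eta^2 / 2, and h(x) = N(x | 0, 1).
mu, x = 0.7, 1.3
p = expfam_pdf(
    x, np.array([mu]),
    h=lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi),
    T=lambda x: np.array([x]),
    A=lambda eta: eta[0]**2 / 2,
)
ref = np.exp(-(x - mu)**2 / 2) / np.sqrt(2 * np.pi)
assert np.isclose(p, ref)
```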
An exponential family is said to be minimal if there is no \(\eta \in \mathbb{R}^K \setminus \{0\}\) such that \[ \eta^T\mathcal{T}(x) = 0 \quad \text{for all } x. \] This means that the components of \(\mathcal{T}(x)\) are linearly independent, so the natural parameters are not redundant. This condition can be violated in the case of multinomial distributions, but we can reparameterize the distribution using \(K-1\) independent parameters, as illustrated below.
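For example, for a categorical variable with \(K\) states and probabilities \(\mu_1, \dots, \mu_K\), the over-complete representation uses \(\mathcal{T}(x) = [\mathbb{I}(x=1), \dots, \mathbb{I}(x=K)]\) and \(\eta = [\log \mu_1, \dots, \log \mu_K]\); since \(\sum_{k=1}^{K} \mathbb{I}(x=k) = 1\), the statistics are linearly dependent. Dropping the last state gives a minimal representation with \[ \eta_k = \log\Big(\frac{\mu_k}{\mu_K}\Big), \quad k = 1, \dots, K-1, \] which mirrors the Bernoulli case below.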
Let \(\eta = f(\phi)\), where \(\phi\) is some other, possibly smaller, set of parameters; then \[ p(x | \phi ) = h(x) \exp\{ f(\phi)^T \mathcal{T}(x) - A(f(\phi))\}. \] If the mapping \(\phi \to \eta\) is nonlinear, the model is said to be a curved exponential family.
If \(\eta = f(\phi) = \phi\), the model is in canonical form; if, in addition, \(\mathcal{T}(x) = x\), we call it a natural exponential family (NEF): \[ p(x | \eta ) = h(x) \exp\{\eta^T x - A(\eta)\}. \] Finally, we define the moment parameters as \[ m = \mathbb{E}[\mathcal{T}(x)] \in \mathbb{R}^K. \]
As a first example, the Bernoulli distribution can be written in exponential family form as \[ \text{Ber}(x | \mu) = \mu^x (1-\mu)^{1-x} = \exp\{x \log \mu + (1-x) \log (1-\mu)\} \] where
- \(\mathcal{T}(x) = [\mathbb{I}(x=1), \, \mathbb{I}(x=0)]\).
- \(\eta = [\log(\mu), \, \log(1-\mu)]\).
- \(\mu\) is the mean parameter.
In this representation there is a linear dependence between the features, since \(\mathbb{I}(x=1) + \mathbb{I}(x=0) = 1\), so we cannot define \(\eta\) uniquely. It is common to use a minimal representation instead, so that there is a unique \(\eta\) associated with the distribution: \[ \text{Ber}(x | \mu) = \exp\Big\{x \log \Big(\frac{\mu}{1-\mu}\Big) + \log (1-\mu)\Big\} \] where
- \(\mathcal{T}(x) = x\).
- \(\eta = \log \Big(\frac{\mu}{1-\mu}\Big)\).
- \(A(\eta) = -\log (1-\mu) = \log(1+ e^{\eta})\).
- \(h(x) = 1\).
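A quick numerical check of this minimal form (a sketch; the function name is illustrative):

```python
import numpy as np

def bernoulli_expfam(x, mu):
    """Ber(x | mu) via the minimal exponential family form."""
    eta = np.log(mu / (1 - mu))   # natural parameter: the log-odds
    A = np.log(1 + np.exp(eta))   # log partition function A(eta)
    return np.exp(eta * x - A)    # base measure h(x) = 1

assert np.isclose(bernoulli_expfam(1, 0.3), 0.3)
assert np.isclose(bernoulli_expfam(0, 0.3), 0.7)
```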
An important property of the exponential family is that the derivatives of the log partition function yield the cumulants of the sufficient statistics. The first cumulant is the mean, \[ \nabla A(\eta) = \mathbb{E}[\mathcal{T}(x)] = m, \] and the second cumulant is given by \[ \nabla^2 A(\eta) = \text{Cov}[\mathcal{T}(x)], \] which means that the Hessian is positive semi-definite (positive definite for a minimal representation), and thus the log partition function \(A(\eta)\) is convex in \(\eta\).
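For the Bernoulli above, \(A(\eta) = \log(1 + e^{\eta})\), so \(\nabla A(\eta) = \sigma(\eta) = \mu = \mathbb{E}[x]\) and \(\nabla^2 A(\eta) = \mu(1-\mu) = \text{Var}[x]\). A finite-difference sketch confirms this:

```python
import numpy as np

A = lambda eta: np.log(1 + np.exp(eta))  # Bernoulli log partition function
eta, eps = 0.5, 1e-4
mu = 1 / (1 + np.exp(-eta))              # sigmoid(eta) = mean parameter

dA = (A(eta + eps) - A(eta - eps)) / (2 * eps)             # ~ first derivative
d2A = (A(eta + eps) - 2 * A(eta) + A(eta - eps)) / eps**2  # ~ second derivative
assert np.isclose(dA, mu)                         # E[T(x)] = mu
assert np.isclose(d2A, mu * (1 - mu), atol=1e-5)  # Var[T(x)] = mu(1 - mu)
```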
Next, consider the univariate Gaussian, which can be written in exponential family form as \[ N(x | \mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\} = h(x) \exp\{\eta^T \mathcal{T}(x) - A(\eta)\} \] where
- \(\mathcal{T}(x) = \begin{bmatrix}x \\ x^2 \end{bmatrix}\)
- \(\eta = \begin{bmatrix} \frac{\mu}{\sigma^2} \\ -\frac{1}{2\sigma^2} \end{bmatrix} \)
- \(A(\eta) = \frac{\mu^2}{2\sigma^2}+\log \sigma = -\frac{\eta_1^2}{4\eta_2}-\frac{1}{2}\log(-2\eta_2)\)
- \(h(x) = \frac{1}{\sqrt{2\pi}}\).
Also, the moment parameters are given by: \[ m = \begin{bmatrix} \mu \\ \mu^2 + \sigma^2 \end{bmatrix} \] Note: If \(\sigma = 1\), the distribution becomes a natural exponential family, with
- \(\mathcal{T}(x) = x\)
- \(\eta = \mu\)
- \(A(\eta) = \frac{\mu^2}{2\sigma^2}+\log \sigma = \frac{\mu^2}{2}\)
- \(h(x) = \frac{1}{\sqrt{2\pi}}\exp\{-\frac{x^2}{2}\} = N(x | 0, 1)\): note that the base measure is not constant here.
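The parameterization above can be checked the same way; a sketch (variable names are illustrative) that evaluates the density through \(\eta\), \(A(\eta)\), and \(h(x)\) and compares it with the usual Gaussian formula:

```python
import numpy as np

mu, sigma = 1.2, 0.8
eta = np.array([mu / sigma**2, -1 / (2 * sigma**2)])       # natural parameters
A = -eta[0]**2 / (4 * eta[1]) - 0.5 * np.log(-2 * eta[1])  # log partition function
h = 1 / np.sqrt(2 * np.pi)                                 # base measure

x = 0.3
p = h * np.exp(eta @ np.array([x, x**2]) - A)
ref = np.exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
assert np.isclose(p, ref)
```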
As a final example, the multivariate Gaussian \(N(x | \mu, \Sigma)\) can be rewritten as \[ N_c (x | \xi, \Lambda) = c' \exp\Big\{-\frac{1}{2}x^T \Lambda x + x^T \xi \Big\} \] where
- \(\Lambda = \Sigma^{-1}\) is the precision matrix.
- \(\xi = \Sigma^{-1}\mu\) is the precision-weighted mean vector.
- \(c' = \frac{\exp\Big\{-\frac{1}{2}\xi^T \Lambda^{-1} \xi \Big\}}{(2\pi)^{\frac{D}{2}}\sqrt{\det(\Lambda^{-1})}}\) is the normalization constant.
This representation is called information form and can be converted to exponential family notation as follows: \[ \begin{align*} N_c (x | \xi, \Lambda) &= (2\pi)^{-\frac{D}{2}} \exp\Big\{\frac{1}{2}\log |\Lambda | -\frac{1}{2}\xi^T \Lambda^{-1}\xi \Big\} \exp\Big\{-\frac{1}{2}x^T \Lambda x + x^T \xi \Big\} \\\\ &= h(x)g(\eta)\exp\Big\{-\frac{1}{2}x^T \Lambda x + x^T \xi \Big\} \\\\ &= h(x)g(\eta)\exp\Big\{-\frac{1}{2}(\sum_{i, j}x_i x_j \Lambda_{ij}) + x^T \xi \Big\} \\\\ &= h(x)g(\eta)\exp\Big\{-\frac{1}{2}\text{vec}(\Lambda )^T \text{vec}(xx^T) + x^T \xi \Big\} \\\\ &= h(x)\exp\{\eta^T \mathcal{T}(x) - A(\eta)\} \end{align*} \] where
- \(\mathcal{T}(x) = [x ; \text{vec}(xx^T)]\)
- \(\eta = [\xi ; -\frac{1}{2}\text{vec}(\Lambda)] = [\Sigma^{-1}\mu ; -\frac{1}{2}\text{vec}(\Sigma^{-1})]\)
- \(A(\eta) = -\log g(\eta) = -\frac{1}{2} \log | \Lambda | + \frac{1}{2}\xi^T \Lambda^{-1} \xi \)
- \(h(x) = (2\pi)^{-\frac{D}{2}}\).
The moment parameters are given by: \[ m = [\mu ; \mu\mu^T + \Sigma]. \] Note: This form is NOT minimal, since the matrix \(\Lambda\) is symmetric and the off-diagonal entries of \(\text{vec}(\Lambda)\) are therefore duplicated; a minimal representation would keep only the lower (or upper) triangular part. In practice, however, the non-minimal representation is easier to plug into algorithms and more stable for certain operations, while the minimal representation is better suited to mathematical derivations.
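To close, here is a sketch (function and variable names are illustrative) that converts the moment parameters \((\mu, \Sigma)\) to the information form \((\xi, \Lambda)\) and evaluates the log-density through it, checking the result against the standard moment form:

```python
import numpy as np

def mvn_info_logpdf(x, xi, Lam):
    """log N_c(x | xi, Lambda), evaluated in information form."""
    D = x.shape[0]
    log_c = (0.5 * np.linalg.slogdet(Lam)[1]        # (1/2) log |Lambda|
             - 0.5 * xi @ np.linalg.solve(Lam, xi)  # -(1/2) xi^T Lambda^{-1} xi
             - 0.5 * D * np.log(2 * np.pi))
    return log_c - 0.5 * x @ Lam @ x + x @ xi

# Convert moment parameters (mu, Sigma) to information form (xi, Lambda).
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
Lam = np.linalg.inv(Sigma)  # precision matrix
xi = Lam @ mu               # precision-weighted mean

x = np.array([0.5, -1.5])
d = x - mu
ref = (-0.5 * d @ Lam @ d
       - 0.5 * np.linalg.slogdet(Sigma)[1]
       - 0.5 * len(x) * np.log(2 * np.pi))
assert np.isclose(mvn_info_logpdf(x, xi, Lam), ref)
```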