The Exponential Family

The Exponential Family MLE for the Exponential Family

The Exponential Family

Actually, all members of the exponential family have a conjugate prior. Before diving into Bayesian statistics deeper, we discuss this important family of distributions.

The exponential family is a family of probability distributions parameterized by natural parameters(or canonical parameters) \(\eta \in \mathbb{R}^K\) with support over \(\mathcal{X}^D \subseteq \mathbb{R}^D\) such that \[ \begin{align*} p(x | \eta ) &= \frac{1}{Z(\eta)} h(x) \exp\{\eta^T \mathcal{T}(x)\}\\\\ &= h(x) \exp\{\eta^T \mathcal{T}(x) - A(\eta)\} \end{align*} \] where


Each exponential family is defined by different \(h(x)\) and \(\mathcal{T}(x)\).
Note: the log partition function is convex over the convex set \(\Omega = \{\eta \in \mathbb{R}^K : A(\eta) < \infty\}\).

An exponential family is said to be minimal if there is no \(\eta \in \mathbb{R}^K \setminus \{0\}\) such that \[ \eta^T\mathcal{T}(x) = 0. \] This means that the natural parameters are independent of each other. This condition can be violated in the case of multinomial distributions, but we can reparameterize the distribution using \(K-1\) independent parameters.

Let \(\eta = f(\phi)\), where \(\phi\) is some other possibly smaller set of parameters, and then \[ p(x | \phi ) = h(x) \exp\{ f(\phi)^T \mathcal{T}(x) - A(f(\phi))\}. \] If the mapping \(\phi \to \eta\) is nonlinear, it is said to be a curved exponential family.
If \(\eta = f(\phi) = \phi\), the model is in canonical form and in addition, if \(\mathcal{T} =x\), we call it a natural exponential family(NEF): \[ p(x | \eta ) = h(x) \exp\{\eta^T x - A(\eta)\}. \] Finally, we define the moment parameters as follows: \[ m = \mathbb{E }[\mathcal{T}(x)] \in \mathbb{R}^K. \]
Example 1: Bernoulli distribution \[ \begin{align*} \text{Ber }(x | \mu) &= \mu^x (1-\mu)^{1-x} \\\\ &= \exp\{x \log (\mu) + (1-x) \log (1-\mu)\}\\\\ &= \exp\{\mathcal{T}(x)^T \eta\} \end{align*} \] where
  • \(\mathcal{T}(x) = [\mathbb{I}(x=1), \, \mathbb{I}(x=0)]\).
  • \(\eta = [\log(\mu), \, \log(1-\mu)]\).
  • \(\mu\) is the mean parameter.

In this representation, there is a linear dependence between the features, and then we cannot define \(\eta\) uniquely. It is common to use a minimal representation so that there is a unique \(\eta\) associated with the distribution. \[ \text{Ber }(x | \mu) = \exp\Big\{x \log \Big(\frac{\mu}{1-\mu}\Big) + \log (1-\mu)\Big\} \] where
  • \(\mathcal{T}(x) = x\).
  • \(\eta = \log \Big(\frac{\mu}{1-\mu}\Big)\).
  • \(A(\eta) = -\log (1-\mu) = \log(1+ e^{\eta})\).
  • \(h(x) = 1\).
Note: The mean parameter \(\mu\) can be recovered from the canonical parameter \(\eta\): \[ \mu = \sigma(\eta) = \frac{1}{1+e^{-\eta}} \text{(This is a logistic function.)} \] Also, you might notice that \[ \begin{align*} \frac{dA}{d \eta} &= \frac{d}{d\eta} \log(1+ e^{\eta}) \\\\ &= \frac{e^{\eta}}{1 + e^{\eta}} \\\\ &= \frac{1}{1+e^{-\eta}} \\\\ &= \mu \end{align*} \] In general, derivatives of the log partition function generate all the cumulants of the sufficient statistics. For example, the first cumulant is given by \[ \nabla A(\eta) = \mathbb{E }[\mathcal{T}(\eta)] \] (Remember, this is moment parameters, \(m\))
and the second cumulant is given by \[ \nabla^2 A(\eta) = \text{Cov }[\mathcal{T}(\eta)] \] which means that the Hessian is positive definite, and thus the log partition function \(A(\eta)\) is convex in \(\eta\).
Example 2: Normal distribution \[ \begin{align*} N(x | \mu, \, \sigma^2) &= \frac{1}{\sigma\sqrt{2\pi}}\exp\Big\{-\frac{1}{2\sigma^2}(x - \mu)^2\Big\} \\\\ &= \frac{1}{\sqrt{2\pi}} \exp\Big\{ \frac{\mu}{\sigma^2}x - \frac{1}{2\sigma^2}x^2 -\frac{1}{2\sigma^2}\mu^2 -\log \sigma\Big\} \end{align*} \] where
  • \(\mathcal{T}(x) = \begin{bmatrix}x \\ x^2 \end{bmatrix}\)
  • \(\eta = \begin{bmatrix} \frac{\mu}{\sigma^2} \\ -\frac{1}{2\sigma^2} \end{bmatrix} \)
  • \(A(\eta) = \frac{\mu^2}{2\sigma^2}+\log \sigma = -\frac{\eta_1^2}{4\eta_2}-\frac{1}{2}\log(-2\eta_2)\)
  • \(h(x) = \frac{1}{\sqrt{2\pi}}\).

Also, the moment parameters are given by: \[ m = \begin{bmatrix} \mu \\ \mu^2 + \sigma^2 \end{bmatrix} \] Note: If \(\sigma = 1\), the distribution becomes a natural exponential family such that
  • \(\mathcal{T}(x) = x\)
  • \(\eta = \mu\)
  • \(A(\eta) = \frac{\mu^2}{2\sigma^2}+\log \sigma = \frac{\mu^2}{2}\)
  • \(h(x) = \frac{1}{\sqrt{2\pi}}\exp\{-\frac{x^2}{2}\} = N(x | 0, 1)\) : Not constant.
Example 3: Multivariate Normal distribution(MVN) \[ \begin{align*} N(x | \mu, \Sigma) &= \frac{1}{(2\pi)^{\frac{D}{2}}\sqrt{\det(\Sigma)}} \exp\Big\{ \frac{1}{2}x^T\Sigma^{-1}x + x^T\Sigma^{-1}\mu -\frac{1}{2}\mu^T\Sigma^{-1}\mu \Big\}\\\\ &= c \exp\Big\{x^T\Sigma^{-1}\mu -\frac{1}{2}x^T\Sigma^{-1}x \Big\} \end{align*} \] where \[ c = \frac{\exp\Big\{-\frac{1}{2}\mu^T \Sigma^{-1} \mu\Big\}}{(2\pi)^{\frac{D}{2}}\sqrt{\det(\Sigma)}} \] and \(\Sigma\) ia a covariance matrix. Now, we represent this model using canonical parameters. \[ N_c (x | \xi, \Lambda) = c' \exp \Big\{x^T \xi \frac{1}{2}x^T \Lambda x \Big\} \] where
  • \(\Lambda = \Sigma^{-1}\) is a precision matrix
  • \(\xi = \Sigma^{-1}\mu\) is a precision-weighted mean vector
  • \(c' = \frac{\exp\Big\{-\frac{1}{2}\xi^T \Lambda^{-1} \xi \Big\}}{(2\pi)^{\frac{D}{2}}\sqrt{\det(\Lambda^{-1})}}\).

This representation is called information form and can be converted to exponential family notation as follows: \[ \begin{align*} N_c (x | \xi, \Lambda) &= (2\pi)^{-\frac{D}{2}} \exp\Big\{\frac{1}{2}\log |\Lambda | -\frac{1}{2}\xi^T \Lambda^{-1}\xi \Big\} \exp\Big\{-\frac{1}{2}x^T \Lambda x + x^T \xi \Big\} \\\\ &= h(x)g(\eta)\exp\Big\{-\frac{1}{2}x^T \Lambda x + x^T \xi \Big\} \\\\ &= h(x)g(\eta)\exp\Big\{-\frac{1}{2}(\sum_{i, j}x_i x_j \Lambda_{ij}) + x^T \xi \Big\} \\\\ &= h(x)g(\eta)\exp\Big\{-\frac{1}{2}\text{vec}(\Lambda )^T \text{vec}(xx^T) + x^T \xi \Big\} \\\\ &= h(x)\exp\{\eta^T \mathcal{T}(x) - A(\eta)\} \end{align*} \] where
  • \(\mathcal{T}(x) = [x ; \text{vec}(xx^T)]\)
  • \(\eta = [\xi ; -\frac{1}{2}\text{vec}(\Lambda)] = [\Sigma^{-1}\mu ; -\frac{1}{2}\text{vec}(\Sigma^{-1})]\)
  • \(A(\eta) = -\log g(\eta) = -\frac{1}{2} \log | \Lambda | + \frac{1}{2}\xi^T \Lambda^{-1} \xi \)
  • \(h(x) = (2\pi)^{-\frac{D}{2}}\).

The moment parameters are given by: \[ m = [\mu ; \mu\mu^T + \Sigma]. \] Note: This form is NOT minimal since the matrix \(\Lambda\) is symmetric, so we can split it into lower and upper triangular matrices. However, in practice, non-minimal representation is easier to plug into algorithms and stable for certain operations. The minimal representation is optimized for mathematical derivations.

MLE for the Exponential Family

The likelihood of an exponential family model is given by: \[ \begin{align*} p(\mathcal{D} | \eta) &= \Big\{\prod_{n=1}^N h(x_n)\Big\} \exp\Big\{\eta^T \Big[\sum_{n=1}^N \mathcal{T}(x_n)\Big] - N A(\eta)\Big\} \\\\ &\propto \exp\{\eta^T \mathcal{T}(\mathcal{D}) - N A(\eta)\} \end{align*} \] where \(\mathcal{T}(\mathcal{D})\) are the sufficient statistics: \[ \mathcal{T}(\mathcal{D}) = \begin{bmatrix}\sum_{n=1}^N \mathcal{T}_1 (x_n) \\ \vdots \\ \sum_{n=1}^N \mathcal{T}_K (x_n) \end{bmatrix}. \]
The derivative of the log partition function yields the expected value of the sufficient statistic vector: \[ \begin{align*} \nabla_{\eta} \log p(\mathcal{D} | \eta) &= \nabla_{\eta} \eta^T \mathcal{T}(\mathcal{D}) - N \nabla_{\eta} A(\eta) \\\\ &= \mathcal{T}(\mathcal{D}) - N \mathbb{E }[\mathcal{T}(x)] \end{align*} \] Setting this gradient to zero, we obtain the MLE \(\hat{\eta}\), which must satisfy \[ \mathbb{E }[\mathcal{T}(x)] = \frac{1}{N}\sum_{n=1}^N \mathcal{T}(x_n). \] This means that the empirical average of the sufficient statistics equals to the model's theoretical expected sufficient statistics. This is called moment matching. For example, in Gaussian distribution, its empirical mean(the first moment) is given by: \[ \bar{x} = \frac{1}{N}\sum_{n=1}^N x_n. \] and its expected value(theoretical moment under the model) is \[ \mathbb{E }[x] = \mu = \bar{x}. \] As we know, the MLE estimate \(\hat{\mu} = \bar{x}\), which leads to moment matching conditions. Thus, MLE for exponential family distributions naturally reduces to solving moment matching conditions, making the estimation process simpler.