In general, specifying a prior is a key bottleneck of Bayesian inference. Here, we introduce a special class of priors.
A prior \(p(\theta) \in \mathcal{F}\) is said to be a conjugate prior for a likelihood function \(p(\mathcal{D} | \theta)\) if
the posterior belongs to the same parameterized family as the prior: \(p(\theta | \mathcal{D}) \in \mathcal{F}\).
Through the following example, we introduce the basic ideas of Bayesian inference along with an example of a conjugate prior.
Example 1: Beta-Binomial Model
Consider tossing a coin \(N\) times. Let \(\theta \in [0, 1]\) be the probability of getting heads. We record the outcomes as
\(\mathcal{D} = \{y_n \in \{0, 1\} : n = 1 : N\}\). We assume the data are iid.
If we consider a sequence of coin tosses, the
likelihood can be written as the Bernoulli likelihood model:
\[
\begin{align*}
p(\mathcal{D} | \theta) &= \prod_{n = 1}^N \theta^{y_n}(1 - \theta)^{1-y_n} \\\\
&= \theta^{N_1}(1 - \theta)^{N_0}
\end{align*}
\]
where \(N_1\) and \(N_0\) are the numbers of heads and tails, respectively (sample size: \(N_1 + N_0 = N\)).
Alternatively, we can consider the binomial likelihood model, which has the following form:
\[
\begin{align*}
p(\mathcal{D} | \theta) &= \text{Bin } (y | N, \theta) \\\\
&= \binom{N}{y} \theta^y (1 - \theta)^{N - y} \\\\
&\propto \theta^y (1 - \theta)^{N - y}
\end{align*}
\]
where \(y\) is the number of heads.
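As a quick numerical illustration (a minimal sketch, not part of the original derivation; the sample size, true \(\theta\), and use of `numpy`/`scipy` are my own assumptions), we can simulate coin tosses and evaluate both likelihood forms:
```python
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)
theta_true = 0.7                           # hypothetical true chance of heads
N = 20
ys = rng.binomial(1, theta_true, size=N)   # D = {y_1, ..., y_N}
y = int(ys.sum())                          # number of heads

theta = 0.5                                # candidate parameter value
# Bernoulli (sequence) likelihood: prod_n theta^{y_n} (1 - theta)^{1 - y_n}
bern_lik = theta**y * (1 - theta)**(N - y)
# Binomial likelihood: the same, up to the (N choose y) coefficient
binom_lik = binom.pmf(y, N, theta)
print(bern_lik, binom_lik)
```
The two forms differ only by the constant \(\binom{N}{y}\), which does not depend on \(\theta\) and therefore does not affect the posterior.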
Next, we have to specify a prior. If we know nothing about the parameter,
an uninformative prior can be used:
\[
p(\theta) = \text{Unif }(\theta | 0, 1).
\]
However, "in this example", using beta distribution(See
Gamma & Beta distribution ), we can represent the prior as follows:
\[
\begin{align*}
p(\theta) = \text{Beta }(\theta | a, b) &= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1} \tag{1} \\\\
&\propto \theta^{a-1}(1-\theta)^{b-1}
\end{align*}
\]
where \(a, b > 0\) are usually called hyper-parameters (our main parameter is \(\theta\)).
Note: If \(a = b = 1\), we recover the uninformative (uniform) prior.
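To see how the hyper-parameters shape the prior, here is a small sketch (my own illustration; the particular \((a, b)\) pairs are arbitrary) that evaluates the beta density with `scipy.stats.beta`; note that \(a = b = 1\) recovers the uniform prior:
```python
from scipy.stats import beta

thetas = [0.1, 0.5, 0.9]
for a, b in [(1, 1), (2, 2), (10, 2)]:   # uniform, weakly informative, "coin favors heads"
    densities = [round(beta(a, b).pdf(t), 3) for t in thetas]
    print(f"Beta({a},{b}) density at {thetas}: {densities}")
# Beta(1,1) has density 1 everywhere on [0,1], i.e. the uniform prior.
```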
Using Bayes' rule, the
posterior is proportional to the product of the likelihood and the prior:
\[
\begin{align*}
p(\theta | \mathcal{D}) &\propto [\theta^{y}(1 - \theta)^{N-y}] \cdot [\theta^{a-1}(1-\theta)^{b-1}] \\\\
&= \theta^{a+y-1}(1-\theta)^{b+N-y-1} \\\\
&\propto \text{Beta }(\theta | a+y, \, b+N-y) \\\\
&= \frac{\Gamma(a+b+N)}{\Gamma(a+y)\Gamma(b+N-y)}\theta^{a+y-1}(1-\theta)^{b+N-y-1}. \tag{2}
\end{align*}
\]
Here, the posterior has the same functional form as the prior. Thus, the beta distribution is the
conjugate prior for the
binomial distribution.
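We can verify the conjugacy numerically with a short sketch (my own check; the hyper-parameters and counts below are made up): normalizing the product likelihood \(\times\) prior by brute-force integration reproduces the closed-form \(\text{Beta}(\theta \mid a+y,\, b+N-y)\) density:
```python
from scipy.stats import beta
from scipy.integrate import quad

a, b = 2.0, 2.0        # hypothetical prior hyper-parameters
N, y = 20, 14          # hypothetical data: 14 heads out of 20 tosses

# Unnormalized posterior: likelihood * prior (constants dropped)
unnorm = lambda t: t**y * (1 - t)**(N - y) * t**(a - 1) * (1 - t)**(b - 1)
Z, _ = quad(unnorm, 0.0, 1.0)              # numerical normalizing constant

post = beta(a + y, b + N - y)              # closed-form conjugate posterior
for t in (0.3, 0.6, 0.8):
    print(unnorm(t) / Z, post.pdf(t))      # the two densities agree
```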
Once we have the posterior distribution, we can, for example, use the
posterior mean \(\bar{\theta}\) as a point estimate of \(\theta\):
\[
\begin{align*}
\bar{\theta} = \mathbb{E }[\theta | \mathcal{D}] &= \frac{a+y}{(a+y) + (b+N-y)} \\\\
&= \frac{a+y}{a+b+N}.
\end{align*}
\]
Note:
By adjusting hyper-parameters \(a\) and \(b\), we can control the influence of the prior on the posterior.
If \(a\) and \(b\) are small relative to \(N\), the posterior mean will closely reflect the data:
\[
\bar{\theta} \approx \frac{y}{N} = \hat{\theta}_{MLE}
\]
while if \(a\) and \(b\) are large, the posterior mean will be pulled toward the prior mean \(a/(a+b)\).
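The following sketch (my own illustration with made-up counts) shows this trade-off: the posterior mean is a weighted average of the prior mean \(a/(a+b)\) and the MLE \(y/N\), and small \(a, b\) put almost all the weight on the data:
```python
N, y = 20, 14                              # hypothetical data: 14 heads out of 20

for a, b in [(1.0, 1.0), (100.0, 100.0)]:
    post_mean = (a + y) / (a + b + N)
    lam = (a + b) / (a + b + N)            # weight on the prior mean
    shrinkage = lam * (a / (a + b)) + (1 - lam) * (y / N)
    print(f"a=b={a}: posterior mean {post_mean:.3f} "
          f"(= weighted average {shrinkage:.3f}), MLE {y/N:.3f}")
# Small a, b: posterior mean ~ MLE (0.70); large a, b: pulled toward prior mean 0.5.
```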
Often we need to check the
standard error of our estimate, which is the posterior standard deviation:
\[
\begin{align*}
\text{SE }(\theta) &= \sqrt{\text{Var }[\theta | \mathcal{D}]} \\\\
&= \sqrt{\frac{(a+y)(b+N-y)}{(a+b+N)^2(a+b+N+1)}}
\end{align*}
\]
Here, if \(N \gg a, b\), we can simplify the
posterior variance as follows:
\[
\begin{align*}
\text{Var }[\theta | \mathcal{D}] &\approx \frac{y(N-y)}{N^2 \cdot N} \\\\
&= \frac{y}{N^2} - \frac{y^2}{N^3} \\\\
&= \frac{\hat{\theta}(1 - \hat{\theta})}{N}
\end{align*}
\]
where \(\hat{\theta} = \frac{y}{N}\) is the MLE.
Thus, the standard error is given by
\[
\text{SE }(\theta) \approx \sqrt{\frac{\hat{\theta}(1 - \hat{\theta})}{N}}.
\]
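As a quick check (again with made-up numbers), the exact posterior standard deviation from the \(\text{Beta}(a+y,\, b+N-y)\) posterior is close to this approximation once \(N \gg a, b\):
```python
import numpy as np
from scipy.stats import beta

a, b = 2.0, 2.0
N, y = 200, 140                            # hypothetical data with N >> a, b

exact_se = beta(a + y, b + N - y).std()    # exact posterior standard deviation
theta_hat = y / N                          # MLE
approx_se = np.sqrt(theta_hat * (1 - theta_hat) / N)
print(exact_se, approx_se)                 # nearly identical for large N
```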
From (1) and (2), the
marginal likelihood is given by the ratio of the normalization constants (beta functions)
of the prior and posterior (using the Bernoulli form of the likelihood; the binomial form multiplies this by \(\binom{N}{y}\)):
\[
p(\mathcal{D}) = \frac{B(a+y,\, b+N-y)}{B(a, b)}.
\]
Note: In general, computing the marginal likelihood is expensive or intractable, but a conjugate prior allows us to obtain the
exact marginal likelihood easily. Otherwise, we have to resort to approximation methods.
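A brief sketch (my own check, using made-up counts) that the closed-form marginal likelihood matches brute-force integration of likelihood \(\times\) prior; `scipy.special.betaln` gives \(\log B(\cdot,\cdot)\) for numerical stability:
```python
import numpy as np
from scipy.special import betaln
from scipy.integrate import quad

a, b = 2.0, 2.0
N, y = 20, 14

# Closed form (Bernoulli-sequence likelihood): p(D) = B(a + y, b + N - y) / B(a, b)
log_marg = betaln(a + y, b + N - y) - betaln(a, b)

# Brute-force check: integrate likelihood * prior over theta in [0, 1]
integrand = lambda t: (t**y * (1 - t)**(N - y)
                       * t**(a - 1) * (1 - t)**(b - 1) / np.exp(betaln(a, b)))
marg, _ = quad(integrand, 0.0, 1.0)
print(np.exp(log_marg), marg)              # the two values agree
```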
Finally, to make predictions for new observations, we use the
posterior predictive distribution:
\[
p(x_{new} | \mathcal{D}) = \int p(x_{new} | \theta) p(\theta | \mathcal{D}) d\theta.
\]
Again, as with the marginal likelihood, the posterior predictive distribution is generally difficult to compute, but in this case
we can obtain it easily thanks to the conjugate prior.
For example, the probability of observing a head in the next coin toss is given by:
\[
\begin{align*}
p(y_{new}=1 | \mathcal{D}) &= \int_0 ^1 p(y_{new}=1 | \theta) p(\theta | \mathcal{D}) d\theta \\\\
&= \int_0 ^1 \theta \text{Beta }(\theta | a+y, \, b+N-y) d\theta \\\\
&= \mathbb{E }[\theta|\mathcal{D}] \\\\
&= \frac{a+y}{a+b+N}.
\end{align*}
\]
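A final sketch (my own check, with the same made-up counts as above): drawing posterior samples of \(\theta\) and averaging \(p(y_{new}=1 \mid \theta) = \theta\) recovers the closed-form answer \((a+y)/(a+b+N)\):
```python
import numpy as np
from scipy.stats import beta

a, b = 2.0, 2.0
N, y = 20, 14

closed_form = (a + y) / (a + b + N)        # p(y_new = 1 | D) in closed form

# Monte Carlo check: E[theta | D] estimated from posterior samples
rng = np.random.default_rng(0)
samples = beta(a + y, b + N - y).rvs(size=100_000, random_state=rng)
print(closed_form, samples.mean())         # should agree to ~3 decimal places
```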
Note: As you can see, the hyper-parameters \(a\) and \(b\) are critical throughout the inference process. In practice, setting the
hyper-parameters is one of the most challenging parts of a project.