Intro to Bayesian Statistics


Bayesian Inference

In Bayesian statistics, we assume that prior knowledge about the unknown parameters can be expressed by a probability distribution, called the prior distribution, \(p(\theta)\). This means that the unknown parameters \(\theta\) are treated as random variables, while the data \(\mathcal{D}\) are considered fixed. This is the opposite of the frequentist approach, where parameters are fixed and data are considered random.

Bayesian inference updates our belief (the prior distribution) using the observed data and expresses the updated belief about the parameters as the posterior distribution, \(p(\theta | \mathcal{D})\), via Bayes' rule: \[ \begin{align*} p(\theta | \mathcal{D}) &= \frac{p(\theta)p(\mathcal{D}| \theta)}{p(\mathcal{D})} \end{align*} \] The components are defined as follows:

  1. Prior distribution \(p(\theta)\): Represents our belief about the parameters \(\theta\) before observing any data. It encodes prior knowledge, assumptions, or uncertainty about \(\theta\).
  2. Likelihood \(p(\mathcal{D}| \theta)\): Describes how likely the observed data \(\mathcal{D}\) are, given a specific value of \(\theta\). This reflects the model of the data-generating process.
  3. Marginal likelihood (or evidence) \(p(\mathcal{D})\): A normalization constant. It matters when we compare different models; otherwise we can ignore this term because it is just a constant with respect to the parameters \(\theta\). For continuous parameters, the marginal likelihood is computed by integrating over all possible values of \(\theta\): \[ p(\mathcal{D}) = \int p(\theta')p(\mathcal{D} | \theta') \, d\theta'. \] Note: In the integral for the marginal likelihood, we use \(\theta'\) instead of \(\theta\) to clarify that the integration is over the entire parameter space, not a specific parameter value.

    Note: For discrete parameters, the marginal likelihood is given by: \[ p(\mathcal{D}) = \sum_{\theta'} p(\theta')p(\mathcal{D} | \theta'). \]
Note: See also Credible intervals.
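The following is a minimal numerical sketch of how the pieces above fit together when the parameter space is discretized into a grid. The coin-bias setup and the numbers (7 heads out of 10 tosses, a uniform prior) are illustrative assumptions, not values from the text.

```python
import numpy as np

# Bayes' rule on a discretized parameter space: theta is a coin's bias.
thetas = np.linspace(0.001, 0.999, 999)      # grid over the parameter space
prior = np.ones_like(thetas) / len(thetas)   # uniform prior p(theta) as a grid pmf

y, N = 7, 10                                 # illustrative data: 7 heads in 10 tosses
likelihood = thetas**y * (1 - thetas)**(N - y)   # p(D | theta) up to a constant

evidence = np.sum(prior * likelihood)            # p(D) = sum_theta' p(theta') p(D | theta')
posterior = prior * likelihood / evidence        # Bayes' rule

print("posterior sums to:", posterior.sum())           # ~1.0
print("posterior mean  :", np.sum(thetas * posterior)) # ~0.67
```

The grid approach works for a single parameter but scales poorly; the conjugate priors introduced next avoid the numerical integration entirely.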

Conjugate Prior

In general, specifying a prior is a bottleneck of Bayesian inference. Here, we introduce a special class of priors.

A prior \(p(\theta) \in \mathcal{F}\) is said to be a conjugate prior for a likelihood function \(p(\mathcal{D} | \theta)\) if the posterior is in the same parameterized family as the prior: \(p(\theta | \mathcal{D}) \in \mathcal{F}\).

Through the following example, we introduce basic Bayesian inference ideas and an example of a conjugate prior.

Example 1: Beta-Binomial Model Consider tossing a coin \(N\) times. Let \(\theta \in [0, 1]\) be the probability of getting heads. We record the outcomes as \(\mathcal{D} = \{y_n \in \{0, 1\} : n = 1 : N\}\). We assume the data are iid.

If we consider a sequence of coin tosses, the likelihood can be written as the Bernoulli likelihood model: \[ \begin{align*} p(\mathcal{D} | \theta) &= \prod_{n = 1}^N \theta^{y_n}(1 - \theta)^{1-y_n} \\\\ &= \theta^{N_1}(1 - \theta)^{N_0} \end{align*} \] where \(N_1\) and \(N_0\) are the number of heads and tails respectively. (Sample size: \(N_1 + N_0 = N\))

Alternatively, we can consider the Binomial likelihood model, which has the following form: \[ \begin{align*} p(\mathcal{D} | \theta) &= \text{Bin }(y | N, \theta) \\\\ &= \binom{N}{y} \theta^y (1 - \theta)^{N - y} \\\\ &\propto \theta^y (1 - \theta)^{N - y} \end{align*} \] where \(y\) is the number of heads.
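As a quick check, here is a sketch (with simulated tosses as an illustrative assumption) that evaluates both likelihood forms and confirms they differ only by the constant binomial coefficient, so either leads to the same posterior over \(\theta\).

```python
import numpy as np
from scipy.stats import bernoulli, binom

# Compare the Bernoulli (per-toss) and Binomial (count) likelihoods on simulated data.
rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=20)       # 20 simulated coin tosses (illustrative)
y, N = data.sum(), data.size

thetas = np.array([0.3, 0.5, 0.7])
bern = np.array([np.prod(bernoulli.pmf(data, t)) for t in thetas])  # prod_n theta^{y_n}(1-theta)^{1-y_n}
bino = binom.pmf(y, N, thetas)                                      # C(N, y) theta^y (1-theta)^{N-y}

print(bino / bern)   # constant ratio C(N, y), independent of theta
```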

Next, we have to specify a prior. If we know nothing about the parameter, an uninformative prior can be used: \[ p(\theta) = \text{Unif }(\theta | 0, 1). \] However, in this example, using the beta distribution (see Gamma & Beta distribution), we can represent the prior as follows: \[ \begin{align*} p(\theta) = \text{Beta }(\theta | a, b) &= \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\theta^{a-1}(1-\theta)^{b-1} \tag{1} \\\\ &\propto \theta^{a-1}(1-\theta)^{b-1} \end{align*} \] where \(a, b > 0\) are usually called hyper-parameters. (Our main parameter is \(\theta\).)
Note: If \(a = b = 1\), we recover the uninformative (uniform) prior.

Using Bayes' rule, the posterior is proportional to the product of the likelihood and the prior: \[ \begin{align*} p(\theta | \mathcal{D}) &\propto [\theta^{y}(1 - \theta)^{N-y}] \cdot [\theta^{a-1}(1-\theta)^{b-1}] \\\\ &\propto \text{Beta }(\theta | a+y, \, b+N-y) \\\\ &= \frac{\Gamma(a+b+N)}{\Gamma(a+y)\Gamma(b+N-y)}\theta^{a+y-1}(1-\theta)^{b+N-y-1}. \tag{2} \end{align*} \] Here, the posterior has the same functional form as the prior. Thus, the beta distribution is the conjugate prior for the binomial likelihood.
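The update in (2) amounts to a couple of lines of code. The sketch below uses scipy's beta distribution; the hyper-parameters and counts are illustrative assumptions.

```python
from scipy.stats import beta

# Conjugate update: Beta(a, b) prior + y heads in N tosses -> Beta(a + y, b + N - y) posterior.
a, b = 2.0, 2.0        # illustrative prior hyper-parameters
y, N = 7, 10           # illustrative observed heads and total tosses

posterior = beta(a + y, b + N - y)
print("posterior mean:", posterior.mean())              # (a + y) / (a + b + N)
print("95% credible interval:", posterior.interval(0.95))
```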

Once we have the posterior distribution, we can, for example, use the posterior mean \(\bar{\theta}\) as a point estimate of \(\theta\): \[ \begin{align*} \bar{\theta} = \mathbb{E }[\theta | \mathcal{D}] &= \frac{a+y}{(a+y) + (b+N-y)} \\\\ &= \frac{a+y}{a+b+N}. \end{align*} \] Note: By adjusting the hyper-parameters \(a\) and \(b\), we can control the influence of the prior on the posterior.
If \(a\) and \(b\) are small, the posterior mean will closely reflect the data: \[ \bar{\theta} \approx \frac{y}{N} = \hat{\theta}_{MLE} \] while if \(a\) and \(b\) are large, the posterior mean will be more influenced by the prior.

Often we need to check the standard error of our estimate, which is the posterior standard deviation: \[ \begin{align*} \text{SE }(\theta) &= \sqrt{\text{Var }[\theta | \mathcal{D}]} \\\\ &= \sqrt{\frac{(a+y)(b+N-y)}{(a+b+N)^2(a+b+N+1)}} \end{align*} \] Here, if \(N \gg a, b\), we can approximate the posterior variance as follows: \[ \begin{align*} \text{Var }[\theta | \mathcal{D}] &\approx \frac{y(N-y)}{N^2 \cdot N} \\\\ &= \frac{y}{N^2} - \frac{y^2}{N^3} \\\\ &= \frac{\hat{\theta}(1 - \hat{\theta})}{N} \end{align*} \] where \(\hat{\theta} = \frac{y}{N}\) is the MLE.

Thus, the standard error is given by \[ \text{SE }(\theta) \approx \sqrt{\frac{\hat{\theta}(1 - \hat{\theta})}{N}}. \]
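The sketch below, with illustrative counts chosen so that \(N \gg a, b\), compares the exact posterior standard deviation with this large-\(N\) approximation.

```python
import numpy as np

# Exact posterior SE of a Beta posterior vs. the large-N approximation sqrt(theta_hat(1 - theta_hat)/N).
a, b = 2.0, 2.0            # illustrative hyper-parameters
y, N = 700, 1000           # illustrative counts with N >> a, b

alpha_post, beta_post = a + y, b + N - y
var_exact = (alpha_post * beta_post) / ((alpha_post + beta_post)**2 * (alpha_post + beta_post + 1))

theta_mle = y / N
se_approx = np.sqrt(theta_mle * (1 - theta_mle) / N)

print("exact SE :", np.sqrt(var_exact))
print("approx SE:", se_approx)   # the two agree closely when N >> a, b
```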
From (1) and (2), the marginal likelihood is given by the ratio of the normalization constants (beta functions) of the posterior and the prior: \[ p(\mathcal{D}) = \frac{B(a+y,\, b+N-y)}{B(a, b)}. \] Note: In general, computing the marginal likelihood is too expensive or intractable, but a conjugate prior lets us obtain the exact marginal likelihood easily. Otherwise, we have to resort to approximation methods.
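As a sketch, this ratio can be computed on the log scale with scipy's betaln for numerical stability. The hyper-parameters and counts are illustrative; this is the evidence for the Bernoulli form of the likelihood (the binomial form would add a constant \(\binom{N}{y}\)).

```python
import numpy as np
from scipy.special import betaln

# Marginal likelihood p(D) = B(a + y, b + N - y) / B(a, b), computed via log-beta functions.
a, b = 2.0, 2.0    # illustrative hyper-parameters
y, N = 7, 10       # illustrative counts

log_evidence = betaln(a + y, b + N - y) - betaln(a, b)
print("p(D) =", np.exp(log_evidence))
```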

Finally, to make predictions for new observations, we use the posterior predictive distribution: \[ p(x_{new} | \mathcal{D}) = \int p(x_{new} | \theta) p(\theta | \mathcal{D}) d\theta. \] Again, as with the marginal likelihood, the posterior predictive distribution is generally difficult to compute, but in this case the conjugate prior makes it easy.
For example, the probability of observing a head in the next coin toss is given by: \[ \begin{align*} p(y_{new}=1 | \mathcal{D}) &= \int_0 ^1 p(y_{new}=1 | \theta) p(\theta | \mathcal{D}) d\theta \\\\ &= \int_0 ^1 \theta \, \text{Beta }(\theta | a+y, \, b+N-y) d\theta \\\\ &= \mathbb{E }[\theta|\mathcal{D}] \\\\ &= \frac{a+y}{a+b+N}. \end{align*} \] Note: As you can see, the hyper-parameters \(a\) and \(b\) are critical throughout the inference process. In practice, setting the hyper-parameters is one of the most challenging parts of a project.
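The following sketch, again with illustrative numbers, checks the closed-form predictive probability against a direct numerical evaluation of the integral above.

```python
from scipy.integrate import quad
from scipy.stats import beta

# Posterior predictive probability of heads: closed form vs. numerical integration.
a, b = 2.0, 2.0    # illustrative hyper-parameters
y, N = 7, 10       # illustrative counts

closed_form = (a + y) / (a + b + N)
numerical, _ = quad(lambda t: t * beta.pdf(t, a + y, b + N - y), 0.0, 1.0)

print(closed_form, numerical)   # the two values agree
```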

Univariate Normal Distribution Model

Given the univariate normal distribution \(N(\mu, \sigma^2)\), there are three cases for our target posterior:

  1. \(p(\mu | \mathcal{D}, \sigma^2) \): Variance \(\sigma^2\) is known.
  2. \(p(\sigma^2 | \mathcal{D}, \mu) \): Mean \(\mu\) is known.
  3. \(p(\mu, \sigma^2 | \mathcal{D})\): Both \(\sigma^2\) and \(\mu\) are unknown.

Here, we only discuss cases 1 and 2.
Example 2: Normal distribution model with known variance \(\sigma^2\) In the case where \(\sigma^2\) is a known constant, the likelihood for \(\mu\) is given by \[ \begin{align*} p(\mathcal{D}| \mu) &= \prod_{n=1}^N \frac{1}{\sqrt{2\pi \sigma^2}} \exp \Big(-\frac{(y_n - \mu)^2}{2\sigma^2}\Big)\\\\ &= \Big(\frac{1}{\sqrt{2\pi \sigma^2}}\Big)^N \exp \Big(- \sum_{n=1}^N \frac{(y_n - \mu)^2}{2\sigma^2}\Big) \\\\ &\propto \exp \Big(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \Big). \end{align*} \] The conjugate prior is another normal distribution: \[ \begin{align*} p(\mu) &= N(\mu | \, \mu_0, \sigma_0^2) \\\\ &\propto \exp \Big(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \Big). \end{align*} \]
Now, we can compute the posterior as follows: \[ \begin{align*} p(\mu | \, \mathcal{D}, \sigma^2) &\propto \exp \Big(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \Big) \cdot \exp \Big(-\frac{(\mu - \mu_0)^2}{2\sigma_0^2} \Big) \\\\ &= \exp \Big\{-\frac{1}{2}\Big(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2} \Big)\mu^2 +\Big(\frac{\sum y_n}{\sigma^2} +\frac{\mu_0}{\sigma_0^2} \Big)\mu + \Big(-\frac{1}{2\sigma^2} \sum_{n=1}^N y_n^2 - \frac{\mu_0^2}{2\sigma_0^2} \Big) \Big\} \end{align*} \] Since \(-\frac{1}{2\sigma^2} \sum_{n=1}^N y_n^2 - \frac{\mu_0^2}{2\sigma_0^2}\) is constant with respect to \(\mu\), we ignore it.

Let \(A = \Big(\frac{N}{\sigma^2}+\frac{1}{\sigma_0^2} \Big)\) and \(B =\Big(\frac{\sum y_n}{\sigma^2} +\frac{\mu_0}{\sigma_0^2} \Big)\), and completing the square, we get: \[ \begin{align*} p(\mu | \, \mathcal{D}, \sigma^2) &\propto \exp \Big\{-\frac{1}{2}A\mu^2 + B\mu \Big\} \\\\ &= \exp \Big\{-\frac{1}{2}A \Big(\mu - \frac{B}{A} \Big)^2 + \frac{B^2}{2A}\Big\}. \end{align*} \] Since the term \(\frac{B^2}{2A}\) does not depend on \(\mu\), and \(-\frac{1}{2}A \Big(\mu - \frac{B}{A} \Big)^2\) is the quadratic form of a univariate normal distribution, we conclude that \[ \begin{align*} p(\mu | \, \mathcal{D}, \sigma^2) = N(\mu | \, \mu_N, \sigma_N^2 ) \end{align*} \] where the posterior variance is given by: \[ \begin{align*} \sigma_N^2 &= \frac{1}{A} \\\\ &= \frac{\sigma^2 \sigma_0^2}{N\sigma_0^2 + \sigma^2} \end{align*} \] and the posterior mean is given by: \[ \begin{align*} \mu_N &= \frac{B}{A} \\\\ &= \sigma_N^2 \Big(\frac{\mu_0}{\sigma_0^2} + \frac{N \bar{y}}{\sigma^2} \Big) \\\\ &= \frac{\sigma^2}{N\sigma_0^2 + \sigma^2}\mu_0 + \frac{N\sigma_0^2}{N\sigma_0^2 + \sigma^2}\bar{y} \\\\ \end{align*} \] where \(\bar{y} = \frac{1}{N}\sum_{n=1}^N y_n\) is the empirical mean.

Note: In this case, \(\mu_0\) and \(\sigma_0^2\) are the hyper-parameters of the prior distribution. For example, as can be seen from the form of the posterior mean \(\mu_N\), a small \(\sigma_0^2\) gives more weight to the prior mean \(\mu_0\), while a large \(\sigma_0^2\) reduces the influence of the prior, making the posterior rely more on the data.
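Below is a minimal sketch of these closed-form posterior updates. The known variance, the prior settings, and the simulated data are illustrative assumptions, not values from the text.

```python
import numpy as np

# Closed-form posterior N(mu | mu_N, sigma_N^2) for the normal model with known variance.
rng = np.random.default_rng(1)
sigma2 = 4.0                          # known observation variance sigma^2 (illustrative)
mu0, sigma0_2 = 0.0, 10.0             # prior mean mu_0 and prior variance sigma_0^2 (illustrative)
data = rng.normal(loc=2.5, scale=np.sqrt(sigma2), size=50)   # simulated observations

N, y_bar = data.size, data.mean()
sigma_N2 = (sigma2 * sigma0_2) / (N * sigma0_2 + sigma2)     # posterior variance
mu_N = sigma_N2 * (mu0 / sigma0_2 + N * y_bar / sigma2)      # posterior mean

print("posterior mean:", mu_N)        # close to y_bar because sigma_0^2 is large
print("posterior var :", sigma_N2)
```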
In machine learning, a typical application of this model is online learning, a training approach where the model is updated incrementally as new data become available rather than being trained on a fixed dataset all at once (known as batch learning). Online learning is particularly useful in scenarios where data arrive in a stream or are too large to be processed at once.
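A minimal sketch of this online view, assuming the same illustrative settings and simulated data as in the previous sketch: each observation is absorbed with an \(N = 1\) update, and the resulting posterior becomes the prior for the next observation. The streamed result matches the batch update.

```python
import numpy as np

# Online (streaming) updating with the known-variance normal model.
rng = np.random.default_rng(1)
sigma2 = 4.0                          # known observation variance (illustrative)
mu_prior, var_prior = 0.0, 10.0       # initial prior (illustrative)
data = rng.normal(loc=2.5, scale=np.sqrt(sigma2), size=50)   # same simulated stream as above

for y in data:                                                # process one point at a time
    var_post = (sigma2 * var_prior) / (var_prior + sigma2)    # N = 1 posterior variance
    mu_post = var_post * (mu_prior / var_prior + y / sigma2)  # N = 1 posterior mean
    mu_prior, var_prior = mu_post, var_post                   # posterior becomes the next prior

print("streaming posterior mean:", mu_prior)   # matches the batch result
print("streaming posterior var :", var_prior)
```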

Example 3: Normal distribution model with known mean \(\mu\) In the case where \(\mu\) is a known constant, the likelihood for \(\sigma^2\) is given by \[ \begin{align*} p(\mathcal{D}| \sigma^2) &= \frac{1}{(2 \pi \sigma^2)^{\frac{N}{2}}} \exp \Big(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \Big) \\\\ &\propto (\sigma^2)^{-\frac{N}{2}} \exp \Big(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \Big). \end{align*} \] The "standard" conjugate prior is the inverse gamma distribution: \[ \begin{align*} p(\sigma^2) &= \text{InvGamma }(\sigma^2 | \, a, b) \\\\ &= \frac{b^a}{\Gamma(a)}(\sigma^2)^{-(a+1)} \exp \Big(- \frac{b}{\sigma^2}\Big) \end{align*} \] where \(a > 0\) is the shape parameter and \(b > 0\) is the scale parameter.

The posterior is given by: \[ \begin{align*} p(\sigma^2 | \, \mathcal{D}, \mu) &\propto \Big\{ (\sigma^2)^{-\frac{N}{2}} \exp \Big(-\frac{1}{2\sigma^2} \sum_{n=1}^N (y_n - \mu)^2 \Big) \Big\} \cdot \Big\{\frac{b^a}{\Gamma(a)}(\sigma^2)^{-(a+1)} \exp \Big(- \frac{b}{\sigma^2}\Big) \Big\} \\\\ &\propto (\sigma^2)^{-(a+\frac{N}{2}+1)} \exp \Big\{ - \frac{1}{\sigma^2}\Big(b + \frac{1}{2} \sum_{n=1}^N (y_n - \mu)^2 \Big) \Big\} \end{align*} \] Here, let \(\hat{a} = a + \frac{N}{2}\), and \(\hat{b} = b + \frac{1}{2} \sum_{n=1}^N (y_n - \mu)^2\). Thus, \[ p(\sigma^2 | \, \mathcal{D}, \mu) = \text{InvGamma }(\sigma^2 | \, \hat{a}, \hat{b}). \]
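A minimal sketch of this update, using scipy's invgamma (which is parameterized by a shape and a scale) with illustrative prior settings, a known mean, and simulated data.

```python
import numpy as np
from scipy.stats import invgamma

# Conjugate update for sigma^2 with known mu:
# InvGamma(a, b) prior -> InvGamma(a + N/2, b + 0.5 * sum (y_n - mu)^2) posterior.
rng = np.random.default_rng(3)
mu = 1.0                                  # known mean (illustrative)
a, b = 3.0, 2.0                           # prior shape and scale (illustrative)
data = rng.normal(loc=mu, scale=2.0, size=100)   # simulated data with true sigma^2 = 4

a_hat = a + data.size / 2
b_hat = b + 0.5 * np.sum((data - mu)**2)

posterior = invgamma(a_hat, scale=b_hat)  # scipy: invgamma(shape, scale=scale)
print("posterior mean of sigma^2:", posterior.mean())   # close to the true value 4.0
```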
Alternatively, we can choose the scaled inverse chi-squared distribution as the conjugate prior: \[ \begin{align*} p(\sigma^2) &= \text{ScaledInv- }\chi^2(\sigma^2 | \, \nu_0, \sigma_0^2) \\\\ &= \text{InvGamma }\Big(\sigma^2 | \, \frac{\nu_0}{2}, \frac{\nu_0 \sigma_0^2}{2}\Big) \\\\ &\propto (\sigma^2)^{-\frac{\nu_0}{2}-1} \exp \Big(- \frac{\nu_0 \sigma_0^2}{2\sigma^2} \Big) \end{align*} \] where \(\nu_0\) is the degrees of freedom.
Thus, the posterior is given by: \[ p(\sigma^2 | \, \mathcal{D}, \mu) = \text{ScaledInv- }\chi^2 \Big(\sigma^2 | \, \nu_0 +N, \, \frac{\nu_0 \sigma_0^2 + \sum_{n=1}^N (y_n - \mu)^2}{\nu_0 + N}\Big). \] A benefit of using this form is that we can control the strength of the prior by the single hyper-parameter \(\nu_0\).
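One way to evaluate this form is through its inverse-gamma equivalent given above. The sketch below defines a small helper for that; the prior settings, known mean, and data are illustrative assumptions.

```python
import numpy as np
from scipy.stats import invgamma

# ScaledInv-chi^2(nu, s2) = InvGamma(nu / 2, nu * s2 / 2), so we can reuse scipy's invgamma.
def scaled_inv_chi2(nu, s2):
    return invgamma(nu / 2.0, scale=nu * s2 / 2.0)

nu0, sigma0_2 = 5.0, 3.0                  # prior strength and prior guess for sigma^2 (illustrative)
mu = 1.0                                  # known mean (illustrative)
data = np.array([0.2, 1.5, 2.8, -0.4, 1.1])   # illustrative observations

nu_N = nu0 + data.size
s2_N = (nu0 * sigma0_2 + np.sum((data - mu)**2)) / (nu0 + data.size)

posterior = scaled_inv_chi2(nu_N, s2_N)
print("posterior mean of sigma^2:", posterior.mean())
```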
This is a typical model for quality control and process monitoring: in industrial and manufacturing processes, the mean of a quality characteristic may be known from historical data. In machine learning, a related approach is anomaly detection, where anomalies are identified by comparing new data points to a baseline model.