Credible Intervals
In practice, a probability distribution is often a high-dimensional object, making it difficult to fully capture its characteristics.
In Bayesian statistics, we frequently summarize the uncertainty of a posterior distribution using a credible interval.
A credible interval is not the same as a confidence interval in frequentist statistics, even though they may sometimes numerically coincide.
Formally, for a given significance level \(\alpha \in (0,1)\), a \(100(1-\alpha)\%\) credible interval is defined as:
\[
C_{\alpha}(\mathcal{D}) = (l, u) \quad \text{such that} \quad P(l \leq \theta \leq u \mid \mathcal{D}) = 1 - \alpha,
\]
where \(l\) and \(u\) denote the lower and upper bounds, respectively. Here, \(\mathcal{D}\) represents the observed data, and the probability is computed under the posterior distribution \(p(\theta \mid \mathcal{D})\).
This interval contains \(1-\alpha\) of the posterior probability mass. A common choice is the central credible interval, in which \(\frac{\alpha}{2}\) of the posterior probability lies in each tail.
If the cumulative distribution function (cdf) \(F\) of the posterior distribution is known, the central credible interval can be computed as:
\[
\begin{align*}
l &= F^{-1}\left( \frac{\alpha}{2} \right), \\\\
u &= F^{-1}\left( 1 - \frac{\alpha}{2} \right),
\end{align*}
\]
where \(F^{-1}\) denotes the quantile function (the inverse of the cdf).
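In code, this amounts to evaluating the quantile function at \(\frac{\alpha}{2}\) and \(1 - \frac{\alpha}{2}\). Here is a minimal sketch using scipy; the Gamma posterior is a hypothetical stand-in chosen purely for illustration:

```python
import scipy.stats as st

alpha = 0.05

# Hypothetical posterior: Gamma(a=3, scale=2), used only for illustration.
posterior = st.gamma(a=3, scale=2)

# Central credible interval: invert the cdf at alpha/2 and 1 - alpha/2.
l = posterior.ppf(alpha / 2)      # F^{-1}(alpha / 2)
u = posterior.ppf(1 - alpha / 2)  # F^{-1}(1 - alpha / 2)

print(f"{100 * (1 - alpha):.0f}% central credible interval: ({l:.3f}, {u:.3f})")
```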
As an example, if the posterior distribution is approximately normal with posterior mean \(\mu\) and posterior standard deviation \(\sigma\),
a rough \(95\%\) credible interval can be set as:
\[
(l, u) = (\mu - 2\sigma, \mu + 2\sigma),
\]
because for a standard normal distribution \(\mathcal{N}(0,1)\), approximately \(95\%\) of the probability mass lies within two standard deviations from the mean.
More precisely, when \(p(\theta \mid \mathcal{D}) = \mathcal{N}(\mu, \sigma^2)\) and \(\alpha = 0.05\), the central credible interval endpoints are:
\[
\begin{align*}
l &= \mu + \sigma \Phi^{-1}\left( \frac{\alpha}{2} \right), \\\\
u &= \mu + \sigma \Phi^{-1}\left( 1 - \frac{\alpha}{2} \right),
\end{align*}
\]
where \(\Phi\) is the cumulative distribution function (cdf) of the standard normal distribution.
In particular, if the posterior distribution is the standard normal \(p(\theta \mid \mathcal{D}) = \mathcal{N}(0,1)\), then:
\[
\begin{align*}
l &= \Phi^{-1}(0.025) \approx -1.96, \\\\
u &= \Phi^{-1}(0.975) \approx 1.96,
\end{align*}
\]
so the \(95\%\) credible interval is approximately \((-1.96, 1.96)\).
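These endpoints are easy to check numerically; below is a minimal sketch, assuming the posterior mean and standard deviation are given:

```python
from scipy.stats import norm

alpha = 0.05
mu, sigma = 0.0, 1.0  # assumed posterior mean and standard deviation

l = mu + sigma * norm.ppf(alpha / 2)      # about -1.96 when mu = 0, sigma = 1
u = mu + sigma * norm.ppf(1 - alpha / 2)  # about +1.96 when mu = 0, sigma = 1

print(f"95% central credible interval: ({l:.2f}, {u:.2f})")
```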
Monte Carlo Approximation
In general, it is often difficult to compute the inverse cdf of the posterior distribution exactly.
In such cases, a simple and practical alternative is to use a Monte Carlo approximation
based on samples drawn from the posterior.
Suppose we can generate samples \(\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(S)}\) independently from the posterior \(p(\theta \mid \mathcal{D})\).
Then, to approximate a \(100(1-\alpha)\%\) credible interval:
- Sort the samples in increasing order: \(\theta^{(1)} \leq \theta^{(2)} \leq \cdots \leq \theta^{(S)}\).
- Set the lower endpoint \(l\) to be the \(\left(\frac{\alpha}{2}\right)\)-quantile, and the upper endpoint \(u\) to be the \(\left(1 - \frac{\alpha}{2}\right)\)-quantile.
More precisely, approximate:
\[
\begin{align*}
l &\approx \theta^{\left(\left\lceil S \times \frac{\alpha}{2} \right\rceil\right)}, \\\\
u &\approx \theta^{\left(\left\lceil S \times \left(1 - \frac{\alpha}{2}\right) \right\rceil\right)},
\end{align*}
\]
where \(\lceil \cdot \rceil\) denotes the ceiling function, rounding up to the nearest integer if necessary.
Intuitively, instead of computing the inverse cdf analytically, we use the empirical cumulative distribution function (ecdf)
formed by the sorted samples to estimate the desired quantiles.
As the number of samples \(S \to \infty\), the empirical quantiles converge to the true quantiles of the posterior by the law of large numbers.
This Monte Carlo method is especially powerful when the posterior distribution is complex or high-dimensional, and no closed-form expression
for the cdf or its inverse is available.
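A minimal sketch of this sample-based approach, drawing from the same hypothetical Gamma posterior as before so the result can be compared with the exact quantiles:

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)
alpha = 0.05
S = 100_000

# Pretend these are draws from the posterior p(theta | D).
samples = st.gamma(a=3, scale=2).rvs(size=S, random_state=rng)

# Sort, then pick the ceil(S * alpha/2)-th and ceil(S * (1 - alpha/2))-th order statistics.
sorted_samples = np.sort(samples)
l = sorted_samples[int(np.ceil(S * alpha / 2)) - 1]        # -1 for 0-based indexing
u = sorted_samples[int(np.ceil(S * (1 - alpha / 2))) - 1]

# Equivalent shortcut: np.quantile(samples, [alpha / 2, 1 - alpha / 2])
print(f"Monte Carlo 95% central credible interval: ({l:.3f}, {u:.3f})")
```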
Demo: Monte Carlo Approximation for Credible Intervals
Note: This demo does not perform full Bayesian inference from data. Instead, it uses predefined probability distributions
to simulate "posterior-like" behavior. This allows us to illustrate how Monte Carlo methods are used to approximate credible
intervals, particularly central and HPD intervals, without needing a prior or likelihood.
Monte Carlo methods are widely used in Bayesian statistics to approximate posterior distributions, especially
when closed-form solutions are unavailable. One common application is estimating credible intervals using random
samples drawn from the posterior.
To construct a central credible interval via Monte Carlo sampling:
- Draw many random samples from the posterior distribution
- Sort the samples in ascending order
- Select the appropriate quantiles (e.g., the 2.5% and 97.5% quantiles for a 95% interval)
As the number of samples increases, the approximation of the posterior improves.
However, central credible intervals based on quantiles may not accurately reflect the distribution's
shape — especially in skewed or multimodal cases.
For example, in a bimodal distribution (with two distinct peaks), the interval may:
- Fall in a low-probability region between the modes
- Omit one of the peaks entirely
- Misrepresent the true uncertainty or structure of the posterior
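A small sketch of this failure mode, using a hypothetical 50/50 mixture of two well-separated normals as the "posterior"; the central 95% interval spans the near-zero-density valley between the modes:

```python
import numpy as np

rng = np.random.default_rng(1)
S = 100_000

# Hypothetical bimodal "posterior": 50/50 mixture of N(-4, 1) and N(+4, 1).
component = rng.integers(0, 2, size=S)
samples = np.where(component == 0, rng.normal(-4, 1, S), rng.normal(4, 1, S))

# Central 95% credible interval from the empirical quantiles.
l, u = np.quantile(samples, [0.025, 0.975])
print(f"Central 95% interval: ({l:.2f}, {u:.2f})")  # roughly (-5.6, 5.6), straddling the low-density region near 0
```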
In such cases, highest posterior density (HPD) regions are preferred.
They represent the region(s) of highest probability mass and always include the most probable values,
even if the result is disjoint.
Highest Posterior Density (HPD)
The highest posterior density (HPD) region is defined as the set of parameter values \(\theta\) where the posterior
density exceeds a threshold \(p^*\), with \(p^*\) chosen so that the total probability mass within this region equals the desired level \(1 - \alpha\):
\[
1 - \alpha = \int_{\theta: p(\theta \mid \mathcal{D}) > p^*} p(\theta \mid \mathcal{D}) \, d\theta
\]
\[
C_{\alpha} (\mathcal{D}) = \{\theta : p(\theta \mid \mathcal{D}) \geq p^*\}
\]
In one dimension, this HPD region corresponds to the interval(s) containing the highest probability density values,
with total mass equal to the target probability level (e.g., 95%). In this case, HPD regions are sometimes also called
highest density intervals (HDIs).
Unlike central credible intervals, which place probability \(\frac{\alpha}{2}\) in each tail, HPD regions are not tied to the tails: they are simply the smallest region(s) containing the required probability mass.
This makes them especially useful when the posterior distribution is skewed or multimodal.
In such cases, HPD intervals better capture the regions of highest posterior density.
In the demo, you can observe that the HPD interval is often narrower than the central credible interval, and may exclude low-density regions
that the central interval mistakenly includes.
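For a one-dimensional, unimodal posterior, a common sample-based approximation of the HPD interval (the HDI in this setting) is the shortest interval containing a fraction \(1-\alpha\) of the sorted draws. Below is a minimal sketch, reusing the hypothetical skewed Gamma posterior:

```python
import numpy as np
import scipy.stats as st

def hpd_interval(samples, alpha=0.05):
    """Shortest interval containing (1 - alpha) of the samples (assumes a unimodal posterior)."""
    x = np.sort(samples)
    S = len(x)
    n_in = int(np.ceil((1 - alpha) * S))       # number of samples the interval must cover
    widths = x[n_in - 1:] - x[:S - n_in + 1]   # widths of all candidate intervals
    i = int(np.argmin(widths))                 # index of the shortest candidate
    return x[i], x[i + n_in - 1]

rng = np.random.default_rng(2)
samples = st.gamma(a=3, scale=2).rvs(size=100_000, random_state=rng)

l_hpd, u_hpd = hpd_interval(samples)
l_c, u_c = np.quantile(samples, [0.025, 0.975])
print(f"HPD     95% interval: ({l_hpd:.3f}, {u_hpd:.3f})")  # shorter, shifted toward the mode
print(f"Central 95% interval: ({l_c:.3f}, {u_c:.3f})")
```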
Markov Chain Monte Carlo (MCMC)
So far, our demo uses predefined distributions to illustrate how Monte Carlo methods can approximate credible intervals.
However, in real Bayesian inference, the full posterior distribution \( p(\theta \mid \mathcal{D}) \) is rarely
available in closed form.
Instead, we typically work with the unnormalized posterior:
\[
p(\theta \mid \mathcal{D}) \propto p(\mathcal{D} \mid \theta) \cdot p(\theta) = p^*(\theta)
\]
where
- \( p(\mathcal{D} \mid \theta) \): the likelihood, expressing how probable the observed data is for a given parameter value
- \( p(\theta) \): the prior belief about the parameter before seeing data
This proportionality holds because the normalizing constant \( p(\mathcal{D}) \) does not depend on \(\theta\) and can be
ignored during sampling.
Remember, to compute the actual posterior \( p(\theta \mid \mathcal{D}) \), we would need to normalize \( p^*(\theta) \) by computing:
\[
p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta) \cdot p(\theta) \, d\theta
\]
which is often intractable, especially in high-dimensional problems.
This is where Markov Chain Monte Carlo (MCMC) methods become essential. MCMC algorithms construct
a Markov chain whose stationary distribution is the posterior, so that the chain's long-run samples
behave like draws from it. This allows us to generate samples from the posterior without needing the normalization
constant \( p(\mathcal{D}) \).
The following are three classic MCMC methods that are foundational to Bayesian computation:
- Metropolis-Hastings: A general-purpose algorithm that proposes new samples and accepts or rejects them based on a probability ratio (a minimal sketch follows this list)
- Gibbs Sampling: Efficient when conditional distributions are known, commonly used in simpler models and probabilistic graphical models
- Hamiltonian Monte Carlo (HMC): A gradient-based method well-suited for high-dimensional posteriors; still widely used in probabilistic programming languages like Stan and PyMC
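Below is a minimal random-walk Metropolis-Hastings sketch (not the demo's code); the target \(p^*(\theta)\) is a hypothetical unnormalized posterior with a standard-normal shape, and the proposal step size is an arbitrary choice:

```python
import numpy as np

def log_unnormalized_posterior(theta):
    """Hypothetical log p*(theta) = log likelihood + log prior (here simply a standard-normal shape)."""
    return -0.5 * theta**2

def metropolis_hastings(log_p_star, n_samples=50_000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    theta = 0.0                                 # arbitrary starting point
    samples = np.empty(n_samples)
    for s in range(n_samples):
        proposal = theta + step * rng.normal()  # symmetric random-walk proposal
        log_ratio = log_p_star(proposal) - log_p_star(theta)
        if np.log(rng.uniform()) < log_ratio:   # accept with probability min(1, ratio)
            theta = proposal
        samples[s] = theta
    return samples

samples = metropolis_hastings(log_unnormalized_posterior)
l, u = np.quantile(samples[5_000:], [0.025, 0.975])       # discard burn-in, then take empirical quantiles
print(f"MCMC 95% credible interval: ({l:.2f}, {u:.2f})")   # should be roughly (-1.96, 1.96)
```

Note that only the unnormalized log-density is evaluated; the acceptance ratio cancels the unknown constant \( p(\mathcal{D}) \), which is exactly why MCMC sidesteps the intractable integral above.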
While these methods are not used to train large-scale models like LLMs, they remain essential for Bayesian modeling, especially in scientific and uncertainty-aware applications. We will revisit
each of them in the future.
In modern machine learning and AI — particularly in large-scale models like transformers or Large Language Models (LLMs) — MCMC is not used for training, due to
its high computational cost and poor scalability in very high-dimensional parameter spaces. These models are instead trained
using gradient-based optimization (e.g., stochastic gradient descent or Adam) with maximum likelihood estimation.
However, Bayesian ideas remain important. Methods inspired by MCMC are still used for uncertainty quantification,
probabilistic modeling, simulation-based inference, and decision-making under uncertainty.