Introduction
Importance sampling is a fundamental Monte Carlo technique for estimating expectations under one probability
distribution by sampling from a different, more convenient distribution. This approach addresses two critical challenges in
computational statistics and machine learning:
- Direct sampling from the target distribution is difficult or impossible.
- We want to focus computational resources on the most "important" regions of the distribution.
Suppose we want to compute the expectation of a target function \(\varphi(\mathbf{x})\) under a
target distribution \(\pi(\mathbf{x})\):
\[
I = \mathbb{E}_{\pi}[\varphi(\mathbf{x})] = \int \varphi(\mathbf{x}) \pi(\mathbf{x}) \, d\mathbf{x}
\]
Standard Monte Carlo estimation would draw independent samples
\(\mathbf{x}_1, \ldots, \mathbf{x}_{N_s} \sim \pi(\mathbf{x})\) and approximate:
\[
I \approx \hat{I}_{\text{MC}} = \frac{1}{N_s} \sum_{n=1}^{N_s} \varphi(\mathbf{x}_n).
\]
However, this approach fails when:
- Sampling from \(\pi(\mathbf{x})\) is computationally expensive or infeasible
- The normalizing constant of \(\pi(\mathbf{x})\) is unknown (common in Bayesian inference)
- \(\varphi(\mathbf{x})\) has large values only in regions where \(\pi(\mathbf{x})\) has low probability (rare events)
Importance sampling solves these problems by introducing a proposal distribution
(or importance distribution) \(q(\mathbf{x})\) from which we can easily sample, then reweighting samples
to account for the distribution mismatch.
Importance sampling has become essential in modern AI, particularly in the following domains:
1. Reinforcement Learning from Human Feedback (RLHF)
Modern large language models (LLMs) like ChatGPT are fine-tuned using RLHF, which fundamentally relies on
importance sampling. The Proximal Policy Optimization (PPO) algorithm uses importance sampling to
enable multiple gradient updates from the same batch of collected data:
\[
L(\theta) = \mathbb{E}_{(s,a) \sim \mu}\left[\min\left(\frac{\pi_\theta(a|s)}{\mu(a|s)} A(s,a), \text{clip}\left(\frac{\pi_\theta(a|s)}{\mu(a|s)}, 1-\epsilon, 1+\epsilon\right) A(s,a)\right)\right]
\]
where \(\mu\) is the behavior policy (data collection policy) and \(\pi_\theta\) is the current policy being optimized.
The importance ratio \(\pi_\theta/\mu\) allows us to reuse data collected under \(\mu\) for multiple updates
to \(\pi_\theta\). PPO's clipping mechanism (typically with \(\epsilon = 0.2\)) prevents this ratio from becoming
too large, ensuring stable training. Without importance sampling, we would need fresh on-policy data after
every gradient step, making RLHF computationally prohibitive.
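To make the objective concrete, here is a minimal NumPy sketch of the clipped surrogate above; the function name and the toy numbers are illustrative choices, not part of any particular RLHF library.

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, epsilon=0.2):
    """Clipped surrogate objective: mean over (s, a) pairs sampled from the behavior policy mu."""
    ratio = np.exp(logp_new - logp_old)                       # importance ratio pi_theta / mu
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    return np.mean(np.minimum(unclipped, clipped))            # Monte Carlo estimate of the expectation

# Toy numbers: three (s, a) pairs collected under mu, re-evaluated under the current policy.
logp_old = np.log([0.20, 0.50, 0.10])
logp_new = np.log([0.30, 0.40, 0.05])
advantages = np.array([1.0, -0.5, 2.0])
print(ppo_clip_objective(logp_new, logp_old, advantages))
```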
2. Off-Policy Reinforcement Learning
In robotics, autonomous systems, and game AI, collecting on-policy data can be dangerous or expensive. Importance sampling
enables learning from historical data, human demonstrations, or safer exploration policies. This is crucial for algorithms
like Soft Actor-Critic (SAC), Conservative Q-Learning (CQL), and offline RL methods.
3. Variational Inference and Deep Generative Models
Importance-weighted autoencoders (IWAE) use importance sampling to obtain a tighter lower bound on the log-likelihood than the standard evidence lower bound (ELBO),
leading to better generative models and more accurate uncertainty estimates in Bayesian deep learning.
4. Imbalanced Data and Rare Events
In applications like fraud detection, medical diagnosis of rare diseases, and AI safety, the events we care most about
are extremely rare. Importance sampling allows us to oversample these critical cases and reweight appropriately, ensuring
models do not ignore rare but important scenarios.
5. Computing Normalizing Constants
Many probabilistic models in AI (Boltzmann machines, energy-based models, certain Bayesian posteriors) have intractable
normalizing constants. Importance sampling, particularly annealed importance sampling, provides practical methods to
estimate these constants, which is essential for model comparison and learning.
The critical insight is that importance sampling enables sample efficiency: in an era where data collection
and computation are expensive, the ability to reuse and reweight existing data rather than collecting new samples is invaluable.
This makes importance sampling one of the foundational techniques enabling practical, scalable AI systems.
Direct Importance Sampling
The key mathematical insight behind importance sampling is the change of measure. We can rewrite the expectation under
the target distribution \(\pi(\mathbf{x})\) as an expectation under some proposal distribution \(q(\mathbf{x})\):
\[
\begin{align*}
I &= \int \varphi(\mathbf{x}) \pi(\mathbf{x}) \, d\mathbf{x} \\\\
&= \int \varphi(\mathbf{x}) \frac{\pi(\mathbf{x})}{q(\mathbf{x})} q(\mathbf{x}) \, d\mathbf{x} \\\\
&= \mathbb{E}_{q}\left[\varphi(\mathbf{x}) \frac{\pi(\mathbf{x})}{q(\mathbf{x})}\right]
\end{align*}
\]
This identity holds for any \(q(\mathbf{x})\) that satisfies the support condition:
\(q(\mathbf{x}) > 0\) whenever \(\pi(\mathbf{x}) \neq 0\).
The ratio
\[
\tilde{w}_n = \frac{\pi(\mathbf{x}_n)}{q(\mathbf{x}_n)}
\]
is called the importance weight. It measures how much more (or less) likely \(\mathbf{x}_n\) is under the target
distribution \(\pi\) than under the proposal distribution \(q\). Note that although \(\pi\) and \(q\) are normalized
distributions, the weights \(\tilde{w}_n\) are not constrained to sum to any particular value.
Assume that the normalized density \(\pi(\mathbf{x})\) can be evaluated but not sampled from.
If we draw \(N_s\) samples \(\mathbf{x}_n \sim q(\mathbf{x})\), the direct importance sampling estimator is:
\[
\mathbb{E}[\varphi(\mathbf{x})] \approx \hat{I}_{\text{IS}} = \frac{1}{N_s} \sum_{n=1}^{N_s} \tilde{w}_n \varphi(\mathbf{x}_n).
\]
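As a sanity check of this estimator, the following sketch uses a toy problem of my own choosing: target \(\pi = \mathcal{N}(0, 1)\), \(\varphi(x) = x^2\) (so the true value is \(I = 1\)), and a wider Gaussian proposal.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

phi = lambda x: x**2                     # E_pi[x^2] = 1 for a standard normal target
target = norm(loc=0.0, scale=1.0)        # pi(x), normalized
proposal = norm(loc=0.0, scale=2.0)      # q(x), wider tails so the support condition holds

N_s = 100_000
x = rng.normal(0.0, 2.0, size=N_s)       # x_n ~ q
w = target.pdf(x) / proposal.pdf(x)      # importance weights pi(x_n) / q(x_n)
print(np.mean(w * phi(x)))               # direct IS estimate, close to 1.0
```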
This estimator is unbiased:
\[
\begin{align*}
\mathbb{E}_q[\hat{I}_{\text{IS}}] &= \mathbb{E}_q\left[\frac{1}{N_s}\sum_{n=1}^{N_s} \tilde{w}(\mathbf{x}_n)\varphi(\mathbf{x}_n)\right] \\\\
&= \mathbb{E}_q[\tilde{w}(\mathbf{x})\varphi(\mathbf{x})] \quad \text{(since the samples are i.i.d.)} \\\\
&= I
\end{align*}
\]
While the estimator is unbiased, its variance depends critically on the choice of \(q\):
\[
\text{Var}[\hat{I}_{\text{IS}}] = \frac{1}{N_s}\left(\mathbb{E}_q[\tilde{w}^2(\mathbf{x})\varphi^2(\mathbf{x})] - I^2\right)
\]
If the importance weights vary dramatically, a few samples will dominate the estimate, leading to high variance.
This is quantified by the effective sample size (ESS), which measures the number of independent
samples that would provide the same statistical efficiency as the current weighted sample. The ESS can be computed
in two equivalent ways:
Using unnormalized weights \(\tilde{w}_n\):
\[
\text{ESS} = \frac{\left(\sum_{n=1}^{N_s} \tilde{w}_n\right)^2}{\sum_{n=1}^{N_s} \tilde{w}_n^2}
\]
Using normalized weights \(W_n = \tilde{w}_n / \sum_{n'=1}^{N_s} \tilde{w}_{n'}\):
\[
\text{ESS} = \frac{1}{\sum_{n=1}^{N_s} W_n^2}
\]
These formulas are mathematically equivalent. The ESS ranges from 1 (maximum degeneracy, where all weight
concentrates on one sample) to \(N_s\) (no degeneracy, where all weights are equal). A good rule of thumb is
that if \(\text{ESS} < N_s/2\), the proposal distribution \(q(\mathbf{x})\) should be reconsidered, as this
indicates that many samples have negligible contribution to the estimate.
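Both ESS formulas are cheap to compute; the sketch below (variable names are my own) confirms they agree on an arbitrary weight vector.

```python
import numpy as np

def ess_from_unnormalized(w_tilde):
    """ESS from unnormalized weights: (sum w)^2 / sum w^2."""
    return np.sum(w_tilde) ** 2 / np.sum(w_tilde ** 2)

def ess_from_normalized(w_tilde):
    """ESS from normalized weights: 1 / sum W_n^2."""
    W = w_tilde / np.sum(w_tilde)
    return 1.0 / np.sum(W ** 2)

w_tilde = np.array([0.1, 2.0, 0.5, 0.01, 1.2])
print(ess_from_unnormalized(w_tilde), ess_from_normalized(w_tilde))  # identical values
```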
Requirements for Direct Importance Sampling
For direct importance sampling to be valid and practical:
- Support condition: \(q(\mathbf{x}) > 0\) whenever \(\pi(\mathbf{x}) > 0\)
- Normalization: Both \(\pi(\mathbf{x})\) and \(q(\mathbf{x})\) must be properly normalized probability distributions
- Evaluability: We must be able to evaluate \(\pi(\mathbf{x})\) and \(q(\mathbf{x})\) for any \(\mathbf{x}\)
- Sampling: We must be able to draw samples from \(q(\mathbf{x})\) efficiently
Self-Normalized Importance Sampling
In many practical applications, we can only evaluate the unnormalized target distribution:
\[
\tilde{\gamma}(\mathbf{x}) = Z \pi(\mathbf{x})
\]
where the normalization constant
\[
Z = \int \tilde{\gamma}(\mathbf{x})d\mathbf{x}
\]
is unknown or intractable. This situation is ubiquitous in Bayesian inference (where \(Z\) is the marginal
likelihood), energy-based models (where \(Z\) is the partition function), and many other machine learning applications.
Self-normalized importance sampling (SNIS) addresses this by working with unnormalized weights
and normalizing them within the estimator itself. We can rewrite the expectation as a ratio of two integrals:
\[
\begin{align*}
\mathbb{E}[\varphi(\mathbf{x})] &= \int \varphi(\mathbf{x}) \pi(\mathbf{x})d\mathbf{x} \\\\
&= \frac{\int \varphi(\mathbf{x})\tilde{\gamma}(\mathbf{x})d\mathbf{x}}
{\int \tilde{\gamma}(\mathbf{x})d\mathbf{x}} \\\\
&= \frac{\int \varphi(\mathbf{x})\frac{\tilde{\gamma}(\mathbf{x})}{q(\mathbf{x})}q(\mathbf{x})d\mathbf{x}}
{\int \frac{\tilde{\gamma}(\mathbf{x})}{q(\mathbf{x})}q(\mathbf{x})d\mathbf{x}} \\\\
&= \frac{\mathbb{E}_{q}\left[\varphi(\mathbf{x}) \frac{\tilde{\gamma}(\mathbf{x})}{q(\mathbf{x})}\right]}
{\mathbb{E}_{q}\left[\frac{\tilde{\gamma}(\mathbf{x})}{q(\mathbf{x})}\right]}
\end{align*}
\]
Sampling \(\mathbf{x}_n \sim q(\mathbf{x})\) for \(n = 1, \ldots, N_s\), we define the unnormalized importance weights:
\[
\tilde{w}_n = \frac{\tilde{\gamma}(\mathbf{x}_n)}{q(\mathbf{x}_n)}
\]
The self-normalized importance sampling estimator approximates both the numerator and denominator:
\[
\mathbb{E}[\varphi(\mathbf{x})] \approx \hat{I}_{\text{SNIS}} = \frac{\sum_{n=1}^{N_s} \tilde{w}_n\varphi(\mathbf{x}_n)}
{\sum_{n=1}^{N_s}\tilde{w}_n}
= \sum_{n=1}^{N_s} W_n \varphi(\mathbf{x}_n)
\]
where \(W_n = \tilde{w}_n/\sum_{n'=1}^{N_s}\tilde{w}_{n'}\) are the normalized weights introduced in the previous section.
This estimator is biased for finite \(N_s\) because it involves a ratio of random variables,
but it is consistent: as \(N_s \to \infty\), the bias vanishes and the estimator converges
almost surely to the true expectation.
Despite the bias, SNIS is the standard approach in practice for several important reasons.
Most importantly, it only requires evaluating unnormalized densities \(\tilde{\gamma}(\mathbf{x})\)
and \(q(\mathbf{x})\), making it applicable when the normalization constant \(Z\) is intractable—a
common situation in Bayesian inference, energy-based models, and many machine learning applications.
While SNIS can have a positive effect on the dispersion (variance) of the estimator in certain cases,
it does not universally reduce variance compared to direct importance sampling. The key advantage is
practical: SNIS enables estimation when direct importance sampling is impossible due to unknown
normalization constants.
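A minimal SNIS sketch, again on a toy problem of my own choosing: the target is an unnormalized standard normal (we pretend not to know \(Z = \sqrt{2\pi}\)), and we estimate \(\mathbb{E}[x^2] = 1\).

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

log_gamma = lambda x: -0.5 * x**2        # unnormalized target: gamma(x) = exp(-x^2 / 2)
proposal = norm(loc=0.0, scale=2.0)      # q(x), easy to sample from and evaluate
phi = lambda x: x**2                     # true value under the target: 1.0

N_s = 100_000
x = rng.normal(0.0, 2.0, size=N_s)
log_w = log_gamma(x) - proposal.logpdf(x)        # log of unnormalized weights
log_w -= np.max(log_w)                           # stabilize before exponentiating
W = np.exp(log_w) / np.sum(np.exp(log_w))        # normalized weights W_n
print(np.sum(W * phi(x)))                        # SNIS estimate, close to 1.0
```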
Annealed Importance Sampling (AIS)
When the target distribution and proposal distribution have substantial distributional mismatch,
importance sampling often fails due to extremely high variance in the importance weights.
Most samples from the proposal will have near-zero weight under the target, while a few rare samples
will have enormous weights — a phenomenon known as weight degeneracy.
This leads to a low effective sample size, meaning that despite drawing many samples,
few actually contribute meaningfully to the estimate, making the results unreliable.
Annealed importance sampling (AIS) addresses this fundamental limitation by introducing a sequence
of intermediate distributions that smoothly bridge the gap between the proposal and target distributions,
dramatically reducing variance while maintaining unbiasedness.
Suppose we wish to sample from a complicated target distribution:
\[
p_0(\mathbf{x}) \propto f_0(\mathbf{x}),
\]
where sampling is difficult because \(p_0\) may be high-dimensional and/or multimodal.
Suppose that we have a proposal distribution from which we can easily sample (e.g., prior distribution):
\[
p_n(\mathbf{x}) \propto f_n(\mathbf{x}).
\]
We construct a sequence of distributions \(p_j(\mathbf{x}) \propto f_j(\mathbf{x})\), \(j = 0, \ldots, n\), bridging the proposal \(p_n\) and the target \(p_0\):
\[
f_j (\mathbf{x}) = f_0(\mathbf{x})^{\beta_j}f_n(\mathbf{x})^{1 - \beta_j}
\]
where \(1 = \beta_0 > \beta_1 > \cdots > \beta_n = 0\) defines an annealing schedule.
Each \(\beta_j\) acts as an inverse temperature parameter interpolating between the proposal and the target.
Typical schedules include (a short sketch of the first two follows this list):
- Linear: \(\beta_j = \frac{n - j}{n}\)
- Geometric: \(\beta_j = \left( \frac{n - j}{n} \right)^{\alpha}\), for some \(\alpha > 0\)
- Adaptive: chosen to maintain approximately constant effective sample size (ESS) between steps
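The linear and geometric schedules are one-liners; the helper below (its name and signature are my own) returns \(\beta_0 = 1 > \cdots > \beta_n = 0\).

```python
import numpy as np

def annealing_schedule(n, kind="linear", alpha=2.0):
    """Return beta_0 = 1 > beta_1 > ... > beta_n = 0 over n + 1 temperatures."""
    j = np.arange(n + 1)
    if kind == "linear":
        return (n - j) / n
    if kind == "geometric":
        return ((n - j) / n) ** alpha
    raise ValueError(f"unknown schedule: {kind}")

print(annealing_schedule(5, "linear"))      # [1.   0.8  0.6  0.4  0.2  0.  ]
print(annealing_schedule(5, "geometric"))   # [1.   0.64 0.36 0.16 0.04 0.  ]
```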
To sample from each intermediate distribution \(p_j\), we use a Markov chain transition kernel:
\[
T_j(\mathbf{x}, \mathbf{x}') = p_j(\mathbf{x}' \mid \mathbf{x})
\]
which leaves \(p_j\) invariant, i.e.,
\[
\int p_j(\mathbf{x})T_j(\mathbf{x}, \mathbf{x}')d\mathbf{x} = p_j(\mathbf{x}').
\]
Common choices for \(T_j\) include Metropolis-Hastings or Hamiltonian Monte Carlo,
applied for a few iterations to adapt samples toward \(p_j\).
The AIS procedure proceeds as follows:
- Sample \(\mathbf{v}_n \sim p_n\) from the proposal distribution.
- Apply successive Markov transitions to obtain \(\mathbf{v}_0\):
\[
\mathbf{v}_{n-1} \sim T_{n-1}(\mathbf{v}_n, \cdot), \quad
\mathbf{v}_{n-2} \sim T_{n-2}(\mathbf{v}_{n-1}, \cdot), \; \ldots, \;
\mathbf{v}_0 \sim T_0(\mathbf{v}_1, \cdot).
\]
Each AIS run yields a trajectory
\((\mathbf{v}_n, \mathbf{v}_{n-1}, \ldots, \mathbf{v}_0)\)
with an associated importance weight:
\[
w =
\frac{f_{n-1}(\mathbf{v}_{n})}{f_n(\mathbf{v}_{n})}\,
\frac{f_{n-2}(\mathbf{v}_{n-1})}{f_{n-1}(\mathbf{v}_{n-1})}
\cdots
\frac{f_1(\mathbf{v}_2)}{f_2(\mathbf{v}_2)}\,
\frac{f_0(\mathbf{v}_1)}{f_1(\mathbf{v}_1)}. \tag{1}
\]
Each ratio compares successive annealed densities at the state just before the corresponding transition is applied
(the final state \(\mathbf{v}_0\), produced by \(T_0\), does not enter the weight). Because neighboring \(f_j\) differ
only slightly, each ratio is typically close to 1, which keeps the weights stable and reduces variance.
Proof:
Consider the distribution on an extended state space
\(\mathbf{v} = (\mathbf{v}_0, \ldots, \mathbf{v}_n)\):
\[
\begin{align*}
p(\mathbf{v}) &\propto \varphi(\mathbf{v}) = f_0(\mathbf{v}_0)\,\tilde{T}_0(\mathbf{v}_0, \mathbf{v}_1)\,\tilde{T}_1(\mathbf{v}_1, \mathbf{v}_2)\cdots\tilde{T}_{n-1}(\mathbf{v}_{n-1}, \mathbf{v}_n) \\\\
&\propto p(\mathbf{v}_0)p(\mathbf{v}_1 \mid \mathbf{v}_0) \cdots p(\mathbf{v}_n \mid \mathbf{v}_{n-1})
\end{align*}
\]
where \(\tilde{T}_j\) is the reversal of \(T_j\):
\[
\begin{align*}
\tilde{T}_j(\mathbf{v}, \mathbf{v}') &= T_j(\mathbf{v}', \mathbf{v})p_j(\mathbf{v}') / p_j(\mathbf{v}) \\\\
&= T_j(\mathbf{v}', \mathbf{v})f_j(\mathbf{v}') / f_j(\mathbf{v})
\end{align*}
\]
The invariance of \(p_j\) with respect to \(T_j\) ensures that these are valid transition probabilities.
Since each \(\tilde{T}_j\) integrates to 1 over its second argument, integrating out \(\mathbf{v}_1, \ldots, \mathbf{v}_n\) gives
\[
\int \varphi(\mathbf{v}) \, d\mathbf{v}_1 \cdots d\mathbf{v}_n = f_0(\mathbf{v}_0),
\]
and thus by sampling from \(p(\mathbf{v})\) and keeping \(\mathbf{v}_0\), we effectively sample from \(p_0(\mathbf{x})\).
Moreover, we can sample on this extended state space using the AIS procedure, which corresponds to
the following proposal:
\[
\begin{align*}
q(\mathbf{v}) &\propto g(\mathbf{v}) = f_n(\mathbf{v}_n)\,T_{n-1}(\mathbf{v}_n, \mathbf{v}_{n-1})\cdots T_1(\mathbf{v}_2, \mathbf{v}_1)\,T_0(\mathbf{v}_1, \mathbf{v}_0) \\\\
&\propto p(\mathbf{v}_n)\,p(\mathbf{v}_{n-1} \mid \mathbf{v}_n)\cdots p(\mathbf{v}_0 \mid \mathbf{v}_1).
\end{align*}
\]
One can show that the importance weights on this extended space,
\[
w = \frac{\varphi(\mathbf{v}_0, \cdots, \mathbf{v}_n)}{g(\mathbf{v}_0, \cdots, \mathbf{v}_n)}
\]
are exactly those given in Equation (1). Since the marginal distribution of \(\mathbf{v}_0\) under the extended target \(p(\mathbf{v})\) is \(p_0(\mathbf{x})\),
weighting the sampled trajectories by Equation (1) yields valid importance-sampling estimates with respect to \(p_0\).
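Putting the procedure together, here is a minimal sketch under a 1-D toy setup of my own choosing: an unnormalized Gaussian target \(f_0(x) = \exp(-(x-3)^2 / (2 \cdot 0.5^2))\), a \(\mathcal{N}(0, 2^2)\) proposal, geometric bridging, and a few random-walk Metropolis steps per temperature. Function names and tuning constants are illustrative, not canonical.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)

log_f0 = lambda x: -0.5 * ((x - 3.0) / 0.5) ** 2   # unnormalized target, Z_0 = 0.5 * sqrt(2*pi)
proposal = norm(loc=0.0, scale=2.0)                # p_n, normalized (Z_n = 1), easy to sample
log_fn = proposal.logpdf

def ais_run(n_temps=200, n_mcmc=5, step=0.5):
    """One AIS trajectory; returns the final sample v_0 and its log importance weight."""
    betas = np.linspace(1.0, 0.0, n_temps + 1)                # beta_0 = 1 > ... > beta_n = 0
    log_fj = lambda x, b: b * log_f0(x) + (1.0 - b) * log_fn(x)

    v = rng.normal(0.0, 2.0)                                  # v_n ~ p_n
    log_w = 0.0
    for j in range(n_temps, 0, -1):                           # j = n, n-1, ..., 1
        # weight factor f_{j-1}(v_j) / f_j(v_j), evaluated before the transition
        log_w += log_fj(v, betas[j - 1]) - log_fj(v, betas[j])
        # a few random-walk Metropolis steps leaving p_{j-1} invariant
        for _ in range(n_mcmc):
            prop = v + step * rng.normal()
            if np.log(rng.uniform()) < log_fj(prop, betas[j - 1]) - log_fj(v, betas[j - 1]):
                v = prop
    return v, log_w

samples, log_ws = map(np.array, zip(*[ais_run() for _ in range(200)]))
W = np.exp(log_ws - log_ws.max()); W /= W.sum()               # self-normalized weights
print(np.sum(W * samples))                                    # weighted mean, close to 3.0
```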
A key application of AIS is to estimate the ratio of partition functions. Since
\[
Z_0 = \int f_0 (\mathbf{x})d\mathbf{x} = \int \varphi(\mathbf{v})d\mathbf{v}
\]
and
\[
Z_n = \int f_n (\mathbf{x})d\mathbf{x} = \int g(\mathbf{v})d\mathbf{v}
\]
we have:
\[
\begin{align*}
\frac{Z_0}{Z_n} &= \frac{\int \varphi(\mathbf{v})d\mathbf{v}}{\int g(\mathbf{v})d\mathbf{v}} \\\\
&= \frac{\int \frac{\varphi(\mathbf{v})}{g(\mathbf{v})}g(\mathbf{v})d\mathbf{v}} {\int g(\mathbf{v})d\mathbf{v}} \\\\
&= \mathbb{E}_g \left[ \frac{\varphi(\mathbf{v})}{g(\mathbf{v})}\right] \\\\
&\approx \frac{1}{S}\sum_{s=1}^S w^{(s)}.
\end{align*}
\]
where \(w^{(s)} = \varphi(\mathbf{v}^{(s)}) / g(\mathbf{v}^{(s)})\) is the weight of the \(s\)-th trajectory, given by Equation (1), and \(S\) is the number of independent AIS runs.
If \(f_n\) represents the prior and \(f_0\) the unnormalized posterior (prior times likelihood), we can estimate the model evidence (marginal likelihood) \(Z_0 = p(\mathcal{D})\) using the above equation,
provided the normalization constant \(Z_n\) of the prior is known (typically \(Z_n = 1\)).
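Averaging the weights in log-space avoids overflow. The short sketch below reuses the `log_ws` array from the AIS sketch above (my toy setup, where the true value is \(\log(0.5\sqrt{2\pi}) \approx 0.226\)).

```python
import numpy as np
from scipy.special import logsumexp

def log_partition_ratio(log_ws):
    """log(Z_0 / Z_n), estimated as the log of the average AIS weight."""
    return logsumexp(log_ws) - np.log(len(log_ws))

# For the toy setup above, Z_n = 1 and Z_0 = 0.5 * sqrt(2 * pi) ~= 1.2533,
# so log_partition_ratio(log_ws) should land near log(1.2533) ~= 0.226.
```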
The variance of AIS can be reduced by increasing the number of intermediate distributions \(n\). In practice,
\(n = 100\) to \(n = 10000\) steps are common. The optimal choice balances computational cost (proportional to \(n\))
against variance reduction. For example, TensorFlow Probability uses 1000 steps by default.
When implementing AIS in practice, several considerations can dramatically impact performance:
1. Choosing the Annealing Schedule
The choice of \(\{\beta_j\}\) is crucial. Recent research suggests:
- Moment matching: Choose \(\beta_j\) such that the second moments of successive distributions are approximately equal
- Constant ESS: Adapt \(\beta_j\) to maintain ESS \(\approx 0.8 \times N_s\) between steps
- Sigmoid schedules: \(\beta_j = \sigma(a(2j/n - 1))\) where \(\sigma\) is the sigmoid function and \(a\) controls steepness
2. Transition Kernels
The choice and tuning of \(T_j\) significantly affects efficiency:
- Hamiltonian Monte Carlo (HMC): Often most effective for continuous distributions, especially in high dimensions
- Number of MCMC steps: Usually 1-10 steps per temperature is sufficient; more steps increase cost with diminishing returns
- Step size adaptation: Crucial for HMC; can be adapted based on acceptance rate at each temperature
3. Modern Applications in Deep Learning
AIS has become particularly important in deep learning for:
- Evaluating generative models: Estimating log-likelihoods of VAEs, normalizing flows, and energy-based models
- Bayesian deep learning: Computing model evidence for neural network architectures
- Differentiable AIS: Recent work enables gradient-based optimization through AIS estimates using techniques like the reparameterization trick
Diagnostics and Debugging
Importance sampling can fail silently, producing estimates with high bias or variance.
Here are essential diagnostics to monitor:
1. Weight Diagnostics
Monitor the following metrics, computed from the importance weights (a small sketch computing them follows this list):
- Effective Sample Size (ESS): Should be \(> N_s/2\) for reliable estimates
- Largest normalized weight: \(\max_i W_i\) should be \(< 0.1\) to avoid single-sample domination
- Coefficient of variation: \(\text{CV} = \frac{\text{std}(\tilde{w})}{\text{mean}(\tilde{w})}\) should be \(< 1\) for stable estimation
- Weight entropy: \(H(W) = -\sum_i W_i \log W_i\) measures weight uniformity (higher is better)
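Here is a small sketch computing these metrics from log-weights (the function name is my own); working in log-space keeps the normalization stable even when raw weights would overflow.

```python
import numpy as np
from scipy.special import logsumexp

def weight_diagnostics(log_w):
    """ESS, largest normalized weight, coefficient of variation, and weight entropy."""
    log_W = log_w - logsumexp(log_w)          # normalized weights, in log-space
    W = np.exp(log_W)
    w = np.exp(log_w - np.max(log_w))         # rescaled raw weights (CV is scale-invariant)
    return {
        "ESS": 1.0 / np.sum(W ** 2),
        "max_weight": np.max(W),
        "cv": np.std(w) / np.mean(w),
        "entropy": -np.sum(W * log_W),
    }

print(weight_diagnostics(np.random.default_rng(3).normal(size=1000)))  # stand-in log-weights
```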
2. Common Failure Modes and Solutions
| Symptom | Likely Cause | Solution |
| --- | --- | --- |
| ESS ≪ \(N_s\) | Poor proposal choice | Use a heavier-tailed proposal or AIS |
| High variance across runs | Weight degeneracy | Increase samples or improve the proposal |
| Biased estimates | Violated support condition | Ensure \(q(\mathbf{x}) > 0\) wherever \(\pi(\mathbf{x}) > 0\) |
| Numerical instability | Extreme weight ratios | Work in log-space; use the log-sum-exp trick (see the sketch below) |
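On the last point, a tiny illustration of the log-sum-exp trick: subtracting the maximum (done internally by `scipy.special.logsumexp`) gives the same normalized weights without overflow.

```python
import numpy as np
from scipy.special import logsumexp

log_w = np.array([710.0, 705.0, 700.0])        # np.exp(710) alone overflows float64
W = np.exp(log_w - logsumexp(log_w))           # stable normalized weights, sum to 1
print(W)
```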
3. Validation Techniques
To verify your importance sampling implementation (a convergence-check sketch follows this list):
- Known ground truth: Test on problems with analytical solutions
- Convergence check: Estimate should stabilize as \(N_s\) increases
- Multiple proposals: Different valid proposals should give consistent estimates
- Bootstrap confidence intervals: Resample weights to assess uncertainty
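A minimal convergence check on a problem with a known answer, \(\mathbb{E}_{\mathcal{N}(0,1)}[x^2] = 1\) (my own toy choice); the estimate should settle near 1 as \(N_s\) grows.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
phi = lambda x: x**2
target, proposal = norm(0.0, 1.0), norm(0.0, 2.0)

for N_s in [100, 1_000, 10_000, 100_000]:
    x = rng.normal(0.0, 2.0, size=N_s)
    w = target.pdf(x) / proposal.pdf(x)
    print(N_s, np.mean(w * phi(x)))            # direct IS estimate; should stabilize near 1.0
```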