Gaussian Processes

Probability & Statistics


Introduction

So far, we have discussed parametric models. For example, in deep neural networks, we use flexible function approximators of the form \(f(\boldsymbol{x}; \boldsymbol{\theta})\), where the number of parameters \(\dim(\boldsymbol{\theta})\) is fixed and independent of the size of the training set \(N\). Recall that such models can overfit when \(N\) is too small or underfit when \(N\) is too large. In contrast, we now consider models whose effective dimensionality adapts to the amount of data \(N\); these are called nonparametric models. From an AI systems perspective, most modern architectures are hybrid: they combine a parametric core — such as a Transformer — with nonparametric components used for memory, retrieval, or adaptive layers.

Here, we focus on the Gaussian Process (GP), a Bayesian nonparametric approach. In this framework, we represent uncertainty about the mapping \(f\) by placing a prior distribution over functions, \(p(f)\), and then updating it after observing data to obtain the posterior distribution \(p(f \mid \mathcal{D})\).

Consider a Gaussian random vector \(\boldsymbol{f} = [f_1, f_2, \ldots, f_N]^\top\), which is characterized by its mean \(\boldsymbol{\mu} = \mathbb{E}[\boldsymbol{f}]\) and covariance \(\boldsymbol{\Sigma} = \mathrm{Cov}[\boldsymbol{f}]\).

Now, let \(f : \mathcal{X} \rightarrow \mathbb{R}\) be a function evaluated at a set of input points \(\boldsymbol{X} = \{\boldsymbol{x}_n \in \mathcal{X}\}_{n=1}^N\), and define \(\boldsymbol{f}_X = [f(\boldsymbol{x}_1), f(\boldsymbol{x}_2), \ldots, f(\boldsymbol{x}_N)]^\top\). If \(\boldsymbol{f}_X\) follows a joint Gaussian distribution for any finite set of inputs \(N \ge 1\), then \[ f : \mathcal{X} \rightarrow \mathbb{R} \] is said to follow a Gaussian Process (GP), defined by its mean function \(m(\boldsymbol{x}) = \mathbb{E}[f(\boldsymbol{x})]\) and covariance function \(\mathcal{K}(\boldsymbol{x}, \boldsymbol{x}') = \mathrm{Cov}[f(\boldsymbol{x}), f(\boldsymbol{x}')]\), where \(\mathcal{K}\) is any positive-definite Mercer kernel. A common example is the RBF kernel: \[ \mathcal{K}(\boldsymbol{x}, \boldsymbol{x}') \propto \exp(-\|\boldsymbol{x} - \boldsymbol{x}'\|^2). \]

The corresponding GP prior is denoted as \[ f(\boldsymbol{x}) \sim \mathcal{GP}(m(\boldsymbol{x}), \mathcal{K}(\boldsymbol{x}, \boldsymbol{x}')). \] Here, \(m(\boldsymbol{x}) = \mathbb{E}[f(\boldsymbol{x})]\) and \(\mathcal{K}(\boldsymbol{x}, \boldsymbol{x}') = \mathbb{E}[(f(\boldsymbol{x}) - m(\boldsymbol{x}))(f(\boldsymbol{x}') - m(\boldsymbol{x}'))]\).

For any finite collection of input points \(\boldsymbol{X} = \{\boldsymbol{x}_1, \boldsymbol{x}_2, \ldots, \boldsymbol{x}_N\}\), the corresponding function values follow a multivariate normal distribution: \[ p(\boldsymbol{f}_X \mid \boldsymbol{X}) = \mathcal{N}(\boldsymbol{f}_X \mid \boldsymbol{\mu}_X, \boldsymbol{K}_{X,X}), \] where \(\boldsymbol{\mu}_X = [m(\boldsymbol{x}_1), m(\boldsymbol{x}_2), \ldots, m(\boldsymbol{x}_N)]^\top\) and \(\boldsymbol{K}_{X,X}(i, j) = \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)\).
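
For concreteness, here is a minimal sketch of drawing samples from a GP prior at a finite set of inputs, assuming a zero mean function and the RBF kernel; the grid, length scale, jitter, and number of samples below are illustrative choices.

```python
import numpy as np

def rbf_kernel(X1, X2, ell=1.0):
    """K(x, x') = exp(-||x - x'||^2 / (2 ell^2)) for row-wise inputs."""
    sq_dists = (np.sum(X1**2, axis=1)[:, None]
                + np.sum(X2**2, axis=1)[None, :]
                - 2.0 * X1 @ X2.T)
    return np.exp(-0.5 * sq_dists / ell**2)

X = np.linspace(-3, 3, 100).reshape(-1, 1)   # finite set of N = 100 inputs
K = rbf_kernel(X, X)                         # Gram matrix K_{X,X}
mu = np.zeros(len(X))                        # mean vector mu_X (zero mean function)

# p(f_X | X) = N(mu_X, K_{X,X}); a small jitter keeps the Cholesky factor stable.
L = np.linalg.cholesky(K + 1e-8 * np.eye(len(X)))
samples = mu[:, None] + L @ np.random.randn(len(X), 3)   # three draws from the prior
```

Each column of `samples` is one realization of \(\boldsymbol{f}_X \sim \mathcal{N}(\boldsymbol{\mu}_X, \boldsymbol{K}_{X,X})\); plotting the columns against the inputs shows what typical functions under this prior look like.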

Mercer Kernels

The kernel function defines the notion of similarity between function values at different input points. In a Gaussian process (GP), the choice of kernel is crucial since it determines the smoothness, periodicity, and other structural properties of the functions drawn from the GP prior.

While the kernel function intuitively measures similarity between inputs, not every function can serve as a valid covariance function in a Gaussian process. To ensure that the resulting covariance matrix is positive semidefinite for any finite set of inputs, which is exactly what is required for it to be a valid covariance matrix of a multivariate Gaussian distribution, we restrict our attention to Mercer kernels.

Mercer Kernels: A Mercer kernel, also called a positive definite kernel, is any symmetric function \[ \mathcal{K}: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R} \] such that \[ \sum_{i=1}^N \sum_{j=1}^N c_i c_j \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) \geq 0 \] for all \(N \in \mathbb{N}\), all \(\boldsymbol{x}_i \in \mathcal{X}\), and any choice of coefficients \(c_i \in \mathbb{R}\). For a strictly positive definite kernel, equality holds only when \(c_i = 0\) for all \(i\).
This means that the corresponding Gram matrix \( \boldsymbol{K} = [\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)]_{i,j=1}^N \) is positive semidefinite. Note that this condition does NOT require each individual kernel value \(\mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j)\) to be nonnegative.
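
As a quick numerical illustration (the random inputs below are arbitrary), the eigenvalues of an RBF Gram matrix are nonnegative up to floating-point error:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))                 # 50 arbitrary inputs in R^3

# RBF Gram matrix K_ij = exp(-||x_i - x_j||^2 / 2)
sq_dists = np.sum((X[:, None, :] - X[None, :, :])**2, axis=-1)
K = np.exp(-0.5 * sq_dists)

eigvals = np.linalg.eigvalsh(K)              # K is symmetric, so use eigvalsh
print(eigvals.min() >= -1e-10)               # True: the Gram matrix is PSD
```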

We often focus on stationary kernels, which depend only on the difference between inputs, \(\boldsymbol{r} = \boldsymbol{x} - \boldsymbol{x}'\), and not on their absolute locations. In many cases, only the Euclidean distance between inputs matters: \[ r = \|\boldsymbol{r}\|_2 = \|\boldsymbol{x} - \boldsymbol{x}'\|. \]

As we have seen, one of the most widely used stationary kernels in practice is the radial basis function (RBF) kernel: \[ \begin{align*} \mathcal{K}_\text{RBF}(r ; \ell) &= \exp \left( -\frac{r^2}{2\ell^2}\right) \\\\ &= \exp \left( -\frac{\| \boldsymbol{x} - \boldsymbol{x}'\|^2}{2\ell^2}\right) \end{align*} \] where \(\ell > 0\) is the length-scale parameter controlling how quickly the correlation between points decays with distance. This kernel is also known as the Gaussian kernel.

Furthermore, by replacing Euclidean distance with Mahalanobis distance, we obtain the generalized RBF kernel: \[ \mathcal{K}(\boldsymbol{r}; \boldsymbol{\Sigma}, \sigma^2) = \sigma^2 \exp \left( - \frac{1}{2} \boldsymbol{r}^\top \boldsymbol{\Sigma}^{-1} \boldsymbol{r}\right). \] When \(\boldsymbol{\Sigma}\) is diagonal, i.e. \(\boldsymbol{\Sigma} = \mathrm{diag}(\ell_1^2, \ldots, \ell_D^2)\), we obtain the Automatic Relevance Determination (ARD) kernel: \[ \mathcal{K}_\text{ARD}(\boldsymbol{r}; \boldsymbol{\Sigma}, \sigma^2) = \sigma^2 \exp \left( - \frac{1}{2} \sum_{d=1}^D \frac{1}{\ell_d^2} r_d^2 \right) \] where \(\sigma^2\) is the overall variance and \(\ell_d > 0\) is the characteristic length scale of dimension \(d\), controlling the sensitivity of the function to that input. If a dimension \(d\) is irrelevant, we can set \(\ell_d = \infty\), effectively ignoring that input.
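
Below is a minimal sketch of the ARD kernel, with illustrative length scales; note how a very large \(\ell_d\) effectively switches off dimension \(d\).

```python
import numpy as np

def ard_kernel(X1, X2, lengthscales, sigma2=1.0):
    """K(x, x') = sigma2 * exp(-0.5 * sum_d (x_d - x'_d)^2 / ell_d^2)."""
    diff = X1[:, None, :] - X2[None, :, :]        # pairwise differences r, shape (N, M, D)
    scaled = (diff / lengthscales)**2             # r_d^2 / ell_d^2 per dimension
    return sigma2 * np.exp(-0.5 * scaled.sum(axis=-1))

X = np.random.randn(5, 3)
# A huge length scale on the third dimension makes that input nearly irrelevant.
K = ard_kernel(X, X, lengthscales=np.array([1.0, 0.5, 1e6]))
```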

While RBF and ARD kernels are infinitely smooth and often used as default choices, there are situations where we may want more flexible control over the smoothness of the functions. For example, in Bayesian optimisation or when modelling rougher processes, it is useful to have kernels whose sample paths are only once or twice differentiable.

This motivates the use of Matérn kernels, which form a family of stationary kernels that include the RBF kernel as a limiting case (\(\nu \to \infty\)), while introducing an explicit smoothness parameter \(\nu > 0\) to control the differentiability of the functions. By adjusting \(\nu\), we can interpolate between very rough kernels (like the exponential kernel) and infinitely smooth kernels (like the RBF), giving us fine-grained control over function smoothness.

For parameters \(\nu>0\), \(\ell>0\), the Matérn kernel is \[ \mathcal{K}_\text{Matérn}(r; \nu, \ell ) = \frac{2^{1-\nu}}{\Gamma(\nu)}\left(\frac{\sqrt{2\nu}r}{\ell}\right)^\nu K_\nu\left(\frac{\sqrt{2\nu}r}{\ell}\right) \]

where \(K_\nu\) denotes the modified Bessel function of the second kind. The parameter \(\nu\) controls smoothness: the GP is \(k\)-times mean-square differentiable if and only if \(\nu > k\). Common choices (up to the variance factor \(\sigma^2\)) admit closed forms:

\[ \begin{align*} \nu = \tfrac{1}{2}: \quad \mathcal{K}(r) &= \exp\left(-\frac{r}{\ell}\right) \\\\ \nu = \tfrac{3}{2}: \quad \mathcal{K}(r) &= \left(1 + \frac{\sqrt{3}\,r}{\ell}\right) \exp\left(-\frac{\sqrt{3}\,r}{\ell}\right) \\\\ \nu = \tfrac{5}{2}: \quad \mathcal{K}(r) &= \left(1 + \frac{\sqrt{5}\,r}{\ell} + \frac{5r^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5}\,r}{\ell}\right) \end{align*} \] The \(\nu = \tfrac{1}{2}\) case is the exponential kernel mentioned above.

Note: In Gaussian process regression, the kernel can be multiplied by a positive amplitude parameter \(\sigma^2 > 0\) to scale the variance of the function values: \[ \mathcal{K}(r;\nu,\ell) \;\longrightarrow\; \sigma^2 \mathcal{K}(r;\nu,\ell). \] This scaling changes the magnitude of function fluctuations but does not affect the smoothness or correlation structure of the GP.
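
The sketch below implements the general Bessel-function form of the Matérn kernel (using `scipy.special.kv` for \(K_\nu\)) and checks it against the \(\nu = \tfrac{3}{2}\) closed form above; the length scale and evaluation points are illustrative.

```python
import numpy as np
from scipy.special import gamma, kv

def matern(r, nu, ell):
    """General Matérn kernel via the modified Bessel function K_nu."""
    r = np.asarray(r, dtype=float)
    z = np.sqrt(2.0 * nu) * np.maximum(r, 1e-12) / ell   # avoid 0 * inf at r = 0
    return (2.0**(1.0 - nu) / gamma(nu)) * z**nu * kv(nu, z)

def matern32(r, ell):
    """Closed form for nu = 3/2."""
    s = np.sqrt(3.0) * np.asarray(r, dtype=float) / ell
    return (1.0 + s) * np.exp(-s)

r = np.linspace(0.0, 3.0, 7)
print(np.allclose(matern(r, nu=1.5, ell=1.0), matern32(r, ell=1.0)))   # True
```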

Some functions exhibit repeating or periodic behavior that is not captured by standard RBF or Matérn kernels. To model such patterns, we can use a periodic kernel, which explicitly encodes periodicity into the covariance function.

A common choice for a one-dimensional periodic kernel is:

\[ \mathcal{K}_\text{per}(r; \ell, p) = \exp \Bigg( - \frac{2 \, \sin^2\big(\pi r / p\big)}{\ell^2} \Bigg), \]

where \(p > 0\) is the period of the function and \(\ell > 0\) is the length-scale controlling how quickly correlations decay away from exact multiples of the period.

Periodic kernels are particularly useful in Gaussian process regression for modeling phenomena that repeat regularly over time or space, such as seasonal time series, cyclic patterns, or physical systems with inherent periodicity.
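
A minimal sketch of this kernel (with illustrative period and length scale) is given below; inputs that are exact multiples of the period apart are perfectly correlated.

```python
import numpy as np

def periodic_kernel(x1, x2, ell=1.0, p=1.0):
    """K(x, x') = exp(-2 sin^2(pi |x - x'| / p) / ell^2) for 1-D inputs."""
    r = np.abs(x1[:, None] - x2[None, :])
    return np.exp(-2.0 * np.sin(np.pi * r / p)**2 / ell**2)

x = np.linspace(0.0, 3.0, 200)
K = periodic_kernel(x, x, ell=0.7, p=1.0)

# Points exactly one period apart have correlation 1:
print(np.isclose(periodic_kernel(np.array([0.0]), np.array([1.0]), p=1.0)[0, 0], 1.0))
```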

GP Regression

Suppose we observe a training data set \(\mathcal{D} = \{(\boldsymbol{x}_n, y_n)\}_{n=1}^N\), where each observation follows \[ y_n = f(\boldsymbol{x}_n) + \epsilon_n, \quad \epsilon_n \sim \mathcal{N}(0, \sigma_y^2). \] Under this model, the covariance of the noisy observations is \[ \begin{align*} \mathrm{Cov}[y_i, y_j] &= \mathrm{Cov}[f_i, f_j] + \mathrm{Cov}[\epsilon_i, \epsilon_j] \\\\ &= \mathcal{K}(\boldsymbol{x}_i, \boldsymbol{x}_j) + \sigma_y^2 \, \mathbb{I}(i = j), \end{align*} \] so that \[ \mathrm{Cov}[\boldsymbol{y} \mid \boldsymbol{X}] = \boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N. \] Because a Gaussian process defines a joint Gaussian distribution over any collection of function values, we can express both the noisy training outputs and the noise-free test values together as \[ \begin{bmatrix} \boldsymbol{y} \\ \boldsymbol{f}_* \end{bmatrix} \sim \mathcal{N} \Bigg( \begin{bmatrix} \boldsymbol{\mu}_X \\ \boldsymbol{\mu}_* \end{bmatrix}, \begin{bmatrix} \boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N & \boldsymbol{K}_{X,*} \\\\ \boldsymbol{K}_{X,*}^\top & \boldsymbol{K}_{*,*} \end{bmatrix} \Bigg). \] The posterior predictive distribution at the test points \(\boldsymbol{X}_*\) is obtained by conditioning on the observed data: \[ p(\boldsymbol{f}_* \mid \mathcal{D}, \boldsymbol{X}_*) = \mathcal{N}\big( \boldsymbol{f}_* \mid \boldsymbol{\mu}_{* \mid X}, \boldsymbol{\Sigma}_{* \mid X} \big), \] where \[ \boldsymbol{\mu}_{* \mid X} = \boldsymbol{\mu}_* + \boldsymbol{K}_{X,*}^\top \big(\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N \big)^{-1} (\boldsymbol{y} - \boldsymbol{\mu}_X), \] \[ \boldsymbol{\Sigma}_{* \mid X} = \boldsymbol{K}_{*,*} - \boldsymbol{K}_{X,*}^\top \big(\boldsymbol{K}_{X,X} + \sigma_y^2 \boldsymbol{I}_N \big)^{-1} \boldsymbol{K}_{X,*}. \]
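
The posterior predictive equations translate directly into code. The sketch below assumes a zero mean function, the RBF kernel, and illustrative data and hyperparameters; a Cholesky factorization replaces the explicit matrix inverse for numerical stability.

```python
import numpy as np

def rbf(X1, X2, ell=1.0):
    sq = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2.0 * X1 @ X2.T
    return np.exp(-0.5 * sq / ell**2)

def gp_predict(X, y, X_star, ell=1.0, sigma_y=0.1):
    """Posterior predictive mean and covariance at X_star (zero prior mean)."""
    K = rbf(X, X, ell) + sigma_y**2 * np.eye(len(X))    # K_{X,X} + sigma_y^2 I_N
    K_s = rbf(X, X_star, ell)                           # K_{X,*}
    K_ss = rbf(X_star, X_star, ell)                     # K_{*,*}

    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y)) # (K_{X,X} + sigma_y^2 I)^{-1} y
    V = np.linalg.solve(L, K_s)                         # L^{-1} K_{X,*}

    mu_star = K_s.T @ alpha                             # posterior mean mu_{*|X}
    cov_star = K_ss - V.T @ V                           # posterior covariance Sigma_{*|X}
    return mu_star, cov_star

# Noisy observations of a smooth function, predicted on a denser grid.
X = np.linspace(-3, 3, 20).reshape(-1, 1)
y = np.sin(X).ravel() + 0.1 * np.random.randn(20)
X_star = np.linspace(-4, 4, 100).reshape(-1, 1)
mu_star, cov_star = gp_predict(X, y, X_star)
```

The diagonal of `cov_star` gives pointwise predictive variances for the latent function values \(\boldsymbol{f}_*\); adding \(\sigma_y^2\) to it yields the predictive variance for noisy observations.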