Recap & Introduction
Principal Component Analysis (PCA) is one of the most fundamental
techniques in machine learning and statistics for dimensionality reduction. It provides a method to reduce the number of
variables in high-dimensional datasets while retaining the most meaningful structure and variation present in the original data.
PCA begins by analyzing the covariance structure of the data. Given a dataset, we compute the covariance matrix to
understand how different features co-vary. Since the covariance matrix is always symmetric and positive semi-definite, it can be
orthogonally diagonalized. The resulting eigenvectors, called
principal components (PCs), form an orthonormal basis that captures the directions of maximum variance. The corresponding
eigenvalues indicate how much of the total variance is captured by each component.
By projecting the data onto the subspace spanned by the top \(k\) principal components (those associated with the largest eigenvalues), PCA
identifies a lower-dimensional representation that retains as much variance as possible. This allows us to reduce dimensionality by discarding
less informative directions (i.e., those with small variance), thereby simplifying the dataset while minimizing information loss.
For example, if a dataset in \(\mathbb{R}^{10}\) has 3 principal components capturing 95% of the total variance, we can project the original data onto a
3D subspace. This projection preserves the dominant patterns and relationships in the data and filters out noise and redundancy. The transformed vectors
in the low-dimensional space are known as latent representations.
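To make this concrete, here is a minimal NumPy sketch of PCA via eigendecomposition of the covariance matrix; the toy dataset, variable names, and choice of \(k\) are illustrative assumptions, not from the text above:

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X (N x D) onto the top-k principal components."""
    Xc = X - X.mean(axis=0)                  # center the data
    S = Xc.T @ Xc / len(Xc)                  # D x D empirical covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)     # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # sort descending by variance
    V = eigvecs[:, order[:k]]                # top-k principal components (D x k)
    retained = eigvals[order[:k]].sum() / eigvals.sum()
    return Xc @ V, retained                  # latent representations (N x k)

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))   # toy correlated data
Z, frac = pca_project(X, 3)
print(Z.shape, f"variance retained: {frac:.1%}")
```

The rows of `Z` are exactly the latent representations described above: coordinates of each sample in the subspace spanned by the top \(k\) principal components.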
While PCA is a powerful tool, it identifies directions of maximum variance using linear combinations of the
original features. As a result, it may fail to uncover complex, nonlinear structures in the data.
Kernel PCA
Given an \(N \times D\) data matrix \(X\), PCA amounts to determining the eigenvectors of the \(D\times D\)
covariance matrix \(X^\top X\) (up to a factor of \(1/N\)). If \(D \gg N\), working with the \(N \times N\) Gram matrix
\(K = XX^\top\) is much more efficient. The Gram matrix is simply the matrix of inner products \(x_i^\top x_j\).
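As a quick sanity check (my own illustration, not part of the original notes), the nonzero eigenvalues of \(X^\top X\) and \(XX^\top\) coincide, which is why the smaller Gram matrix suffices when \(D \gg N\):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 20, 300
X = rng.normal(size=(N, D))
cov_eigs = np.linalg.eigvalsh(X.T @ X)[::-1][:N]   # top-N eigenvalues of the D x D matrix
gram_eigs = np.linalg.eigvalsh(X @ X.T)[::-1]      # all N eigenvalues of the N x N matrix
print(np.allclose(np.sort(cov_eigs), np.sort(gram_eigs)))   # True
```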
Kernel PCA (KPCA) is a nonlinear generalization of classical PCA that uses the
kernel trick, which allows us to replace
inner products \(x_i^\top x_j\) with a kernel function \(K_{ij} = \mathcal{K}(x_i, x_j)\).
Kernel PCA implicitly replaces \(x_i\) with \(\phi(x_i) = \phi_i\). Let \(\Phi\) be the corresponding
design matrix.
Assuming the features are centered, the covariance matrix in feature space is given by:
\[
S_{\phi} = \frac{1}{N} \sum_i \phi_i \phi_i^\top.
\]
The normalized eigenvectors of \(S_{\phi}\) are given by:
\[
V_{kPCA} = \Phi^\top U \Lambda^{-\frac{1}{2}}
\]
where \(U\) is an orthogonal matrix containing the eigenvectors of the Gram matrix \(K = \Phi\Phi^\top\), with the
corresponding eigenvalues in \(\Lambda\).
Since \(\phi_i\) can be infinite-dimensional, we cannot compute \(V_{kPCA}\) explicitly. Instead, we compute the projection
of a test point \(x_*\) onto the principal directions in feature space:
\[
\phi_*^\top V_{kPCA} = \phi_*^\top \Phi^\top U \Lambda^{-\frac{1}{2}} = k_*^\top U \Lambda^{-\frac{1}{2}}
\]
where \(k_* = \left[\mathcal{K}(x_* , x_1), \cdots, \mathcal{K}(x_* , x_N) \right]\).
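Here is a NumPy sketch of this projection, assuming an RBF kernel; the helper names (`rbf`, `kpca_fit`, `kpca_project`) are my own, and centering of the Gram matrix (addressed next) is omitted for brevity:

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """RBF kernel K(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kpca_fit(X, gamma=1.0):
    """Eigendecompose the Gram matrix, K = U Lambda U^T (centering omitted here)."""
    K = np.array([[rbf(xi, xj, gamma) for xj in X] for xi in X])
    eigvals, U = np.linalg.eigh(K)
    order = np.argsort(eigvals)[::-1]
    return U[:, order], eigvals[order]

def kpca_project(x_star, X, U, lam, k, gamma=1.0):
    """Project a test point onto the top-k kernel principal components."""
    k_star = np.array([rbf(x_star, xi, gamma) for xi in X])   # the k_* vector
    return k_star @ U[:, :k] / np.sqrt(lam[:k])               # k_*^T U Lambda^{-1/2}

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
U, lam = kpca_fit(X)
print(kpca_project(X[0], X, U, lam, k=2))
```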
Note that using \(K = \Phi \Phi^\top\) is valid only if \(\mathbb{E}[\phi_i] = 0\). However, the feature space can be
infinite dimensional, so we cannot subtract off the mean. Here, we introduce the double centering trick.
Let the centered feature vector be
\[
\tilde{\phi}_i = \phi(x_i) - \frac{1}{N} \sum_{j =1}^N \phi(x_j).
\]
Its Gram matrix is given by
\[
\tilde{K}_{ij} = \tilde{\phi}_i^\top \tilde{\phi}_j.
\]
By the double centering trick,
\[
\begin{align*}
\tilde{K} &= C_N K C_N \\\\
&= K - \frac{1}{N}JK - \frac{1}{N}KJ + \frac{1}{N^2}JKJ
\end{align*}
\]
where
\[
C_N = I_N - \frac{1}{N}1_N 1_N^\top
\]
is the centering matrix.
Note: \(J\) denotes the \(N \times N\) all-ones matrix: \(J = 1_N 1_N^\top\).
Note: To verify the double centering trick, consider the scalar form:
\[
\begin{align*}
\tilde{K}_{ij} &= \tilde{\phi}_i^\top \tilde{\phi}_j \\\\
&= \left(\phi_i - \frac{1}{N} \sum_{k=1}^N \phi_k \right)^\top \left(\phi_j - \frac{1}{N} \sum_{l=1}^N \phi_l \right) \\\\
&= \phi_i^\top \phi_j - \frac{1}{N} \sum_{l=1}^N \phi_i^\top \phi_l - \frac{1}{N} \sum_{k=1}^N \phi_k^\top \phi_j
+ \frac{1}{N^2} \sum_{k=1}^N \sum_{l =1}^N \phi_k^\top \phi_l.
\end{align*}
\]
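Below is a small numerical check of the double centering trick (the linear-kernel example and the zero-sum test are my own additions): double-centering the Gram matrix of raw features reproduces the Gram matrix of explicitly centered features.

```python
import numpy as np

def double_center(K):
    """Apply the double centering trick: K_tilde = C_N K C_N."""
    N = len(K)
    C = np.eye(N) - np.ones((N, N)) / N      # centering matrix C_N
    return C @ K @ C

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = X @ X.T                                  # linear-kernel Gram matrix
K_tilde = double_center(K)

# Rows and columns of the centered Gram matrix sum to (numerically) zero.
print(np.allclose(K_tilde.sum(axis=0), 0), np.allclose(K_tilde.sum(axis=1), 0))
# And it equals the Gram matrix of the explicitly centered features.
Xc = X - X.mean(axis=0)
print(np.allclose(K_tilde, Xc @ Xc.T))
```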
Autoencoders
Data reconstruction serves as the primary quality control mechanism for dimensionality reduction.
When we compress data from the original high-dimensional feature space \(\mathbb{R}^D\) to low-dimensional space
\(\mathbb{R}^L\) (where \(L < D\)), we need to ensure that the essential structure and information of the original
data is preserved. The reconstruction error provides a quantitative measure of information loss. If reconstruction is poor,
the learned representation is inadequate for the task at hand.
Reconstruction ensures that the learned latent representation \(z = f_e(x)\) captures the
most relevant and meaningful features of the data. The function \(f_e : x \to z\) is called
the encoder. If the decoder \(f_d : z \to x\)
can successfully reconstruct the original input \(x\) from \(z\), it demonstrates that \(z\) contains
sufficient information about the underlying data structure.
We can view PCA as the process of learning linear mappings \(f_e\) and \(f_d\).
The reconstruction function can then be written as \(r(x) = f_d \left(f_e(x) \right)\),
which is trained to minimize the negative log-likelihood \(\mathcal{L}(\theta) = -\log p(x \mid r(x))\), summed over the training data.
We can implement \(f_e\) and \(f_d\) with neural networks; the resulting model is called an autoencoder.
In particular, a linear autoencoder is equivalent to PCA:
- Input: \(x \in \mathbb{R}^D\)
- Hidden units: \(z = W_1 x, \quad W_1 \in \mathbb{R}^{L \times D}, \quad L < D\)
- Output: \(\hat{x} = W_2 z = W_2 W_1 x = Wx, \quad W_2 \in \mathbb{R}^{D \times L}\)
This model is trained by minimizing the squared reconstruction error:
\[
\mathcal{L}(W) = \sum_{n=1}^N \| x_n - W x_n\|_2^2
\]
The optimal \(\hat{W}\) is an orthogonal projection onto the subspace spanned by the first \(L\) eigenvectors of the empirical covariance matrix of the data.
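A rough numerical illustration of this equivalence (the toy data, the plain gradient-descent training loop, and the subspace comparison are all my own assumptions): after training, the column space of the decoder weights should align with the top-\(L\) eigenvectors of the empirical covariance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1])  # known variances
X -= X.mean(axis=0)
N, D, L, lr = len(X), X.shape[1], 2, 1e-2

W1 = rng.normal(scale=0.3, size=(L, D))   # encoder weights
W2 = rng.normal(scale=0.3, size=(D, L))   # decoder weights
for _ in range(5000):
    Z = X @ W1.T                          # latent codes z = W1 x
    E = Z @ W2.T - X                      # reconstruction residuals
    W2 -= lr * E.T @ Z / N                # gradient of the squared error w.r.t. W2
    W1 -= lr * W2.T @ E.T @ X / N         # gradient of the squared error w.r.t. W1

eigvals, V = np.linalg.eigh(X.T @ X / N)
V_top = V[:, np.argsort(eigvals)[::-1][:L]]   # top-L principal components
Q, _ = np.linalg.qr(W2)                       # orthonormal basis of the decoder span
# Singular values near 1 mean the learned subspace matches the PCA subspace.
print(np.linalg.svd(Q.T @ V_top, compute_uv=False))
```

Note that only the subspace matches: unlike PCA, the linear autoencoder's hidden units are not constrained to be orthonormal or ordered by variance.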
In general, it is known that introducing nonlinearities into the autoencoder makes the model strictly more powerful than PCA.
Moreover, autoencoders built with deep learning architectures scale to large datasets and complex structures far better than kernel PCA, which must store and decompose an \(N \times N\) Gram matrix, so
autoencoders have largely replaced kernel PCA for nonlinear dimensionality reduction in practice.
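For contrast, here is a minimal nonlinear autoencoder in PyTorch with a 1D bottleneck, loosely in the spirit of the demo below; the architecture, toy dataset, and hyperparameters are my own illustrative choices, not the demo's actual implementation:

```python
import math
import torch
import torch.nn as nn

# Encoder f_e: R^2 -> R^1 and decoder f_d: R^1 -> R^2 with tanh nonlinearities.
encoder = nn.Sequential(nn.Linear(2, 16), nn.Tanh(), nn.Linear(16, 1))
decoder = nn.Sequential(nn.Linear(1, 16), nn.Tanh(), nn.Linear(16, 2))
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-2)

# Toy ring-shaped data: a 1D curve embedded in R^2, plus a little noise.
theta = torch.rand(512) * 2 * math.pi
X = torch.stack([torch.cos(theta), torch.sin(theta)], dim=1) + 0.05 * torch.randn(512, 2)

for step in range(2000):
    z = encoder(X)                         # 1D latent codes
    x_hat = decoder(z)                     # reconstructions
    loss = ((x_hat - X) ** 2).mean()       # squared reconstruction error
    opt.zero_grad()
    loss.backward()
    opt.step()
print(f"final reconstruction loss: {loss.item():.4f}")
```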
Demo: Kernel PCA and Autoencoders
This interactive demo allows you to explore and compare different dimensionality reduction techniques:
PCA Comparison Tab
Compare how standard PCA and Kernel PCA handle non-linear datasets.
Standard PCA finds linear projections, while Kernel PCA can "unfold" complex structures like circles and spirals
using the kernel trick. Try different kernels (RBF, polynomial) and adjust parameters to see their effects.
Autoencoder Tab
Explore how a neural network autoencoder with a 1D bottleneck learns to compress and reconstruct
2D data. Unlike PCA, which finds linear subspaces, the autoencoder learns non-linear mappings through its hidden layers.
The 1D latent space visualization shows how the network arranges data points along a single dimension, effectively
finding the principal curve through the data.
💡 Tip: Start with the "circles" dataset to see how each method handles non-linear structure.
PCA will struggle, Kernel PCA (with RBF kernel) will separate the circles, and the autoencoder will "unfold"
them into a line.
Denoising Autoencoders
Denoising autoencoders (DAEs) are a regularized variant of standard autoencoders that
add noise to the input during training and then learn to reconstruct the original, uncorrupted data.
This seemingly simple modification has profound theoretical implications.
The training process involves corrupting the input \(x\) to produce \(\tilde{x}\), typically using
Gaussian noise:
\[
p_c (\tilde{x} | x) = \mathcal{N}(\tilde{x} | x, \sigma^2 I)
\]
The model then minimizes the reconstruction error between its output \(r(\tilde{x})\) and the
clean input \(x\):
\[
\ell (x, r(\tilde{x})) = \| e \|_2^2
\]
where \(e(x) = r(\tilde{x}) - x \) is the residual error for a sample \(x\).
A remarkable result is that, as \(\sigma \to 0\), the optimal denoising function satisfies:
\[
e(x) = r(\tilde{x}) - x \approx \sigma^2 \nabla_{x} \log p(x).
\]
This means the denoising autoencoder implicitly learns the score function
(the gradient of the log-density), up to the \(\sigma^2\) scaling; the score defines a vector field over the entire feature space.
At each point, this vector field indicates the direction and magnitude to move toward regions
of higher data density. The reconstruction process follows these vectors, effectively "flowing"
corrupted points back to the data manifold along the steepest ascent of the probability landscape.
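A training sketch for a denoising autoencoder under this Gaussian corruption model (the network architecture, \(\sigma\), and toy ring-shaped dataset are illustrative assumptions):

```python
import math
import torch
import torch.nn as nn

sigma = 0.1
# r(x_tilde): a small MLP mapping corrupted inputs back to clean ones.
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

theta = torch.rand(1024) * 2 * math.pi                  # toy data on a ring (the data manifold)
X = torch.stack([torch.cos(theta), torch.sin(theta)], dim=1)

for step in range(3000):
    X_tilde = X + sigma * torch.randn_like(X)           # sample from p_c(x_tilde | x)
    loss = ((net(X_tilde) - X) ** 2).mean()             # reconstruct the clean x
    opt.zero_grad()
    loss.backward()
    opt.step()

# After training, r(x_tilde) - x_tilde points back toward the ring, approximating
# (up to the sigma^2 scaling) the score near the data manifold.
with torch.no_grad():
    probe = torch.tensor([[1.2, 0.0]])                  # a point slightly off the manifold
    print(net(probe) - probe)                           # should point roughly toward (1, 0)
```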
Data Manifolds in Our Demo:
The autoencoder tab in our demo visualizes data manifolds. When the 2D data is compressed
through a 1D bottleneck, the network must learn the underlying 1D manifold (curve)
that best represents the data. The "Reconstructed" visualization shows points projected onto this learned
manifold. This is the autoencoder discovering and representing the intrinsic lower-dimensional structure of the data.
Note:
Lipschitz continuity is a common regularity condition that ensures a function does not change too rapidly.
Definition:
Given two metric spaces \((X, d_X)\) and \((Y, d_Y)\), a function \(f : X \to Y\) is called
Lipschitz continuous
if there exists a constant \(L > 0\) such that:
\[
\forall \, x_1, x_2 \in X \quad d_Y (f(x_1), f(x_2))\leq L d_X(x_1, x_2)
\]
where \(L\) is called a
Lipschitz constant.
This means the function's output changes at most linearly with respect to the input.
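As a quick worked example (added here for illustration), \(f(x) = \sin x\) on \(\mathbb{R}\) is Lipschitz continuous with constant \(L = 1\): by the mean value theorem, for some \(\xi\) between \(x_1\) and \(x_2\),
\[
|\sin x_1 - \sin x_2| = |\cos \xi| \, |x_1 - x_2| \leq |x_1 - x_2|.
\]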
In autoencoders, Lipschitz continuity in the reconstruction map \(r(x)\) ensures stability — small changes in
input lead to small changes in reconstruction.
In the context of denoising autoencoders, enforcing or assuming Lipschitz continuity makes the
learned vector field well-behaved. It guarantees smooth flows along the data manifold and avoids sharp or unstable
reconstructions, which is essential when approximating the gradient \(\nabla_x \log p(x)\).
Manifolds
Intuitively, a manifold is a topological space that locally resembles Euclidean space near each point.
Imagine a curved surface like a sphere—zooming in on any small patch makes it look flat, like \(\mathbb{R}^2\). In
machine learning, we often assume data lies on a low-dimensional manifold embedded in high-dimensional space.
Definition:
An \(n\)-dimensional manifold \(\mathcal{M}\) is a topological space where every point
\(p \in \mathcal{M}\) has an open neighborhood \(U\) that is homeomorphic to \(\mathbb{R}^n\).
That is, there exists a continuous bijection \(\phi : U \to \mathbb{R}^n \) with continuous inverse.
Note: Two topological spaces \(X\) and \(Y\) are homeomorphic if there exists a function \(f: X \to Y\)
such that \(f\) is a bijection, \(f\) is continuous, and its inverse \(f^{-1}: Y \to X \) is also continuous.
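For a concrete example (added for illustration), the unit circle
\[
S^1 = \{(\cos\theta, \sin\theta) : \theta \in [0, 2\pi)\} \subset \mathbb{R}^2
\]
is a 1-dimensional manifold: every point has an open arc around it that is homeomorphic to an open interval of \(\mathbb{R}\), even though no single such chart covers all of \(S^1\).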
The Manifold Hypothesis: Real-world high-dimensional data (e.g., images, speech) tends to lie on or near a
low-dimensional manifold embedded in the ambient space. For example, each \(64 \times 64\) grayscale face image can be represented
as a point in \(\mathbb{R}^{4096}\), but the set of "realistic" face images occupies only a small, structured region of this space —
likely a nonlinear manifold of much lower dimension, governed by factors such as pose, lighting, expression, and identity.