Introduction
Clustering is one of the core tasks in unsupervised learning. Unlike supervised learning, where models learn from labeled data,
unsupervised learning aims to uncover hidden patterns or structures in data without any predefined labels. Among these tasks, clustering
is particularly important — it groups data points into clusters based on their similarity, revealing the underlying structure of the data.
In previous sections, we explored techniques like Principal Component Analysis (PCA) and autoencoders, which
reduce the dimensionality of high-dimensional data while preserving meaningful information. These dimensionality reduction methods helped us uncover
compact representations of data, often referred to as latent features.
Now that we have the tools to represent complex data in a lower-dimensional space, we turn to the next natural question:
Can we identify meaningful groupings within this reduced space? This is the goal of clustering, which allows us to segment
data into coherent groups — for example, grouping similar images, organizing unlabeled documents, or detecting communities in social networks.
K-means Clustering
We begin with K-means clustering, one of the most widely used and intuitive algorithms.
We assume that there are \(K\) cluster centers \(\mu_k \in \mathbb{R}^D\), for \(k = 1, \cdots, K\).
Each data point \(x_n \in \mathbb{R}^D\) is assigned to its nearest center:
\[
z_n^* = \arg \min_k \| x_n - \mu_k \|_2^2.
\]
Given the assignments, the cluster centers are updated as the mean of points in each cluster:
\[
\mu_k = \frac{1}{N_k} \sum_{n : z_n =k} x_n, \,\text{where } N_k = \sum_{n=1}^N \mathbb{I}[z_n = k].
\]
These two steps — assignment and update — are repeated until convergence.
This process can be viewed as minimizing a cost function known as the distortion, which
is equivalent to the squared Frobenius norm of the reconstruction error:
\[
\begin{align*}
J(M, Z) &= \sum_{n = 1}^N \| x_n - \mu_{z_n} \|_2^2 \\\\
&= \| X - ZM^\top \|_F^2
\end{align*}
\]
where \(X \in \mathbb{R}^{N \times D}\), \(Z \in \{0, 1\} ^{N \times K}\), and \(M \in \mathbb{R}^{D \times K}\) contains
the cluster centers \(\mu_k\) in its columns.
Note: \(Z\) is the one-hot encoding (or dummy encoding) matrix whose entries are
\[
Z_{nk} = \begin{cases}
1, &\text{ if \(x_n\) is assigned to cluster \(k\) } \\
0, &\text{ otherwise}.
\end{cases}
\]
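To make the two alternating steps concrete, here is a minimal NumPy sketch of the assignment and update rules (the function name `kmeans`, the random initialization, and the empty-cluster handling are illustrative choices, not part of the algorithm's definition):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate the assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialize centers by picking K distinct data points at random.
    mu = X[rng.choice(N, size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: z_n = argmin_k ||x_n - mu_k||^2.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        z = dists.argmin(axis=1)
        # Update step: mu_k = mean of the points assigned to cluster k
        # (an empty cluster simply keeps its previous center here).
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    distortion = ((X - mu[z]) ** 2).sum()
    return mu, z, distortion
```

For example, `kmeans(np.random.randn(500, 2), K=3)` returns the centers, the assignments as integer labels, and the final distortion \(J\).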
Vector Quantization (VQ)
A key idea of vector quantization is to compress data by replacing each high-dimensional vector
\(x_n \in \mathbb{R}^D\) with a discrete symbol \(z_n \in \{1, \cdots, K\}\), which is an index into a
codebook of \(K\) prototype vectors, \(\mu_k \in \mathbb{R}^D\).
Each data point is encoded by finding the index of the closest prototype using Euclidean distance:
\[
z_n := f_e (x_n) = \arg \min_k \| x_n - \mu_k \|^2.
\]
The corresponding decoder simply maps the index back to the prototype:
\[
f_d(k) = \mu_k.
\]
The reconstruction of each data point is therefore \(\hat{x}_n = f_d(z_n) = \mu_{z_n}\).
The quality of a codebook is measured using the reconstruction error (also called distortion):
\[
\begin{align*}
J &= \frac{1}{N} \sum_{n =1}^N \| x_n - f_d (f_e (x_n))\|^2 \\\\
&= \frac{1}{N} \sum_{n=1}^N \|x_n - \mu_{z_n} \|^2.
\end{align*}
\]
Up to the constant factor \(1/N\), this is the same cost function minimized by the K-means algorithm; in other words, K-means can be viewed as
learning a VQ codebook by iteratively minimizing the distortion. The difference is mainly one of emphasis: VQ is
aimed at signal compression and encoding, whereas K-means is usually employed for data analysis and clustering tasks.
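As a small illustration of the encoder/decoder view, here is a sketch that assumes the codebook has already been learned (for instance with K-means); the names `vq_encode`, `vq_decode`, and `distortion` are made up for this example:

```python
import numpy as np

def vq_encode(X, codebook):
    """Encoder f_e: index of the nearest prototype for each row of X."""
    dists = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def vq_decode(z, codebook):
    """Decoder f_d: map each index back to its prototype vector."""
    return codebook[z]

def distortion(X, codebook):
    """Average reconstruction error J = (1/N) * sum ||x_n - mu_{z_n}||^2."""
    X_hat = vq_decode(vq_encode(X, codebook), codebook)
    return ((X - X_hat) ** 2).sum(axis=1).mean()
```

Each data point is then stored as a single integer index rather than a \(D\)-dimensional vector, which is where the compression comes from.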
Spectral Clustering
Traditional clustering methods like K-means assume clusters are linearly separable and spherical in shape.
However, many real-world datasets exhibit complex, non-convex structures that are not well-captured by distance alone.
Spectral clustering addresses this limitation by transforming the data into a new space that reflects
the connectivity structure of the data, often represented as a graph.
The idea is to build a similarity graph where each data point is a node, and edges encode pairwise similarities.
Let \(\mathbf{W} \in \mathbb{R}^{N \times N}\) be a symmetric weight matrix such that \(W_{ij} = W_{ji} \geq 0\) measures
the similarity between points \(i\) and \(j\). The degree of node \(i\) is defined as:
\[
d_i = \sum_{j=1}^N W_{ij},
\]
and we define the degree matrix \(\mathbf{D} \in \mathbb{R}^{N \times N}\) as a diagonal matrix with entries \(D_{ii} = d_i\).
The fundamental object in spectral clustering is the graph Laplacian, defined as:
\[
\mathbf{L} = \mathbf{D} - \mathbf{W}.
\]
So, the elements of \(\mathbf{L}\) are given by
\[
L_{ij} = \begin{cases}
d_i &\text{if \(i = j\)} \\
-W_{ij} &\text{if \(i \neq j\) and \(W_{ij} \neq 0\)} \\
0 &\text{otherwise}.
\end{cases}
\]
The graph Laplacian captures the structure of the graph in a way that supports clustering via eigenvector analysis.
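For concreteness, a common (though by no means the only) choice for \(\mathbf{W}\) is a Gaussian (RBF) similarity. The sketch below builds \(\mathbf{W}\), \(\mathbf{D}\), and \(\mathbf{L}\) with NumPy; the bandwidth `sigma` is an illustrative parameter:

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    """W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), with a zero diagonal (no self-loops)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W, where D is the diagonal degree matrix."""
    D = np.diag(W.sum(axis=1))
    return D - W
```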
Theorem:
Let \(G\) be an undirected (possibly weighted) graph with Laplacian \(\mathbf{L} = \mathbf{D} - \mathbf{W}\). Then the number of connected
components of \(G\) is equal to the multiplicity \(k\) of the eigenvalue 0 of \(\mathbf{L}\). Moreover, the eigenspace corresponding
to eigenvalue 0 is spanned by the indicator vectors \(\mathbf{1}_{S_1}, \ldots, \mathbf{1}_{S_k}\), where each \(S_j\) is a
connected component of \(G\).
This makes spectral methods especially powerful for separating well-connected groups in a graph.
Proof:
Let \(G = (V, E)\) be an undirected (possibly weighted) graph with Laplacian \(\mathbf{L} = \mathbf{D} - \mathbf{W}\),
where \(\mathbf{W}\) is a symmetric weight matrix and \(\mathbf{D}\) is the diagonal degree matrix \(D_{ii} = \sum_j W_{ij}\).
For any vector \(\mathbf{f} \in \mathbb{R}^N\), the quadratic form of the Laplacian is:
\[
\mathbf{f}^\top \mathbf{L} \mathbf{f} = \mathbf{f}^\top (\mathbf{D} - \mathbf{W})\mathbf{f}
= \sum_i D_{ii} f_i^2 - \sum_{i,j} W_{ij} f_i f_j.
\]
Since \(D_{ii} = \sum_j W_{ij}\), we can write:
\[
\sum_i D_{ii} f_i^2 = \sum_{i,j} W_{ij} f_i^2.
\]
So the full expression becomes:
\[
\mathbf{f}^\top \mathbf{L} \mathbf{f} = \sum_{i,j} W_{ij} f_i^2 - \sum_{i,j} W_{ij} f_i f_j \tag{1}
\]
Now, since \(\mathbf{W}\) is symmetric, i.e., \(W_{ij} = W_{ji}\), we can write:
\[
\begin{align*}
\sum_{i,j} W_{ij} f_i^2 &= \frac{1}{2} \sum_{i,j} W_{ij}f_i^2 + \frac{1}{2} \sum_{i,j} W_{ji}f_j^2 \\\\
&= \frac{1}{2} \sum_{i,j} W_{ij}(f_i^2 + f_j^2).
\end{align*}
\]
Plugging this into the expression (1):
\[
\begin{align*}
\mathbf{f}^\top \mathbf{L} \mathbf{f} &= \frac{1}{2} \sum_{i,j} W_{ij}(f_i^2 + f_j^2) - \sum_{i,j} W_{ij} f_i f_j \\\\
&= \frac{1}{2} \sum_{i,j} W_{ij}(f_i^2 + f_j^2 - 2f_i f_j) \\\\
&= \boxed{\frac{1}{2} \sum_{i,j} W_{ij} (f_i - f_j)^2.}
\end{align*}
\]
Case 1: \(k = 1\) (the graph is connected)
Suppose \(\mathbf{f} \in \mathbb{R}^N\) is an eigenvector corresponding to eigenvalue 0, i.e., \(\mathbf{L} \mathbf{f} = \mathbf{0}\). Then:
\[
\mathbf{f}^\top \mathbf{L} \mathbf{f} = 0 \quad \Rightarrow \quad \sum_{i,j} W_{ij}(f_i - f_j)^2 = 0.
\]
Since \(W_{ij} \geq 0\), each term \((f_i - f_j)^2\) must be zero wherever \(W_{ij} > 0\). That is, for every edge \((i,j) \in E\), we have:
\[
f_i = f_j.
\]
Because the graph is connected, there is a path between any two vertices. Applying the above equality recursively along paths in the graph implies that:
\[
f_i = f_j \quad \text{for all } i,j.
\]
Therefore, \(\mathbf{f}\) must be constant on all vertices — i.e., \(\mathbf{f} \in \text{span}\{\mathbf{1}\}\). Hence, the nullspace of \(\mathbf{L}\)
is one-dimensional and spanned by the constant vector \(\mathbf{1} = [1, 1, \dots, 1]^\top\). This confirms that the eigenvalue 0 has multiplicity 1.
Case 2: \(k > 1\)
After ordering the vertices according to the connected components they belong to, \(\mathbf{W}\) and \(\mathbf{L}\) can be written in block-diagonal form, where each block corresponds to one connected component.
Specifically,
\[
\mathbf{L} = \begin{bmatrix}
L_1 & 0 & \cdots & 0 \\
0 & L_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & L_k
\end{bmatrix},
\]
where each \(L_i\) is the Laplacian of the \(i\)-th connected component.
From Case 1, each \(L_i\) has nullspace spanned by the constant vector \(\mathbf{1}_{S_i}\)
(the indicator vector on component \(S_i\)). Therefore, the full nullspace of \(\mathbf{L}\) is spanned
by \(\mathbf{1}_{S_1}, \ldots, \mathbf{1}_{S_k}\), and the eigenvalue 0 has multiplicity \(k\).
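The theorem is easy to verify numerically on a toy graph. The sketch below builds a weight matrix with two connected components (a triangle and a single edge) and checks that eigenvalue 0 of \(\mathbf{L}\) has multiplicity 2; the tolerance used to decide which eigenvalues count as zero is an arbitrary choice:

```python
import numpy as np

# Two components: nodes {0, 1, 2} form a triangle, nodes {3, 4} form an edge.
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)  # L is symmetric, so eigh applies

n_zero = int(np.sum(np.abs(eigvals) < 1e-10))
print("multiplicity of eigenvalue 0:", n_zero)  # prints 2, the number of components
```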
Note: The graph Laplacian \(\mathbf{L}\) is symmetric and positive semi-definite: since \(W_{ij} \geq 0\), the boxed identity above shows that
\(\mathbf{f}^\top \mathbf{L} \mathbf{f} \geq 0\) for all \(\mathbf{f} \in \mathbb{R}^N\). This expression is called the
Laplacian quadratic form, or the Dirichlet energy of \(\mathbf{f}\):
\[
\mathbf{f}^\top \mathbf{L} \mathbf{f} = \frac{1}{2} \sum_{i,j} W_{ij}(f_i - f_j)^2.
\]
It measures the smoothness of the function \(\mathbf{f}\) on the graph: it becomes small when adjacent nodes
(i.e., those with \(W_{ij} > 0\)) have similar values.
The Dirichlet energy plays a central role in spectral graph theory. In spectral clustering, minimizing Dirichlet
energy reveals groupings where nodes are strongly connected internally but weakly connected across clusters —
making it a powerful tool for discovering natural partitions in complex data.
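Both the boxed identity and the smoothness interpretation can be checked directly. In the sketch below (a path graph and two hand-picked functions, chosen purely for illustration), a slowly varying \(\mathbf{f}\) has far lower Dirichlet energy than one that flips sign across every edge:

```python
import numpy as np

# A path graph 0 - 1 - 2 - 3 with unit edge weights.
W = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

def dirichlet_energy(f, W):
    """(1/2) * sum_ij W_ij (f_i - f_j)^2."""
    return 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)

f_smooth = np.array([1.0, 1.1, 1.2, 1.3])   # varies slowly along edges
f_rough = np.array([1.0, -1.0, 1.0, -1.0])  # flips sign across every edge

for f in (f_smooth, f_rough):
    assert np.isclose(f @ L @ f, dirichlet_energy(f, W))  # the boxed identity
    print(f @ L @ f)  # about 0.03 for f_smooth, 12.0 for f_rough
```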
In practice, graphs often exhibit irregular structure:
nodes may have highly varying degrees, and clusters are not perfectly block-separated.
This means the raw graph Laplacian \(\mathbf{L} = \mathbf{D} - \mathbf{W}\) can be dominated by high-degree nodes,
leading to unbalanced or misleading spectral embeddings.
To address this, we use the normalized graph Laplacian, which compensates for
differences in node connectivity and ensures that each node contributes more equally to the
spectral structure:
\[
\begin{align*}
\mathbf{L}_{\text{sym}} &= \mathbf{D}^{-\frac{1}{2}} \mathbf{L} \mathbf{D}^{-\frac{1}{2}} \\\\
&= \mathbf{I} - \mathbf{D}^{-\frac{1}{2}} \mathbf{W} \mathbf{D}^{-\frac{1}{2}}
\end{align*}
\]
where \(\mathbf{D}\) is the diagonal degree matrix and \(\mathbf{W}\) is the symmetric weight (similarity) matrix. In this case,
the eigenspace corresponding to eigenvalue 0 is spanned by the vectors \(\mathbf{D}^{\frac{1}{2}} \mathbf{1}_{S_1}, \ldots, \mathbf{D}^{\frac{1}{2}} \mathbf{1}_{S_k}\).
This matrix \(\mathbf{L}_{\text{sym}}\) is also symmetric and positive semi-definite.
It ensures that the resulting spectral embedding is not biased toward high-degree nodes.
The spectral clustering algorithm proceeds as follows (a code sketch is given after the steps):
- Compute the symmetric normalized Laplacian \(\mathbf{L}_{\text{sym}} = \mathbf{I} - \mathbf{D}^{-\frac{1}{2}} \mathbf{W} \mathbf{D}^{-\frac{1}{2}}\).
- Find the \(K\) eigenvectors of \(\mathbf{L}_{\text{sym}}\) corresponding to its \(K\) smallest eigenvalues and stack them column-wise into a matrix \(\mathbf{U} \in \mathbb{R}^{N \times K}\).
- Normalize each row of \(\mathbf{U}\) to have unit norm, forming a matrix \(\mathbf{T} \in \mathbb{R}^{N \times K}\):
\[
T_{i \cdot} = \frac{U_{i \cdot}}{\| U_{i \cdot}\|}.
\]
- Apply K-means clustering to the rows of \(\mathbf{T}\).
- Assign the original data point \(x_i\) to cluster \(k\) if row \(i\) of \(\mathbf{T}\) is assigned to cluster \(k\).
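Putting the steps together, one possible sketch of the pipeline is shown below. It assumes every node has nonzero degree and uses scikit-learn's `KMeans` for the final step, although any K-means implementation (such as the earlier sketch) would do:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, K, seed=0):
    """Normalized spectral clustering on a symmetric similarity matrix W."""
    d = W.sum(axis=1)                      # node degrees (assumed nonzero)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # L_sym = I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    # Eigenvectors for the K smallest eigenvalues (eigh sorts eigenvalues in ascending order).
    _, eigvecs = np.linalg.eigh(L_sym)
    U = eigvecs[:, :K]                     # shape (N, K)
    # Row-normalize U to obtain T.
    T = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Cluster the rows of T; each row's label is the label of the original point.
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(T)
```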
Demo
Note: In random initialization, we simply pick \(K\) data points uniformly at random to serve as the initial cluster centers.
While this method is fast, it can be highly sensitive to outliers or imbalanced data distributions, often leading to poor cluster quality
or slow convergence. To address this, the K-means++ initialization strategy is widely used in practice.
It selects initial centers more carefully by favoring points that are far apart, significantly improving the stability and performance of K-means clustering.
Algorithm: K-means++ Initialization
- Choose the first center \(\mu_1\) uniformly at random from the dataset \(\{x_1, \ldots, x_N\}\).
- For each remaining point \(x_i\), compute its squared distance to the nearest selected center:
\[
D(x_i) = \min_{1 \leq j \leq m} \|x_i - \mu_j\|_2^2
\]
where \(\mu_1, \ldots, \mu_m\) are the \(m\) centers already chosen.
- Choose the next center \(\mu_{m+1}\) from the data points with probability proportional to \(D(x_i)\):
\[
\Pr(x_i \text{ is chosen}) = \frac{D(x_i)}{\sum_j D(x_j)}
\]
- Repeat steps 2-3 until \(K\) centers have been selected.
- Proceed with the standard K-means algorithm using these \(K\) initial centers.
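A direct NumPy translation of these steps might look like the following sketch (the function and variable names are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-means++ initialization: spread the initial centers far apart."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    centers = [X[rng.integers(N)]]  # step 1: first center uniformly at random
    for _ in range(1, K):
        # Step 2: squared distance from each point to its nearest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        dists = np.min((diffs ** 2).sum(axis=2), axis=1)
        # Step 3: sample the next center with probability proportional to D(x_i).
        probs = dists / dists.sum()
        centers.append(X[rng.choice(N, p=probs)])
    return np.array(centers)  # hand these K centers to the standard K-means loop
```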