Introduction
Clustering is one of the core tasks in unsupervised learning. Unlike supervised learning, where models learn from labeled data,
unsupervised learning aims to uncover hidden patterns or structures in data without any predefined labels. Among these tasks, clustering
is particularly important — it groups data points into clusters based on their similarity, revealing the underlying structure of the data.
In previous sections, we explored techniques like Principal Component Analysis (PCA) and autoencoders, which
reduce the dimensionality of high-dimensional data while preserving meaningful information. These dimensionality reduction methods helped us uncover
compact representations of data, often referred to as latent features.
Now that we have the tools to represent complex data in a lower-dimensional space, we turn to the next natural question:
Can we identify meaningful groupings within this reduced space? This is the goal of clustering, which allows us to segment
data into coherent groups — for example, grouping similar images, organizing unlabeled documents, or detecting communities in social networks.
K-means Clustering
We begin with K-means clustering, one of the most widely used and intuitive algorithms.
We assume that there are \(K\) cluster centers \(\mu_k \in \mathbb{R}^D\), for \(k = 1, \cdots, K\).
Each data point \(x_n \in \mathbb{R}^D\) is assigned to its nearest center:
\[
z_n^* = \arg \min_k \| x_n - \mu_k \|_2^2.
\]
Given the assignments, the cluster centers are updated as the mean of points in each cluster:
\[
\mu_k = \frac{1}{N_k} \sum_{n : z_n =k} x_n, \,\text{where } N_k = \sum_{n=1}^N \mathbb{I}[z_n = k].
\]
These two steps — assignment and update — are repeated until convergence.
This process can be viewed as minimizing a cost function known as the distortion, which
is equivalent to the squared Frobenius norm of the reconstruction error:
\[
\begin{align*}
J(M, Z) &= \sum_{n = 1}^N \| x_n - \mu_{z_n} \|_2^2 \\\\
&= \| X - ZM^\top \|_F^2
\end{align*}
\]
where \(X \in \mathbb{R}^{N \times D}\), \(Z \in \{0, 1\} ^{N \times K}\), and \(M \in \mathbb{R}^{D \times K}\) contains
the cluster centers \(\mu_k\) in its columns.
Note: \(Z\) is the one-hot encoding (or dummy encoding) matrix whose entries are
\[
Z_{nk} = \begin{cases}
1, &\text{ if \(x_n\) is assigned to cluster \(k\) } \\
0, &\text{ otherwise}.
\end{cases}
\]
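To make the two alternating steps concrete, here is a minimal NumPy sketch of the assignment and update rules (the function name `kmeans`, the random initialization, and the empty-cluster handling are illustrative choices, not part of the algorithm's definition):

```python
import numpy as np

def kmeans(X, K, n_iters=100, seed=0):
    """Minimal K-means: alternate the assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    # Initialize centers by picking K distinct data points at random.
    mu = X[rng.choice(N, size=K, replace=False)]
    for _ in range(n_iters):
        # Assignment step: z_n = argmin_k ||x_n - mu_k||^2.
        dists = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)  # shape (N, K)
        z = dists.argmin(axis=1)
        # Update step: mu_k = mean of the points assigned to cluster k
        # (an empty cluster simply keeps its previous center here).
        new_mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                           for k in range(K)])
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    distortion = ((X - mu[z]) ** 2).sum()
    return mu, z, distortion
```

For example, `kmeans(np.random.randn(500, 2), K=3)` returns the centers, the assignments as integer labels, and the final distortion \(J\).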
Vector Quantization (VQ)
A key idea of vector quantization is to compress data by replacing each high-dimensional vector
\(x_n \in \mathbb{R}^D\) with a discrete symbol \(z_n \in \{1, \cdots, K\}\), which is an index into a
codebook of \(K\) prototype vectors, \(\mu_k \in \mathbb{R}^D\).
Each data point is encoded by finding the index of the closest prototype using Euclidean distance:
\[
z_n := f_e (x_n) = \arg \min_k \| x_n - \mu_k \|^2.
\]
The corresponding decoder simply maps the index back to the prototype:
\[
f_d(k) = \mu_k.
\]
The reconstruction of each data point is therefore \(\hat{x}_n = f_d(z_n) = \mu_{z_n}\).
The quality of a codebook is measured using the reconstruction error (also called distortion):
\[
\begin{align*}
J &= \frac{1}{N} \sum_{n =1}^N \| x_n - f_d (f_e (x_n))\|^2 \\\\
&= \frac{1}{N} \sum_{n=1}^N \|x_n - \mu_{z_n} \|^2.
\end{align*}
\]
Up to the constant factor \(1/N\), this is the same cost function minimized by the K-means algorithm; in other words, K-means can be viewed as
learning a VQ codebook by iteratively minimizing the distortion. The difference is mainly one of emphasis: VQ is
aimed at signal compression and encoding, whereas K-means is usually employed for data analysis and clustering tasks.
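As a small illustration of the encoder/decoder view, here is a sketch that assumes the codebook has already been learned (for instance with K-means); the names `vq_encode`, `vq_decode`, and `distortion` are made up for this example:

```python
import numpy as np

def vq_encode(X, codebook):
    """Encoder f_e: index of the nearest prototype for each row of X."""
    dists = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def vq_decode(z, codebook):
    """Decoder f_d: map each index back to its prototype vector."""
    return codebook[z]

def distortion(X, codebook):
    """Average reconstruction error J = (1/N) * sum ||x_n - mu_{z_n}||^2."""
    X_hat = vq_decode(vq_encode(X, codebook), codebook)
    return ((X - X_hat) ** 2).sum(axis=1).mean()
```

Each data point is then stored as a single integer index rather than a \(D\)-dimensional vector, which is where the compression comes from.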
Spectral Clustering
Traditional clustering methods like K-means assume clusters are linearly separable and spherical in shape.
However, many real-world datasets exhibit complex, non-convex structures that are not well-captured by distance alone.
Spectral clustering addresses this limitation by transforming the data into a new space that reflects
the connectivity structure of the data, often represented as a graph.
The idea is to build a similarity graph where each data point is a node, and edges encode pairwise similarities.
Let \(\mathbf{W} \in \mathbb{R}^{N \times N}\) be a symmetric weight matrix such that \(W_{ij} = W_{ji} \geq 0\) measures
the similarity between points \(i\) and \(j\). The degree of node \(i\) is defined as:
\[
d_i = \sum_{j=1}^N W_{ij},
\]
and we define the degree matrix \(\mathbf{D} \in \mathbb{R}^{N \times N}\) as a diagonal matrix with entries \(D_{ii} = d_i\).
The fundamental object in spectral clustering is the graph Laplacian, defined as:
\[
\mathbf{L} = \mathbf{D} - \mathbf{W}.
\]
So, the elements of \(\mathbf{L}\) are given by
\[
L_{ij} = \begin{cases}
d_i &\text{if \(i = j\)} \\
-W_{ij} &\text{if \(i \neq j\) and \(W_{ij} \neq 0\)} \\
0 &\text{otherwise}.
\end{cases}
\]
The graph Laplacian captures the structure of the graph in a way that supports clustering via eigenvector analysis.
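For concreteness, a common (though by no means the only) choice for \(\mathbf{W}\) is a Gaussian (RBF) similarity. The sketch below builds \(\mathbf{W}\), \(\mathbf{D}\), and \(\mathbf{L}\) with NumPy; the bandwidth `sigma` is an illustrative parameter:

```python
import numpy as np

def rbf_similarity(X, sigma=1.0):
    """W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), with a zero diagonal (no self-loops)."""
    sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    W = np.exp(-sq_dists / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    return W

def graph_laplacian(W):
    """Unnormalized Laplacian L = D - W, where D is the diagonal degree matrix."""
    D = np.diag(W.sum(axis=1))
    return D - W
```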
Theorem:
Let \(G\) be an undirected (possibly weighted) graph with Laplacian \(\mathbf{L} = \mathbf{D} - \mathbf{W}\). Then the number of connected
components of \(G\) is equal to the multiplicity \(k\) of the eigenvalue 0 of \(\mathbf{L}\). Moreover, the eigenspace corresponding
to eigenvalue 0 is spanned by the indicator vectors \(\mathbf{1}_{S_1}, \ldots, \mathbf{1}_{S_k}\), where each \(S_j\) is a
connected component of \(G\).
This makes spectral methods especially powerful for separating well-connected groups in a graph.
Proof:
Let \(G = (V, E)\) be an undirected (possibly weighted) graph with Laplacian \(\mathbf{L} = \mathbf{D} - \mathbf{W}\),
where \(\mathbf{W}\) is a symmetric weight matrix and \(\mathbf{D}\) is the diagonal degree matrix \(D_{ii} = \sum_j W_{ij}\).
For any vector \(\mathbf{f} \in \mathbb{R}^N\), the quadratic form of the Laplacian is:
\[
\mathbf{f}^\top \mathbf{L} \mathbf{f} = \mathbf{f}^\top (\mathbf{D} - \mathbf{W})\mathbf{f}
= \sum_i D_{ii} f_i^2 - \sum_{i,j} W_{ij} f_i f_j.
\]
Since \(D_{ii} = \sum_j W_{ij}\), we can write:
\[
\sum_i D_{ii} f_i^2 = \sum_{i,j} W_{ij} f_i^2.
\]
So the full expression becomes:
\[
\mathbf{f}^\top \mathbf{L} \mathbf{f} = \sum_{i,j} W_{ij} f_i^2 - \sum_{i,j} W_{ij} f_i f_j \tag{1}
\]
Now, since \(\mathbf{W}\) is symmetric, i.e., \(W_{ij} = W_{ji}\), we can write:
\[
\begin{align*}
\sum_{i,j} W_{ij} f_i^2 &= \frac{1}{2} \sum_{i,j} W_{ij}f_i^2 + \frac{1}{2} \sum_{i,j} W_{ji}f_j^2 \\\\
&= \frac{1}{2} \sum_{i,j} W_{ij}(f_i^2 + f_j^2).
\end{align*}
\]
Plugging this into the expression (1):
\[
\begin{align*}
\mathbf{f}^\top \mathbf{L} \mathbf{f} &= \frac{1}{2} \sum_{i,j} W_{ij}(f_i^2 + f_j^2) - \sum_{i,j} W_{ij} f_i f_j \\\\
&= \frac{1}{2} \sum_{i,j} W_{ij}(f_i^2 + f_j^2 - 2f_i f_j) \\\\
&= \boxed{\frac{1}{2} \sum_{i,j} W_{ij} (f_i - f_j)^2.}
\end{align*}
\]
Case 1: \(k = 1\) (the graph is connected)
Suppose \(\mathbf{f} \in \mathbb{R}^N\) is an eigenvector corresponding to eigenvalue 0, i.e., \(\mathbf{L} \mathbf{f} = \mathbf{0}\). Then:
\[
\mathbf{f}^\top \mathbf{L} \mathbf{f} = 0 \quad \Rightarrow \quad \sum_{i,j} W_{ij}(f_i - f_j)^2 = 0.
\]
Since \(W_{ij} \geq 0\), each term \((f_i - f_j)^2\) must be zero wherever \(W_{ij} > 0\). That is, for every edge \((i,j) \in E\), we have:
\[
f_i = f_j.
\]
Because the graph is connected, there is a path between any two vertices. Applying the above equality recursively along paths in the graph implies that:
\[
f_i = f_j \quad \text{for all } i,j.
\]
Therefore, \(\mathbf{f}\) must be constant on all vertices — i.e., \(\mathbf{f} \in \text{span}\{\mathbf{1}\}\). Hence, the nullspace of \(\mathbf{L}\)
is one-dimensional and spanned by the constant vector \(\mathbf{1} = [1, 1, \dots, 1]^\top\). This confirms that the eigenvalue 0 has multiplicity 1.
Case 2: \(k > 1\)
After ordering the vertices according to the connected components they belong to, \(\mathbf{W}\) and \(\mathbf{L}\) can be written in block-diagonal form, where each block corresponds to one connected component.
Specifically,
\[
\mathbf{L} = \begin{bmatrix}
L_1 & 0 & \cdots & 0 \\
0 & L_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & L_k
\end{bmatrix},
\]
where each \(L_i\) is the Laplacian of the \(i\)-th connected component.
From Case 1, each \(L_i\) has nullspace spanned by the constant vector \(\mathbf{1}_{S_i}\)
(the indicator vector on component \(S_i\)). Therefore, the full nullspace of \(\mathbf{L}\) is spanned
by \(\mathbf{1}_{S_1}, \ldots, \mathbf{1}_{S_k}\), and the eigenvalue 0 has multiplicity \(k\).
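The theorem is easy to verify numerically on a toy graph. The sketch below builds a weight matrix with two connected components (a triangle and a single edge) and checks that eigenvalue 0 of \(\mathbf{L}\) has multiplicity 2; the tolerance used to decide which eigenvalues count as zero is an arbitrary choice:

```python
import numpy as np

# Two components: nodes {0, 1, 2} form a triangle, nodes {3, 4} form an edge.
W = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (0, 2), (3, 4)]:
    W[i, j] = W[j, i] = 1.0

L = np.diag(W.sum(axis=1)) - W
eigvals, eigvecs = np.linalg.eigh(L)  # L is symmetric, so eigh applies

n_zero = int(np.sum(np.abs(eigvals) < 1e-10))
print("multiplicity of eigenvalue 0:", n_zero)  # prints 2, the number of components
```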
Note: The graph Laplacian \(\mathbf{L}\) is symmetric and positive semi-definite: since \(W_{ij} \geq 0\), the boxed identity above shows that
\(\mathbf{f}^\top \mathbf{L} \mathbf{f} \geq 0\) for all \(\mathbf{f} \in \mathbb{R}^N\). This expression is called the
Laplacian quadratic form, or the Dirichlet energy of \(\mathbf{f}\):
\[
\mathbf{f}^\top \mathbf{L} \mathbf{f} = \frac{1}{2} \sum_{i,j} W_{ij}(f_i - f_j)^2.
\]
It measures the smoothness of the function \(\mathbf{f}\) on the graph: it becomes small when adjacent nodes
(i.e., those with \(W_{ij} > 0\)) have similar values.
The Dirichlet energy plays a central role in spectral graph theory. In spectral clustering, minimizing Dirichlet
energy reveals groupings where nodes are strongly connected internally but weakly connected across clusters —
making it a powerful tool for discovering natural partitions in complex data.
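Both the boxed identity and the smoothness interpretation can be checked directly. In the sketch below (a path graph and two hand-picked functions, chosen purely for illustration), a slowly varying \(\mathbf{f}\) has far lower Dirichlet energy than one that flips sign across every edge:

```python
import numpy as np

# A path graph 0 - 1 - 2 - 3 with unit edge weights.
W = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    W[i, j] = W[j, i] = 1.0
L = np.diag(W.sum(axis=1)) - W

def dirichlet_energy(f, W):
    """(1/2) * sum_ij W_ij (f_i - f_j)^2."""
    return 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)

f_smooth = np.array([1.0, 1.1, 1.2, 1.3])   # varies slowly along edges
f_rough = np.array([1.0, -1.0, 1.0, -1.0])  # flips sign across every edge

for f in (f_smooth, f_rough):
    assert np.isclose(f @ L @ f, dirichlet_energy(f, W))  # the boxed identity
    print(f @ L @ f)  # about 0.03 for f_smooth, 12.0 for f_rough
```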
In practice, graphs often exhibit irregular structure:
nodes may have highly varying degrees, and clusters are not perfectly block-separated.
This means the raw graph Laplacian \(\mathbf{L} = \mathbf{D} - \mathbf{W}\) can be dominated by high-degree nodes,
leading to unbalanced or misleading spectral embeddings.
To address this, we use the normalized graph Laplacian, which compensates for
differences in node connectivity and ensures that each node contributes more equally to the
spectral structure:
\[
\begin{align*}
\mathbf{L}_{\text{sym}} &= \mathbf{D}^{-\frac{1}{2}} \mathbf{L} \mathbf{D}^{-\frac{1}{2}} \\\\
&= \mathbf{I} - \mathbf{D}^{-\frac{1}{2}} \mathbf{W} \mathbf{D}^{-\frac{1}{2}}
\end{align*}
\]
where \(\mathbf{D}\) is the diagonal degree matrix and \(\mathbf{W}\) is the symmetric weight (similarity) matrix. In this case,
the eigenspace corresponding to eigenvalue 0 is spanned by the vectors \(\mathbf{D}^{\frac{1}{2}} \mathbf{1}_{S_1}, \ldots, \mathbf{D}^{\frac{1}{2}} \mathbf{1}_{S_k}\).
This matrix \(\mathbf{L}_{\text{sym}}\) is also symmetric and positive semi-definite.
It ensures that the resulting spectral embedding is not biased toward high-degree nodes.
The spectral clustering algorithm proceeds as follows (a code sketch is given after the steps):
- Compute the symmetric normalized Laplacian \(\mathbf{L}_{\text{sym}} = \mathbf{I} - \mathbf{D}^{-\frac{1}{2}} \mathbf{W} \mathbf{D}^{-\frac{1}{2}}\).
- Find the \(K\) eigenvectors of \(\mathbf{L}_{\text{sym}}\) corresponding to its \(K\) smallest eigenvalues and stack them column-wise into a matrix \(\mathbf{U} \in \mathbb{R}^{N \times K}\).
- Normalize each row of \(\mathbf{U}\) to have unit norm, forming a matrix \(\mathbf{T} \in \mathbb{R}^{N \times K}\):
\[
T_{i \cdot} = \frac{U_{i \cdot}}{\| U_{i \cdot}\|}.
\]
- Apply K-means clustering to the rows of \(\mathbf{T}\).
- Assign the original data point \(x_i\) to cluster \(k\) if row \(i\) of \(\mathbf{T}\) is assigned to cluster \(k\).
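Putting the steps together, one possible sketch of the pipeline is shown below. It assumes every node has nonzero degree and uses scikit-learn's `KMeans` for the final step, although any K-means implementation (such as the earlier sketch) would do:

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(W, K, seed=0):
    """Normalized spectral clustering on a symmetric similarity matrix W."""
    d = W.sum(axis=1)                      # node degrees (assumed nonzero)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    # L_sym = I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    # Eigenvectors for the K smallest eigenvalues (eigh sorts eigenvalues in ascending order).
    _, eigvecs = np.linalg.eigh(L_sym)
    U = eigvecs[:, :K]                     # shape (N, K)
    # Row-normalize U to obtain T.
    T = U / np.linalg.norm(U, axis=1, keepdims=True)
    # Cluster the rows of T; each row's label is the label of the original point.
    return KMeans(n_clusters=K, n_init=10, random_state=seed).fit_predict(T)
```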
Demo
Note: In random initialization, we simply pick \(K\) data points uniformly at random to serve as the initial cluster centers.
While this method is fast, it can be highly sensitive to outliers or imbalanced data distributions, often leading to poor cluster quality
or slow convergence. To address this, the K-means++ initialization strategy is widely used in practice.
It selects initial centers more carefully by favoring points that are far apart, significantly improving the stability and performance of K-means clustering.
Algorithm: K-means++ Initialization
- Choose the first center \(\mu_1\) uniformly at random from the dataset \(\{x_1, \ldots, x_N\}\).
- For each remaining point \(x_i\), compute its squared distance to the nearest selected center:
\[
D(x_i) = \min_{1 \leq j \leq m} \|x_i - \mu_j\|_2^2
\]
where \(\mu_1, \ldots, \mu_m\) are the \(m\) centers already chosen.
- Choose the next center \(\mu_{m+1}\) from the data points with probability proportional to \(D(x_i)\):
\[
\Pr(x_i \text{ is chosen}) = \frac{D(x_i)}{\sum_j D(x_j)}
\]
- Repeat steps 2-3 until \(K\) centers have been selected.
- Proceed with the standard K-means algorithm using these \(K\) initial centers.
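A direct NumPy translation of these steps might look like the following sketch (the function and variable names are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """K-means++ initialization: spread the initial centers far apart."""
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    centers = [X[rng.integers(N)]]  # step 1: first center uniformly at random
    for _ in range(1, K):
        # Step 2: squared distance from each point to its nearest chosen center.
        diffs = X[:, None, :] - np.array(centers)[None, :, :]
        dists = np.min((diffs ** 2).sum(axis=2), axis=1)
        # Step 3: sample the next center with probability proportional to D(x_i).
        probs = dists / dists.sum()
        centers.append(X[rng.choice(N, p=probs)])
    return np.array(centers)  # hand these K centers to the standard K-means loop
```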