Symmetric Matrices
A symmetric matrix is a square matrix \(A\) such that
\[A^T= A\]
For example, in the context of least-squares solutions, \(\hat{\beta}\) satisfies the normal equations \(X^TX\hat{\beta} = X^TY\).
In this expression, \(X^TX\) is always symmetric. To confirm, let \(X\) be an \(m \times n\) matrix. Then
\[(X^TX)^T = X^T(X^T)^T = X^TX.\]
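As a quick numerical check, here is a minimal NumPy sketch confirming that \(X^TX\) equals its own transpose (the matrix \(X\) below is an arbitrary example, not taken from the text):

```python
import numpy as np

# Arbitrary example matrix (any m x n matrix works).
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

G = X.T @ X                  # Gram matrix X^T X
print(np.allclose(G, G.T))   # True: X^T X is symmetric
```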
An \(n \times n\) matrix \(A\) is said to be orthogonally diagonalizable if there exists
an orthogonal matrix \(P\) and a diagonal matrix \(D\) such that:
\[
A = PDP^T = PDP^{-1}
\]
For this diagonalization, \(A\) must have \(n\) linearly independent and orthonormal eigenvectors.
Importantly, in such cases, \(A\) is symmetric because
\[
\begin{align*}
A^T = (PDP^T)^T &= (P^T)^T D^T P^T \\\\
&= PDP^T \qquad \text{(since } (P^T)^T = P \text{ and } D^T = D\text{)} \\\\
&= A.
\end{align*}
\]
The converse also holds: if \(A\) is symmetric, it is guaranteed to be orthogonally diagonalizable.
Theorem 1:
An \(n \times n\) matrix \(A\) is orthogonally diagonalizable if and only if \(A\) is
a symmetric matrix.
On the Eigenvalues & Eigenvectors page, we considered the example:
\[
\begin{align*}
A &= \begin{bmatrix} 4 & 1 & 1\\ 1 & 4 & 1 \\ 1 & 1 & 4 \\ \end{bmatrix} \\\\
&= PDP^{-1} \\\\
&= \begin{bmatrix} -1 & -1 & 1 \\ 0 & 1 & 1 \\ 1 & 0 & 1 \\ \end{bmatrix}
\begin{bmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 6 \\ \end{bmatrix}
\begin{bmatrix} \frac{-1}{3} & \frac{-1}{3} & \frac{2}{3} \\
\frac{-1}{3} & \frac{2}{3} & \frac{-1}{3} \\
\frac{1}{3} & \frac{1}{3} & \frac{1}{3} \end{bmatrix}.
\end{align*}
\]
Since \(A\) is symmetric, it must be orthogonally diagonalizable. To make \(P\) orthogonal, we need
orthonormal eigenvectors, which we can obtain by applying the Gram-Schmidt algorithm to the columns of \(P\).
Only the pair \(v_1, v_2\) needs adjustment, since \(v_1 \cdot v_2 \neq 0\) while \(v_3\) is already orthogonal to both:
\[
\begin{align*}
v'_2 &= v_2 - \frac{v_2 \cdot v_1}{v_1 \cdot v_1}v_1 \\\\
&= \begin{bmatrix} -1\\ 1\\ 0\\ \end{bmatrix} - \frac{1}{2}\begin{bmatrix} -1\\ 0\\ 1\\ \end{bmatrix} \\\\
&= \begin{bmatrix} \frac{-1}{2}\\ 1 \\ \frac{-1}{2} \end{bmatrix}.
\end{align*}
\]
After normalization, the orthonormal eigenvectors are:
\[
u_1 = \begin{bmatrix} \frac{-1}{\sqrt{2}} \\ 0\\ \frac{1}{\sqrt{2}} \\ \end{bmatrix}
,\,
u_2 = \begin{bmatrix} \frac{-1}{\sqrt{6}}\\ \frac{2}{\sqrt{6}}\\ \frac{-1}{\sqrt{6}}\\ \end{bmatrix},
\, \text{and }
u_3 = \begin{bmatrix} \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} \\ \frac{1}{\sqrt{3}} \\ \end{bmatrix}
\]
Therefore,
\[
\begin{align*}
A &= PDP^T \\\\
&= \begin{bmatrix} \frac{-1}{\sqrt{2}} & \frac{-1}{\sqrt{6}} & \frac{1}{\sqrt{3}} \\
0 & \frac{2}{\sqrt{6}} & \frac{1}{\sqrt{3}}\\
\frac{1}{\sqrt{2}} & \frac{-1}{\sqrt{6}} & \frac{1}{\sqrt{3}} \end{bmatrix}
\begin{bmatrix} 3 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 6 \\ \end{bmatrix}
\begin{bmatrix} \frac{-1}{\sqrt{2}} & 0 & \frac{1}{\sqrt{2}} \\
\frac{-1}{\sqrt{6}} & \frac{2}{\sqrt{6}} & \frac{-1}{\sqrt{6}} \\
\frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} & \frac{1}{\sqrt{3}} \end{bmatrix} \\\\
&= PDP^{-1}
\end{align*}
\]
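A short NumPy sketch can confirm this decomposition numerically; `np.linalg.eigh` is specialized for symmetric matrices and returns orthonormal eigenvectors, so its columns may differ from the \(P\) above by sign or ordering:

```python
import numpy as np

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 4.0, 1.0],
              [1.0, 1.0, 4.0]])

# eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
eigvals, P = np.linalg.eigh(A)
D = np.diag(eigvals)

print(eigvals)                          # [3. 3. 6.]
print(np.allclose(P @ D @ P.T, A))      # True: A = P D P^T
print(np.allclose(P.T @ P, np.eye(3)))  # True: P is orthogonal
```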
Quadratic Forms
A quadratic form on \(\mathbb{R}^n\) is a function
\[
Q(x) = x^TAx \, \in \mathbb{R}
\]
where \(A\) is an \(n\times n\) symmetric matrix.
Let's use our symmetric matrix again:
\[
A = \begin{bmatrix} 4 & 1 & 1\\ 1 & 4 & 1 \\ 1 & 1 & 4 \\ \end{bmatrix}
\]
Then:
\[
\begin{align*}
Q(x) &= x^TAx \\\\
&= 4x_1^2 + 4x_2^2 + 4x_3^2 + 2x_1x_2 + 2x_2x_3 + 2x_3x_1
\end{align*}
\]
We can transform this quadratic form into one with no cross-product terms.
Let \(x = Py\) be a change of variable where \(P\) is an invertible matrix and \(y\) is a
new variable in \(\mathbb{R}^n\). Since \(A\) is symmetric, by Theorem 1, \(A\) is
orthogonally diagonalizable. Thus
\[
\begin{align*}
x^TAx &= (Py)^TA(Py) \\\\
&= y^TP^TAPy \\\\
&= y^T(P^TAP)y \\\\
&= y^T(P^TPDP^TP)y \\\\
&= y^TDy
\end{align*}
\]
where \(y = P^{-1}x = P^Tx\).
Since \(P^T A P = D\), where \(D\) is a diagonal matrix, we get:
\[
\begin{align*}
x^T A x &= y^T D y \\\\
&= \lambda_1 y_1^2 + \lambda_2 y_2^2 + \cdots + \lambda_n y_n^2
\end{align*}
\]
where \(\lambda_1, \lambda_2, \ldots, \lambda_n\) are the eigenvalues of \(A\).
For our example, \(Q(x)\) becomes:
\[
3y_1^2 + 3y_2^2 + 6 y_3^2
\]
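The following sketch checks this numerically for a random vector \(x\), using the orthogonal \(P\) from the eigendecomposition above (so \(y = P^Tx\)):

```python
import numpy as np

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 4.0, 1.0],
              [1.0, 1.0, 4.0]])
eigvals, P = np.linalg.eigh(A)   # orthonormal eigenvectors, eigenvalues [3, 3, 6]

x = np.random.default_rng(0).standard_normal(3)
y = P.T @ x                      # change of variable y = P^T x

q_x = x @ A @ x                  # original quadratic form x^T A x
q_y = eigvals @ y**2             # lambda_1 y_1^2 + lambda_2 y_2^2 + lambda_3 y_3^2
print(np.allclose(q_x, q_y))     # True: no cross-product terms needed in y
```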
Thus, the eigenvalues of \(A\) are directly connected to its quadratic form. First,
we define the classification of quadratic forms.
A quadratic form \(Q(x)\) is classified as:
- Positive definite:
\(\qquad \qquad A \succ 0\) if \(\, \forall x \neq 0, \, Q(x) > 0\).
- Positive semi-definite:
\(\qquad \qquad A \succeq 0\) if \(\, \forall x, \, Q(x) \geq 0\).
- Negative definite:
\(\qquad \qquad A \prec 0\) if \(\, \forall x \neq 0, \, Q(x) < 0\).
- Negative semi-definite:
\(\qquad \qquad A \preceq 0\) if \(\, \forall x, \, Q(x) \leq 0\).
- Indefinite:
\(\qquad \qquad\) otherwise.
Theorem 2:
Let \(A\) be an \(n \times n\) symmetric matrix. Then a quadratic form \(x^TAx\) is
- positive definite iff the eigenvalues of \(A\) are all positive.
- negative definite iff the eigenvalues of \(A\) are all negative.
As mentioned above, an orthogonal change of variable \(x = Py\) transforms \(x^TAx\) into
\[
Q(x) = y^T D y = \lambda_1 y_1^2 + \cdots + \lambda_n y_n^2.
\]
where \(\lambda_1, \ldots, \lambda_n\) are the eigenvalues of \(A\).
Because \(P\) is invertible, there is a one-to-one correspondence between nonzero \(x\) and nonzero \(y\).
So the sign behavior of \(Q(x)\) is determined by the signs of the eigenvalues (the spectrum) of \(A\).
The columns of \(P\) are called the principal axes of the quadratic form \(x^TAx\).
Note: To check whether a matrix \(A\) is positive definite, an effective test is to verify the existence of a
Cholesky factorization \(A = R^TR\), where \(R\) is an upper triangular matrix with positive diagonal entries;
such a factorization exists if and only if \(A\) is symmetric positive definite. This factorization can be viewed
as a modified version of the LU factorization.
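A practical sketch of this test in NumPy: `np.linalg.cholesky` raises `LinAlgError` when the matrix is not positive definite (NumPy returns the lower triangular factor \(L = R^T\)):

```python
import numpy as np

def is_positive_definite(A: np.ndarray) -> bool:
    """Check positive definiteness by attempting a Cholesky factorization."""
    try:
        np.linalg.cholesky(A)   # succeeds iff A is symmetric positive definite
        return True
    except np.linalg.LinAlgError:
        return False

A = np.array([[4.0, 1.0, 1.0],
              [1.0, 4.0, 1.0],
              [1.0, 1.0, 4.0]])
print(is_positive_definite(A))    # True  (eigenvalues 3, 3, 6 are all positive)
print(is_positive_definite(-A))   # False (negative definite)
```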
Singular Value Decomposition (SVD)
Let \(A\) be an \(m \times n\) matrix. The symmetric matrix \(A^TA\) is orthogonally diagonalizable.
Let \(\{v_1, \cdots, v_n\}\) be an orthonormal basis for \(\mathbb{R}^n\) consisting of eigenvectors
of \(A^TA\) with corresponding eigenvalues \(\lambda_1, \cdots, \lambda_n\).
For \(1 \leq i \leq n\), the following holds:
\[
\begin{align*}
\| Av_i \|^2 &= (Av_i)^TAv_i = v_i^T(A^TAv_i) \\\\
&= v_i^T(\lambda_iv_i) \\\\
&= \lambda_i v_i^Tv_i = \lambda_i \quad (\text{since } v_i \text{ is a unit vector}).
\end{align*}
\]
Since \(\| Av_i \|^2 \geq 0\), we conclude that \(\lambda_i \geq 0 \).
We assume the eigenvalues are arranged in descending order:
\[
\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n
\]
For \(1 \leq i \leq n\), the singular values of \(A\) are defined as
\[
\sigma_i = \sqrt{\lambda_i} = \| Av_i \| \geq 0.
\]
If \(A\) has \(r\) nonzero singular values, then \(\{Av_1, \cdots, Av_r\}\) forms an orthogonal basis
for \(\text{Col }A\) and
\[\text{rank } A = \dim \text{ Col }A = r.
\]
Theorem 3: Singular Value Decomposition
Let \(A\) be an \(m \times n\) matrix with rank \(r\). Then there exist an \(m \times n\)
matrix \(\Sigma = \begin{bmatrix} D & 0 \\ 0 & 0 \\ \end{bmatrix}\), where \(D\) is an \(r \times r\)
diagonal matrix whose diagonal entries are the \(r\) nonzero singular values of \(A\), \(\sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > 0\),
an \(m \times m\) orthogonal matrix \(U\), and an \(n \times n\) orthogonal matrix \(V\) such that
\[A = U\Sigma V^T\]
Proof:
Normalizing the orthogonal basis \(\{Av_1, \cdots, Av_r\}\) for \(\text{Col }A\), we obtain an orthonormal basis
\(\{u_1, \cdots, u_r\}\), where
\[
u_i = \frac{1}{\| Av_i \|}Av_i = \frac{1}{\sigma_i}Av_i \quad 1 \leq i \leq r
\]
or,
\[
Av_i = \sigma_i u_i \tag{1}
\]
We extend \(\{u_1, \cdots, u_r\}\) to an orthonormal basis \(\{u_1, \cdots, u_r, u_{r+1}, \cdots, u_m\}\) for \(\mathbb{R}^m\).
Let \(U = \begin{bmatrix}u_1 & \cdots & u_m \end{bmatrix}\) and \(V = \begin{bmatrix}v_1 & \cdots & v_n \end{bmatrix}\).
Clearly, by construction, \(U\) and \(V\) are orthogonal matrices. Also, by Equation (1),
\[AV = \begin{bmatrix} \sigma_1u_1 & \cdots & \sigma_ru_r & 0 & \cdots & 0 \end{bmatrix}.\]
Let \(D\) be a diagonal matrix with diagonal entries \(\sigma_1, \cdots, \sigma_r\) and
let \(\Sigma =\begin{bmatrix} D & 0 \\ 0 & 0 \\ \end{bmatrix}\). Then
\[
\begin{align*}
U\Sigma &= \begin{bmatrix}\sigma_1u_1 & \cdots & \sigma_ru_r & 0 & \cdots & 0\end{bmatrix} \\\\
&= AV
\end{align*}
\]
Finally, since \(V\) is an orthogonal matrix,
\[(U\Sigma ) V^T = (AV)V^T = A(VV^T) = AI = A.\]
Steps for SVD:
- Compute the orthogonal diagonalization of \(A^TA\):
Compute the eigenvalues and eigenvectors of \(A^TA\).
Note: In practice, avoid directly computing \(A^TA\) to prevent numerical errors.
- Construct \(V\) and \(\Sigma\):
Arrange the eigenvalues in descending order; their square roots are the singular values, which form the diagonal of \(\Sigma\).
Use the corresponding unit eigenvectors as the columns of \(V\).
- Construct \(U\):
For nonzero singular values, compute \(u_k = \frac{1}{\sigma_k}Av_k\).
If there are not enough nonzero singular values, extend the orthonormal basis.
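A minimal sketch of these steps in NumPy, compared against the library routine `np.linalg.svd` (the matrix \(A\) is an arbitrary example, and forming \(A^TA\) explicitly is done here only for illustration, as cautioned above):

```python
import numpy as np

A = np.array([[4.0, 11.0, 14.0],
              [8.0,  7.0, -2.0]])

# Steps 1-2: eigendecomposition of A^T A, sorted in descending order.
eigvals, V = np.linalg.eigh(A.T @ A)
order = np.argsort(eigvals)[::-1]
eigvals, V = eigvals[order], V[:, order]
sigma = np.sqrt(np.clip(eigvals, 0.0, None))   # singular values

# Step 3: u_k = (1/sigma_k) A v_k for the nonzero singular values.
r = np.sum(sigma > 1e-12)
U = (A @ V[:, :r]) / sigma[:r]

Sigma = np.zeros_like(A)
Sigma[:r, :r] = np.diag(sigma[:r])
print(np.allclose(U @ Sigma[:r, :] @ V.T, A))  # True: A = U Sigma V^T

# Library routine for comparison (columns may differ by sign).
U_lib, s_lib, Vt_lib = np.linalg.svd(A)
print(np.allclose(s_lib, sigma[:r]))           # True: same singular values
```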
The singular values give us a powerful tool to diagnose the stability of a linear system. The condition number of a matrix \(A\), denoted \(\kappa(A)\), measures how sensitive the output \(Ax\) is to small changes (like floating-point errors) in the input \(x\).
For a general \(m \times n\) matrix \(A\) with rank \(r\), the condition number is defined as the ratio of the largest to the smallest nonzero singular value:
\[
\kappa(A) = \frac{\sigma_1}{\sigma_r}
\]
If \(A\) is a square, invertible matrix, this simplifies to \(\kappa(A) = \sigma_1 / \sigma_n\).
A matrix with a very large condition number (\(\kappa(A) \gg 1\)) is called ill-conditioned. This means the matrix is "almost singular." From a CS and engineering perspective, this is a critical numerical red flag. Solving \(Ax=b\) for an ill-conditioned \(A\) can lead to highly inaccurate results, as small rounding errors in \(b\) or \(A\) get magnified into large errors in the solution \(x\).
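A quick sketch of computing the condition number from the singular values; for the 2-norm this agrees with `np.linalg.cond` (the nearly singular matrix below is an arbitrary example):

```python
import numpy as np

# A nearly singular (ill-conditioned) example matrix.
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-10]])

s = np.linalg.svd(A, compute_uv=False)          # singular values, descending
kappa = s[0] / s[-1]                            # sigma_1 / sigma_n
print(kappa)                                    # ~4e10: ill-conditioned
print(np.isclose(kappa, np.linalg.cond(A, 2)))  # True
```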
Pseudo-inverse via SVD
The Moore-Penrose Pseudo-inverse (\(A^\dagger\)) is most stably defined and computed using the Singular Value Decomposition (SVD).
Given the SVD of \(A = U\Sigma V^T\), where \(U\) is \(m \times m\), \(\Sigma\) is \(m \times n\), and \(V\) is \(n \times n\), the pseudo-inverse is:
\[
A^\dagger = V\Sigma^\dagger U^T
\]
Here, \(\Sigma^\dagger\) is an \(n \times m\) matrix formed by:
- Transposing \(\Sigma\) to get \(\Sigma^T\) (which is \(n \times m\)).
- Inverting every nonzero singular value \(\sigma_i\) on the diagonal to get \(1/\sigma_i\).
If \(\Sigma\) has diagonal entries \(\sigma_1, \dots, \sigma_r > 0\), then \(\Sigma^\dagger\) will have diagonal entries \(1/\sigma_1, \dots, 1/\sigma_r\). All other entries are zero.
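Here is a minimal sketch constructing \(A^\dagger\) from the SVD and comparing it with `np.linalg.pinv` (the example matrix is arbitrary):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 1.0]])   # m = 3, n = 2, rank 2

U, s, Vt = np.linalg.svd(A)            # U: 3x3, s: (2,), Vt: 2x2

# Build Sigma^dagger (n x m): invert nonzero singular values, zero elsewhere.
tol = 1e-12
Sigma_pinv = np.zeros((A.shape[1], A.shape[0]))
Sigma_pinv[:len(s), :len(s)] = np.diag([1/x if x > tol else 0.0 for x in s])

A_pinv = Vt.T @ Sigma_pinv @ U.T                # A^dagger = V Sigma^dagger U^T
print(np.allclose(A_pinv, np.linalg.pinv(A)))   # True
```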
SVD, Pseudo-inverse, and Fundamental Subspaces
This definition has a beautiful geometric interpretation. Recall that \(\hat{x} = A^\dagger b\) is the minimum-norm least-squares solution.
- A least-squares solution \(\hat{x}\) must map to the projection of \(b\) onto \(\text{Col}(A)\).
- The minimum-norm requirement forces the solution \(\hat{x}\) to lie entirely within the Row Space of \(A\), i.e., \(\hat{x} \in \text{Row}(A)\). (Any component in \(\text{Nul}(A)\) would be orthogonal to \(\text{Row}(A)\), adding to the norm of \(\hat{x}\) without changing the result \(A\hat{x}\).)
The SVD beautifully handles this. It decomposes \(A\) into its fundamental subspaces. The pseudo-inverse essentially "inverts" the mapping from \(\text{Row}(A)\) to \(\text{Col}(A)\) and ignores the \(\text{Nul}(A)\).
You can also see that \(A^\dagger A = V\Sigma^\dagger U^T U\Sigma V^T = V\Sigma^\dagger\Sigma V^T\). The matrix \(\Sigma^\dagger\Sigma\) is an \(n \times n\) diagonal matrix with 1s for the first \(r\) entries and 0s after. This makes \(A^\dagger A = V \begin{bmatrix} I_r & 0 \\ 0 & 0 \end{bmatrix} V^T\), which is the orthogonal projection matrix onto the Row Space of \(A\). Similarly, \(A A^\dagger\) is the orthogonal projection matrix onto the Column Space of \(A\).
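We can verify numerically that \(A^\dagger A\) and \(AA^\dagger\) behave as orthogonal projections, i.e., they are symmetric and idempotent (same arbitrary example matrix as above):

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [3.0, 1.0]])
A_pinv = np.linalg.pinv(A)

P_row = A_pinv @ A      # n x n: orthogonal projection onto Row(A)
P_col = A @ A_pinv      # m x m: orthogonal projection onto Col(A)

for P in (P_row, P_col):
    # An orthogonal projection matrix is symmetric and idempotent.
    print(np.allclose(P, P.T), np.allclose(P @ P, P))   # True True
```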
Practical Note: Truncation and Regularization
In real-world applications (CS/ML), we rarely have singular values that are exactly zero. Instead, we have a spectrum of values, some of which are very small.
\[ \sigma_1 \geq \sigma_2 \geq \dots \gg \dots \geq \sigma_n \approx 0 \]
Inverting these tiny singular values (\(1/\sigma_i\)) would introduce massive instability, as we'd be amplifying components related to noise.
The practical computation of the pseudo-inverse involves setting a tolerance \(\epsilon\). We treat any \(\sigma_i < \epsilon\) as zero.
\[
(\sigma_i^\dagger) = \begin{cases} 1/\sigma_i & \text{if } \sigma_i \geq \epsilon \\ 0 & \text{if } \sigma_i < \epsilon \end{cases}
\]
This truncated SVD is the basis for Principal Component Regression (PCR) and is a powerful form of regularization. It explicitly discards the "directions" (singular vectors) that contribute least to the transformation and are most likely to be noise, building a more robust, lower-rank model.
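A sketch of a truncated pseudo-inverse with an explicit tolerance \(\epsilon\), similar in spirit to the `rcond` cutoff used by `np.linalg.pinv` (the nearly rank-deficient matrix below is an arbitrary example):

```python
import numpy as np

def truncated_pinv(A: np.ndarray, eps: float) -> np.ndarray:
    """Pseudo-inverse that treats singular values below eps as exactly zero."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    mask = s >= eps
    s_inv[mask] = 1.0 / s[mask]            # invert only the retained singular values
    return Vt.T @ np.diag(s_inv) @ U.T

# A matrix with one tiny singular value (nearly rank 1).
A = np.array([[1.0, 1.0],
              [1.0, 1.0 + 1e-12]])
print(np.linalg.svd(A, compute_uv=False))   # ~[2.0, 5e-13]
print(truncated_pinv(A, eps=1e-8))          # stable: the noisy direction is discarded
print(truncated_pinv(A, eps=1e-16))         # unstable: huge entries from inverting the tiny singular value
```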
Interactive SVD Visualization
This interactive demo visualizes the Singular Value Decomposition process for 2×2 matrices. SVD expresses any matrix \(A\) as a product \(A = U\Sigma V^T\), which can be understood geometrically as a composition of three transformations:
- \(V^T\): Rotation/reflection using right singular vectors (orthogonal)
- \(\Sigma\): Scaling along the axes (singular values)
- \(U\): Rotation/reflection using left singular vectors (orthogonal)
Use the visualization to see how each component of SVD contributes to the overall transformation:
Try different matrices to observe how SVD works with various transformations:
- Identity: No transformation occurs
- Scaling: Only the singular values differ from 1
- Rotation: U and V are identical rotation matrices
- Symmetric: U = V (special case for symmetric matrices)
- Shear: Combination of rotation and non-uniform scaling
SVD with Fundamental Subspaces
Consider an \(m \times n\) matrix \(A = U\Sigma V^T\).
Let \(u_1, \cdots, u_m\) be the left singular vectors, \(v_1, \cdots, v_n\) be
the right singular vectors, \(\sigma_1, \cdots, \sigma_n\) be the singular values, and \(r\) be the rank of \(A\).
\(\{Av_1, \cdots, Av_r\} = \{\sigma_1u_1, \cdots, \sigma_ru_r\} \) is an orthogonal basis for \(\text{Col }A\) and then
\(\{u_1, \cdots, u_r\}\) is an orthonormal basis for \(\text{Col }A\). Also, \((\text{Col }A)^{\perp} = \text{Nul }A^T \).
Thus, \(\{u_{r+1}, \cdots, u_m\}\) is an orthonormal basis for \(\text{Nul }A^T\).
Since \(\| Av_i \| = \sigma_i = 0\) if and only if \(i > r\), the vectors \(v_{r+1}, \cdots, v_n\) lie in \(\text{Nul }A\)
and span an \((n-r)\)-dimensional subspace of it. Since \(\dim \text{Nul }A = n - \text{rank } A = n - r\),
the set \(\{v_{r+1}, \cdots, v_n\}\) is an orthonormal basis for \(\text{Nul }A\).
Finally, since \((\text{Nul }A)^{\perp} = \text{Col }A^T = \text{Row }A \), \(\{v_1, \cdots, v_r\}\) is an orthonormal basis for \(\text{Row }A\).
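As a sketch, orthonormal bases for all four fundamental subspaces can be read directly off the factors returned by `np.linalg.svd` (the example matrix is arbitrary, with rank 2):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0],
              [1.0, 0.0, 1.0]])           # rank 2

U, s, Vt = np.linalg.svd(A)
r = np.sum(s > 1e-12 * s[0])              # numerical rank

col_A  = U[:, :r]      # orthonormal basis for Col A
nul_At = U[:, r:]      # orthonormal basis for Nul A^T = (Col A)^perp
row_A  = Vt[:r, :].T   # orthonormal basis for Row A
nul_A  = Vt[r:, :].T   # orthonormal basis for Nul A

print(r)                                   # 2
print(np.allclose(A @ nul_A, 0))           # True: columns of nul_A lie in Nul A
print(np.allclose(A.T @ nul_At, 0))        # True: columns of nul_At lie in Nul A^T
```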
Now assume \(A\) is an \(n \times n\) matrix. From the facts above, we obtain additional equivalent conditions for the Invertible Matrix Theorem (See: invertible matrix):
- \((\text{Col }A)^{\perp} = \{0\}\).
- \((\text{Nul }A)^{\perp} = \mathbb{R}^n\).
- \(\text{Row }A = \mathbb{R}^n\).
- \(A\) has \(n\) nonzero singular values.