Orthogonality


Inner Product & Orthogonality in \(\mathbb{R}^n\)

The inner product of two vectors \(u, v \in \mathbb{R}^n \) is a scalar defined as: \[ u \cdot v = u^Tv = u_1v_1 + \cdots + u_nv_n. \]
To extend the concept of "length" to higher dimensions, we define the norm of a vector \(v \in \mathbb{R}^n\) as the nonnegative scalar: \[ \|v\| = \sqrt{v \cdot v}. \] This is also called the Euclidean (\(L_2\)) norm.
A vector with length 1 is called a unit vector. Dividing a nonzero vector by its norm, \[ u = \frac{v}{\|v\|}, \] produces a unit vector that keeps the "direction" of the original vector. This process is known as normalization, which is widely used in applications such as machine learning.
Now we can define the distance between two vectors \(u, v \in \mathbb{R}^n \) as \[ \text{dist }(u, v) = \|u - v\|. \] This is known as the Euclidean distance, which measures the difference or similarity between data points.
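The computations above are easy to reproduce numerically. Below is a minimal NumPy sketch (the vectors are arbitrary examples chosen for illustration) that evaluates the inner product, the norm, normalization, and the Euclidean distance:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([4.0, 5.0, 6.0])

inner = u @ v                  # inner product u . v = u^T v
norm_v = np.linalg.norm(v)     # Euclidean (L2) norm ||v||
unit_v = v / norm_v            # normalization: unit vector with the direction of v
dist = np.linalg.norm(u - v)   # Euclidean distance dist(u, v) = ||u - v||

print(inner, norm_v, dist)     # 32.0, ~8.775, ~5.196
```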
To explore orthogonality geometrically, consider the squared distances: \[\begin{align*} \|u - v\|^2 &= (u-v) \cdot (u-v) \\\\ &= u \cdot (u-v) + (-v) \cdot (u - v) \\\\ &= \|u\|^2 - 2u \cdot v + \|v\|^2 \end{align*} \] \[ \begin{align*} \|u - (-v)\|^2 &= (u+v) \cdot (u+v) \\\\ &= u \cdot (u+v) + v \cdot (u + v) \\\\ &= \|u\|^2 + 2u \cdot v + \|v\|^2 \end{align*} \] (Note: \(u\) and \(v\) are perpendicular iff \(\text{dist}(u, v) = \text{dist}(u, -v)\), i.e., iff the two squared distances above are equal.) Thus, if \(u \cdot v = 0\), we recover the Pythagorean theorem: \[ \|u +v \|^2 = \|u\|^2 + \|v\|^2 \]
We define the general orthogonality:
Two vectors \(u, v \in \mathbb{R}^n \) are orthogonal to each other if their inner product satisfies \[u \cdot v =0.\]
If a vector \(v\) is orthogonal to every vector in \(W \subseteq \mathbb{R}^n\), then the vector \(v\) is said to be orthogonal to \(W\). Also, the set of all vectors that are orthogonal to \(W\) is called the orthogonal complement of \(W\), denoted by \(W^{\perp}\).

This orthogonality provides a useful relationship between the column space and null space of a matrix:

Theorem 1: Let \(A \in \mathbb{R}^{m \times n}\). Then, \[(\text{Row } A)^{\perp} = \text{Nul } A \] and \[(\text{Col } A)^{\perp} = \text{Nul } A^T\]
Proof: If \(x \in \text{Nul } A\), then \(Ax =0\), which means that \(x\) is orthogonal to each row of \(A\) because the inner product of each row of \(A\) with \(x\) is zero. Since the rows of \(A\) span the row space of \(A\), \(x\) is orthogonal to \(\text{Row } A \). Conversely, if \(x\) is orthogonal to \(\text{Row } A \), then \(x\) is orthogonal to every row of \(A\), and thus \(Ax =0\), i.e., \(x \in \text{Nul } A\). Therefore, \((\text{Row } A)^{\perp} = \text{Nul } A \).
Next, since the first equation holds for any matrix, we can apply it to \(A^T\). Since \(\text{Row } A^T = \text{Col } A\), we conclude that \((\text{Col } A)^{\perp} = \text{Nul } A^T \).
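As a quick numerical sanity check of Theorem 1, the sketch below builds a small example matrix and verifies that every vector in its null space is orthogonal to its rows; it assumes SciPy's null_space helper is available:

```python
import numpy as np
from scipy.linalg import null_space

# A small rank-1 matrix, so Nul A is 2-dimensional
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])

N = null_space(A)              # columns form an orthonormal basis of Nul A
print(np.allclose(A @ N, 0))   # True: every row of A is orthogonal to Nul A
```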

The Orthogonal Set

A set of vectors \(\{v_1, v_2, \cdots, v_p\}\) in \(\mathbb{R}^n\) is called an orthogonal set if \(v_i \cdot v_j = 0\) whenever \(i \neq j\). Additionally, an orthogonal basis for \(W \subseteq \mathbb{R}^n\) is a basis for \(W\) that forms an orthogonal set.

Theorem 2: Let \(\{v_1, v_2, \cdots, v_p\}\) be an orthogonal basis for \(W \subseteq \mathbb{R}^n\). For each \(y \in W\), the weights in the linear combination \(y = c_1v_1 + \cdots + c_pv_p\) are given by \[c_j = \frac{y \cdot v_j}{v_j \cdot v_j} \qquad (j = 1, \cdots, p)\]
Proof: Since \(v_1\) is orthogonal to all other vectors in the set, \[ \begin{align*} y \cdot v_1 &= (c_1v_1 + \cdots + c_pv_p) \cdot v_1 \\\\ &= c_1(v_1 \cdot v_1) + \cdots + c_p(v_p \cdot v_1) \\\\ &= c_1(v_1 \cdot v_1). \end{align*} \] Because \(v_1\) is a basis vector, it is nonzero, so \(v_1 \cdot v_1\) is nonzero and we can solve this equation for \(c_1\). Similarly, we can compute \(y \cdot v_j\) and solve for \(c_j \quad (j = 2, \cdots, p)\).
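The weight formula from Theorem 2 can be checked directly. The following sketch (with an arbitrary orthogonal basis of \(\mathbb{R}^2\) chosen for illustration) computes the weights and confirms that they reconstruct \(y\):

```python
import numpy as np

# An orthogonal (not orthonormal) basis of R^2: v1 . v2 = 0
v1 = np.array([3.0, 1.0])
v2 = np.array([-1.0, 3.0])
y = np.array([6.0, 4.0])

c1 = (y @ v1) / (v1 @ v1)      # weights from Theorem 2
c2 = (y @ v2) / (v2 @ v2)

print(np.allclose(c1 * v1 + c2 * v2, y))   # True: the weights reconstruct y
```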
A set \(\{u_1, \cdots, u_p\}\) is said to be an orthonormal set if it is an orthogonal set of unit vectors. If \(W\) is the subspace spanned by such a set, then \(\{u_1, \cdots, u_p\}\) is an orthonormal basis for \(W\). Matrices with orthonormal columns are central to many practical applications due to their unique properties.
Theorem 3: A matrix \(U \in \mathbb{R}^{m \times n} \) has orthonormal columns iff \(U^TU = I\).
Proof: \[ \begin{align*} U^TU &= \begin{bmatrix} u_1^T \\ u_2^T \\ \vdots \\ u_n^T \end{bmatrix} \begin{bmatrix} u_1 & u_2 & \cdots & u_n \end{bmatrix} \\\\ &= \begin{bmatrix} u_1^Tu_1 & u_1^Tu_2 & \cdots & u_1^Tu_n \\ u_2^Tu_1 & u_2^Tu_2 & \cdots & u_2^Tu_n \\ \vdots & \vdots & \ddots & \vdots \\ u_n^Tu_1 & u_n^Tu_2 & \cdots & u_n^Tu_n \end{bmatrix} \end{align*} \] The columns of \(U\) are orthogonal iff \[ u_i^Tu_j = 0, i \neq j \quad (i, j = 1, \cdots, n) \] and, the columns of \(U\) have unit length iff \[ u_i^Tu_i = 1, (i = 1, \cdots, n). \] Thus, \(U^TU = I\) and the converse is also true.
Theorem 4: Let \(U \in \mathbb{R}^{m \times n} \) with orthonormal columns, and let \(x, y \in \mathbb{R}^n\). Then
  1. \(\| Ux \| = \| x \| \)
  2. \((Ux) \cdot (Uy) = x \cdot y \)
  3. \((Ux) \cdot (Uy) = 0\) iff \(x \cdot y = 0\)
Proof: For Property 1, \[ \begin{align*} &\| Ux \|^2 = (Ux)^T(Ux) = x^TU^TUx = x^TIx = x^Tx \\\\ &\Longrightarrow \| Ux \| = \| x \| \end{align*} \] For Property 2, \[ \begin{align*} (Ux) \cdot (Uy) &= (Ux)^T(Uy) \\\\ &= x^TU^TUy \\\\ &= x^TIy = x^Ty \\\\ &= x \cdot y \end{align*} \] Note: Properties 1 and 3 also follow directly from Property 2.
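Properties 1 and 2 are easy to verify numerically. The sketch below builds a matrix with orthonormal columns via a QR factorization of a random matrix (a convenience for the example, not part of the theorem) and checks both properties:

```python
import numpy as np

rng = np.random.default_rng(0)
# A 5x3 matrix with orthonormal columns, built from a QR factorization
U, _ = np.linalg.qr(rng.standard_normal((5, 3)))
x = rng.standard_normal(3)
y = rng.standard_normal(3)

print(np.allclose(np.linalg.norm(U @ x), np.linalg.norm(x)))   # Property 1
print(np.allclose((U @ x) @ (U @ y), x @ y))                   # Property 2
```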
These results are particularly significant for square matrices. We define a square invertible matrix \(U\) as an orthogonal matrix if \[U^{-1} = U^T.\] By Theorem 3, such a matrix \(U\) has orthonormal columns (so the name "orthogonal matrix" is something of a misnomer). Notably, the rows of \(U\) also form an orthonormal basis of \(\mathbb{R}^n\).

For example, the 2D rotation matrix is an orthogonal matrix. \[ R_{\theta} = \begin{bmatrix} \cos\theta & - \sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \] Verify: \[ \begin{align*} R_{\theta}^TR_{\theta} &= R_{\theta}^{-1}R_{\theta}\\\\ &= \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} \cos\theta & - \sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}\\\\ &= \begin{bmatrix} \cos^2 \theta + \sin^2 \theta & 0 \\ 0 & \sin^2 \theta + \cos^2 \theta \end{bmatrix}\\\\ &= I \end{align*} \] Also, \[\det R_{\theta} = \cos^2 \theta + \sin^2 \theta = 1\] This property generalizes to any orthogonal matrix \(U\): \[ \begin{align*} 1 = \det I &= \det ( U^TU ) \\\\ &= \det U^T \det U \\\\ &= \det U \det U \\\\ &= (\det U)^2 \end{align*} \] (Note: Since \(U\) is a square matrix, \(\det U^T = \det U\).)
Thus, for any orthogonal matrix \(U\), \[ \det U = \pm 1 \] Lastly, consider the eigenvalues of \(R_{\theta}\). \[ \begin{align*} \det (R_{\theta} - \lambda I ) &= (\cos\theta - \lambda)^2 + \sin^2 \theta \\\\ &= \lambda^2 -2 \lambda \cos\theta +1 \end{align*} \] Solve the characteristic equation: \[ \begin{align*} &\lambda^2 -2 \lambda\cos\theta +1 = 0 \\\\ &\Longrightarrow \lambda = \frac{2\cos\theta \pm 2\sqrt{1 - \cos^2\theta}\, i}{2} = \cos\theta \pm i \sin\theta \end{align*} \] Then, \[ |\lambda | = 1 \]
More generally, let \(x\) be a nonzero eigenvector corresponding to an eigenvalue \(\lambda\) of an orthogonal matrix \(U\). Then \[ \begin{align*} Ux = \lambda x &\Longrightarrow \| Ux \| = |\lambda | \| x \| \\\\ &\Longrightarrow \| x \| = |\lambda | \| x \| \end{align*} \] Since \(\| x \| \neq 0 \), we conclude that \(|\lambda | = 1\).
(Note: \(\| Ux \| = \| x \|\) also holds when \(x\) is a complex vector. For complex vectors, an orthogonal matrix generalizes to a unitary matrix satisfying \(U^*U = I\), where \(U^*\) is the conjugate transpose of \(U\). You can then prove this fact just as we did in Theorem 4. So an orthogonal matrix is simply a unitary matrix with real entries.)
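The facts about \(R_{\theta}\) above can be confirmed in a few lines. This sketch (with \(\theta = 0.7\) chosen arbitrarily) checks orthogonality, the determinant, and that both eigenvalues have modulus 1:

```python
import numpy as np

theta = 0.7                      # an arbitrary rotation angle
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

print(np.allclose(R.T @ R, np.eye(2)))     # R is orthogonal: R^T R = I
print(np.isclose(np.linalg.det(R), 1.0))   # det R = +1 for a rotation
print(np.abs(np.linalg.eigvals(R)))        # both eigenvalues have modulus 1
```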

The Orthogonal Projection

Given a nonzero vector \(u \in \mathbb{R}^n\), we often need to decompose a vector \(y\in \mathbb{R}^n\) into two vectors: \[ y= \hat{y} + z \tag{2} \] where \(\hat{y} = \alpha u\) is a multiple of \(u\) for some scalar \(\alpha\) and the vector \(z\) is orthogonal to \(u\).
For any scalar \(\alpha\), let \(z = y - \alpha u\) so that (2) holds. Then \(y - \hat{y}\) is orthogonal to \(u\) iff \[ (y - \alpha u) \cdot u = y \cdot u - \alpha (u \cdot u) = 0 \] \[ \Longrightarrow \alpha = \frac{y \cdot u}{u \cdot u} \] Thus, the orthogonal projection of \(y\) onto \(u\) is \[ \hat{y} = \frac{y \cdot u}{u \cdot u} u \]
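A minimal sketch of projection onto a line, using arbitrary example vectors, computes \(\alpha\), \(\hat{y}\), and \(z\), and verifies that \(z\) is orthogonal to \(u\):

```python
import numpy as np

u = np.array([2.0, 1.0])        # the line is spanned by u
y = np.array([3.0, 4.0])

alpha = (y @ u) / (u @ u)       # alpha = (y . u) / (u . u)
y_hat = alpha * u               # orthogonal projection of y onto u
z = y - y_hat                   # component of y orthogonal to u

print(y_hat, np.isclose(z @ u, 0.0))   # [4. 2.], True
```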

Theorem 5: Orthogonal Decomposition Let \(W \subseteq \mathbb{R}^n\) be a subspace. Then every \(y \in \mathbb{R}^n\) can be written uniquely in the form \[y= \hat{y} + z\] where \(\hat{y} \in W\) and \(z \in W^{\perp}\). If \(\{u_1, \cdots, u_p\}\) is any orthogonal basis of \(W\), then \[ \hat{y} = \frac{y \cdot u_1}{u_1 \cdot u_1} u_1 + \cdots + \frac{y \cdot u_p}{u_p \cdot u_p} u_p \tag{3} \] and \[z = y - \hat{y}.\] Note: (3) is called the orthogonal projection of \(y\) onto \(W\), denoted by \(\text{proj}_W y\).
Proof: Let \(\{u_1, \cdots, u_p\}\) be any orthogonal basis of \(W\) and \(\hat{y} = \frac{y \cdot u_1}{u_1 \cdot u_1} u_1 + \cdots + \frac{y \cdot u_p}{u_p \cdot u_p} u_p\). Since \(\hat{y}\) is a linear combination of \(u_1, \cdots, u_p\), \(\quad \hat{y} \in W\).
Let \(z = y - \hat{y}\). Because \(u_i\) is orthogonal to each \(u_j\) in the basis for \(W\) with \(i \neq j\), \[ \begin{align*} z \cdot u_i &= (y -\hat{y}) \cdot u_i \\\\ &= y \cdot u_i - (\frac{y \cdot u_i}{u_i \cdot u_i})u_i \cdot u_i - 0 - \cdots - 0 \\\\ &= y \cdot u_i - y \cdot u_i = 0 \end{align*} \] Hence \(z\) is orthogonal to each \(u_i\) in the basis for \(W\). Therefore, \(z\) is orthogonal to every vector in \(W\). In other words, \(z \in W^{\perp}\).
Next, suppose \(y\) can also be written as \(y = \hat{y}_1 + z_1\) where \(\hat{y}_1 \in W \text{ and } z_1 \in W^{\perp}\). Then \[\hat{y} +z = \hat{y}_1 + z_1 \Longrightarrow \hat{y} - \hat{y}_1 = z_1 - z \] This equation implies that the vector \(\hat{y} - \hat{y}_1\) is in both \(W\) and \(W^{\perp}\). Since the only vector in both \(W\) and \(W^{\perp}\) is the zero vector, \(\hat{y} - \hat{y}_1 = 0\). Therefore, \(\hat{y} = \hat{y}_1 \text{ and } z = z_1\).
The orthogonal projection of \(y\) onto \(W\) is the best approximation to \(y\) by elements of \(W\). This fact is significant in practical applications.
Theorem 6: Best Approximation Let \(W \subseteq \mathbb{R}^n\) and \(y \in \mathbb{R}^n\). Also let \(\hat{y}\) be the orthogonal projection of \(y\) onto \(W\). Then \(\hat{y}\) is the closest point in \(W\) to \(y\) such that \[\| y - \hat{y} \| < \| y - v \| \tag{4}\] for all \(v \in W\) distinct from \(\hat{y}\).
Proof: Choose \(v \in W\) distinct from \(\hat{y} \in W\). Then \((\hat{y} - v) \in W\). By the Orthogonal Decomposition Theorem, \((y - \hat{y})\) is orthogonal to \(W\); in particular, \((y - \hat{y})\) is orthogonal to \((\hat{y} - v) \in W\). Consider \[y - v = (y - \hat{y}) + (\hat{y} -v). \] By the Pythagorean Theorem, \[\|y - v\|^2 = \|y - \hat{y}\|^2 + \|\hat{y} -v \|^2.\] Since \(v \neq \hat{y}\), we have \(\|\hat{y} -v \|^2 > 0\), and thus \(\| y - \hat{y} \| < \| y - v \|\).
The formula (3) becomes much simpler when the basis for \(W\) is an orthonormal set as follows:
If \(\{u_1, \cdots, u_p\}\) is an orthonormal basis for \(W \subseteq \mathbb{R}^n\), then \[ \text{proj}_W y = (y \cdot u_1)u_1 + \cdots + (y \cdot u_p)u_p \] If \(U = [u_1, \cdots, u_p]\), then \[ \forall y \in \mathbb{R}^n, \quad \text{proj}_W y = UU^Ty \] because each weight can be written as \(u_1^Ty, \cdots, u_p^Ty\), which are the entries of \(U^Ty\).
If, in addition to having orthonormal columns, \(U\) is an \(n \times n\) square matrix (that is, an orthogonal matrix), then \[ \forall y \in \mathbb{R}^n, \quad \text{proj}_W y = UU^Ty = Iy = y. \] This implies that the column space \(W\) is all of \(\mathbb{R}^n\).
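The formula \(\text{proj}_W y = UU^Ty\) translates directly into code. The sketch below uses the two orthonormal vectors that reappear in the Gram-Schmidt example later (any orthonormal set would do) and checks that the residual is orthogonal to \(W\):

```python
import numpy as np

# Two orthonormal vectors spanning a plane W in R^3
u1 = np.array([1.0, 2.0, 3.0]) / np.sqrt(14.0)
u2 = np.array([-1.0, 5.0, -3.0]) / np.sqrt(35.0)
U = np.column_stack([u1, u2])

y = np.array([1.0, 1.0, 1.0])
proj = U @ (U.T @ y)                      # proj_W y = U U^T y

print(np.allclose(U.T @ (y - proj), 0))   # True: the residual is orthogonal to W
```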

Gram-Schmidt Algorithm

The Gram-Schmidt algorithm generates an orthogonal (or orthonormal) basis for any nonzero subspace of \(\mathbb{R}^n\). Let us explore this process through a simple example.

Consider the basis \(\{x_1, x_2, x_3\}\) for a nonzero subspace \(W\) of \(\mathbb{R}^3\), where: \(x_1 = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix}, \quad x_2 =\begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \text{ and } x_3 = \begin{bmatrix} 0 \\ 0 \\ 1\end{bmatrix} \)
Let \(v_1 = x_1\) and \(W_1 = Span \{v_1\}\). Then \[ \begin{align*} v_2 &= x_2 - \text{proj}_{W_1} x_2 \\\\ &= x_2 - \frac{x_2 \cdot v_1} {v_1 \cdot v_1}v_1 \\\\ &= \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} - \frac{2}{14}\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \\\\ &= \begin{bmatrix} \frac{-1}{7} \\ \frac{5}{7} \\ \frac{-3}{7} \end{bmatrix} \end{align*} \]
Next, \(W_2 = Span \{v_1, v_2\}\) and then \[ \begin{align*} v_3 &= x_3 - \text{proj}_{W_2} x_3 \\\\ &= x_3 - \frac{x_3 \cdot v_1} {v_1 \cdot v_1}v_1 - \frac{x_3 \cdot v_2} {v_2 \cdot v_2}v_2 \\\\ &= \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} - \frac{3}{14}\begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} - \frac{\frac{-3}{7}}{\frac{5}{7}}\begin{bmatrix} \frac{-1}{7} \\ \frac{5}{7} \\ \frac{-3}{7} \end{bmatrix} \\\\ &= \begin{bmatrix} \frac{-3}{10} \\ 0 \\ \frac{1}{10} \end{bmatrix} \end{align*} \] Thus, \(\{v_1, v_2, v_3\}\) is an orthogonal basis for \(W = W_3 = Span\{v_1, v_2, v_3\}\).

Finally, we construct an orthonormal basis for \(W\) by normalizing each \(v_k\): \[ \{u_1 = \begin{bmatrix} \frac{1}{\sqrt{14}} \\ \frac{2}{\sqrt{14}} \\ \frac{3}{\sqrt{14}} \end{bmatrix}, \quad u_2 = \begin{bmatrix} \frac{-1}{\sqrt{35}} \\ \frac{5}{\sqrt{35}} \\ \frac{-3}{\sqrt{35}} \end{bmatrix}, \quad u_3 = \begin{bmatrix} \frac{-3}{\sqrt{10}} \\ 0 \\ \frac{1}{\sqrt{10}} \end{bmatrix} \} \] The Gram-Schmidt algorithm at the \(k\)th step: \[ v_k = x_k - \sum_{j=1}^{k-1} (x_k \cdot u_j)u_j ,\qquad u_k = \frac{v_k}{\| v_k \|} \]
The resulting vectors look different from the original ones. This is because the algorithm systematically redefines each vector to be orthogonal to the previous ones while maintaining the span of the original vectors. Essentially, it transforms the basis into an orthonormal one without changing the space that the vectors describe.
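A compact implementation of the algorithm is sketched below (assuming the input columns are linearly independent, as the algorithm requires); applying it to the example basis reproduces the orthonormal vectors found above, up to rounding:

```python
import numpy as np

def gram_schmidt(X):
    """Return a matrix whose columns are an orthonormal basis of Col X.

    The columns of X are assumed to be linearly independent."""
    Q = []
    for x in X.T:
        v = x - sum((x @ u) * u for u in Q)   # subtract projections onto earlier u_j
        Q.append(v / np.linalg.norm(v))       # normalize v_k to get u_k
    return np.column_stack(Q)

X = np.array([[1.0, 0.0, 0.0],
              [2.0, 1.0, 0.0],
              [3.0, 0.0, 1.0]])               # columns are x_1, x_2, x_3 from the example
Q = gram_schmidt(X)
print(np.allclose(Q.T @ Q, np.eye(3)))        # True: the columns are orthonormal
```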

QR Factorization

The QR factorization is a practical decomposition of matrices with linearly independent columns. It is widely used in numerical applications, such as eigenvalue estimation. (Note: solving the characteristic polynomial exactly is often impossible in practice.)

Let \(A\) be an \(m \times n\) matrix with linearly independent columns. Then \(A\) can be factorized as \(A = QR\), where \(Q\) is an \(m \times n\) matrix whose columns form an orthonormal basis for \(\text{Col } A\), and \(R\) is an \(n \times n\) upper triangular invertible matrix with positive diagonal entries.

Consider \(A = \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 0 & 1 \end{bmatrix}\) again. By the GS algorithm, we have already obtained \(Q\) from the previous example: \[Q = \begin{bmatrix} \frac{1}{\sqrt{14}} & \frac{-1}{\sqrt{35}} & \frac{-3}{\sqrt{10}} \\ \frac{2}{\sqrt{14}} & \frac{5}{\sqrt{35}} & 0 \\ \frac{3}{\sqrt{14}} & \frac{-3}{\sqrt{35}} & \frac{1}{\sqrt{10}} \end{bmatrix}\] Since the columns of \(Q\) are orthonormal, \(Q^TQ=I\). Then, \(Q^TA = Q^TQR = IR = R\). \[ \begin{align*} R = Q^TA &= \begin{bmatrix} \frac{1}{\sqrt{14}} & \frac{2}{\sqrt{14}} & \frac{3}{\sqrt{14}} \\ \frac{-1}{\sqrt{35}} & \frac{5}{\sqrt{35}} & \frac{-3}{\sqrt{35}} \\ \frac{-3}{\sqrt{10}} & \ 0 & \frac{1}{\sqrt{10}} \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 2 & 1 & 0 \\ 3 & 0 & 1 \end{bmatrix} \\\\ &= \begin{bmatrix} \frac{14}{\sqrt{14}} & \frac{2}{\sqrt{14}} & \frac{3}{\sqrt{14}} \\ 0 & \frac{5}{\sqrt{35}} & \frac{-3}{\sqrt{35}} \\ 0 & 0 & \frac{1}{\sqrt{10}} \end{bmatrix} \end{align*} \]
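In practice one rarely forms \(Q\) and \(R\) by hand; a library routine is used instead. The sketch below compares with NumPy's built-in QR (note that np.linalg.qr may flip the signs of some columns of \(Q\) and the corresponding rows of \(R\), so only the product, orthonormality, and triangularity are checked):

```python
import numpy as np

A = np.array([[1.0, 0.0, 0.0],
              [2.0, 1.0, 0.0],
              [3.0, 0.0, 1.0]])

Q, R = np.linalg.qr(A)                    # signs of columns of Q / rows of R may differ
print(np.allclose(Q @ R, A))              # True: the product reconstructs A
print(np.allclose(Q.T @ Q, np.eye(3)))    # True: Q has orthonormal columns
print(np.allclose(np.tril(R, -1), 0))     # True: R is upper triangular
```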

Interactive Visualizer

This interactive visualizer demonstrates fundamental concepts such as inner products, orthogonal decomposition, and orthogonalization, which are essential for understanding many applications in machine learning.

In machine learning, orthogonality plays a critical role in feature engineering, dimensionality reduction, gradient-based optimization, and neural network initialization. Try changing the demo type in the dropdown menu above to explore different aspects of orthogonality!