Least-Squares

Linear Algebra to Algebraic Foundations


Least-Squares Problems

If \(A \in \mathbb{R}^{m \times n}\) and \(b \in \mathbb{R}^m\), a least-squares solution of \(Ax = b\) is an \(\hat{x} \in \mathbb{R}^n\) s.t. \[ \forall x \in \mathbb{R}^n, \quad \| b - A\hat{x} \| \leq \| b - Ax \|. \] We call the norm \(\| b - A\hat{x} \| \) the least-squares error of the approximation.

In practice, we often need a solution to the equation \(Ax = b\) even when none exists (i.e., the system is inconsistent). In that case, we look for an \(x\) that makes \(Ax\) as close to \(b\) as possible. Let \[ \hat{b} = \text{proj}_{\text{Col }A} \, b. \] Since \(\hat{b} \in \text{Col }A\), the equation \(Ax = \hat{b}\) is consistent, and \(\hat{b}\) is the closest point to \(b\) in \(\text{Col }A\). Therefore, there exists a least-squares solution \(\hat{x} \in \mathbb{R}^n\) s.t. \(A\hat{x} = \hat{b}\).

Since \(\hat{b}\) is the orthogonal projection of \(b\) onto \(\text{Col }A\), we can say that \(b - \hat{b} = b - A\hat{x}\) is orthogonal to \(\text{Col }A\). (Note: \(b = A\hat{x} + (b - A\hat{x})\).)

Let \(a_j\) denote the \(j\)-th column of \(A\). Since the residual \(b - A\hat{x}\) is orthogonal to every column of \(A\), \[ \begin{align*} a_j \cdot (b-A\hat{x}) = 0 \quad \text{for all } j &\Longrightarrow a_j^T(b-A\hat{x}) = 0 \quad \text{for all } j \\\\ &\Longrightarrow A^T (b-A\hat{x}) = 0, \end{align*} \] and we obtain \[A^TA\hat{x} = A^Tb.\] So every least-squares solution of \(Ax = b\) must satisfy the normal equations \[A^TAx = A^Tb.\]
If \(A^TA\) is invertible (equivalently, the columns of \(A\) are linearly independent), then \[\hat{x} = (A^TA)^{-1}A^Tb.\] We also obtain the orthogonal projection of \(b\) onto \(\text{Col } A\): \[ \begin{align*} \hat{b} &= \text{proj}_{\text{Col }A} \, b \\\\ &= A\hat{x} = A(A^TA)^{-1}A^Tb. \end{align*} \]

In practice, forming \(A^TA\) explicitly can introduce relatively large errors in \(\hat{x}\). Instead, the QR factorization gives a more reliable result. Let \(A = QR\), where \(Q\) has orthonormal columns and \(R\) is upper triangular and invertible. Then \[ \begin{align*} \hat{x} &= (R^TQ^TQR)^{-1}(R^TQ^T)b \\\\ &= (R^TR)^{-1}(R^TQ^T)b. \tag{1} \end{align*} \] On the other hand, since \(\hat{b}\) is the projection of \(b\) onto \(\text{Col }A = \text{Col }Q\), we have \(Q^T\hat{b} = Q^Tb\), so \[ \begin{align*} A\hat{x} = \hat{b} &\Longrightarrow Q^TQR\hat{x} = Q^Tb \\\\ &\Longrightarrow R\hat{x} = Q^Tb \\\\ &\Longrightarrow \hat{x} = R^{-1}Q^Tb. \end{align*} \] Comparing with (1), \((R^TR)^{-1}R^T = R^{-1}\). Thus the unique least-squares solution can be written as \(\hat{x} = R^{-1}Q^Tb\), but in practice, solving \[R\hat{x} = Q^Tb\] by back-substitution is much faster than computing \(R^{-1}\).
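
As a quick numerical check (a minimal sketch with a randomly generated \(A\) and \(b\)), the NumPy snippet below computes \(\hat{x}\) from the normal equations and from the thin QR factorization, and compares both against `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 3))    # m = 6 equations, n = 3 unknowns (overdetermined)
b = rng.normal(size=6)

# Route 1: normal equations  A^T A x = A^T b
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Route 2: thin QR factorization, then solve the triangular system R x = Q^T b
# (np.linalg.solve is used here for brevity; a dedicated back-substitution
#  routine such as scipy.linalg.solve_triangular would be used in practice)
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# Both agree with NumPy's built-in least-squares solver
x_ref = np.linalg.lstsq(A, b, rcond=None)[0]
print(np.allclose(x_normal, x_ref), np.allclose(x_qr, x_ref))   # True True
```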

Moore-Penrose Pseudo-inverse (\(A^\dagger\))

In the previous section, we derived the normal equations \(A^TA\hat{x} = A^Tb\). If \(A^TA\) is invertible, the unique least-squares solution is \(\hat{x} = (A^TA)^{-1}A^Tb\).

However, this approach has two major problems in practice:

  1. Theoretical Failure: If the columns of \(A\) are linearly dependent (e.g., in an underdetermined system where \(n > m\), or due to collinear features), \(A^TA\) is singular and \((A^TA)^{-1}\) does not exist.
  2. Practical Failure: Even if \(A^TA\) is invertible, it can be ill-conditioned. As defined in Part 9, the condition number \(\kappa(A)\) measures a matrix's numerical sensitivity. The problem is that \(\kappa(A^TA) = \kappa(A)^2\). This squaring of the condition number is catastrophic: a moderately ill-conditioned problem (e.g., \(\kappa(A) = 10^5\)) becomes a hopelessly unstable one (\(\kappa(A^TA) = 10^{10}\)), so solving for \(\hat{x}\) by computing \((A^TA)^{-1}\) is very likely to produce a large numerical error, as the short check after this list illustrates.
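
Here is a short numerical check of the squaring effect, using a synthetic matrix built to have \(\kappa(A) \approx 10^5\) (the sizes and singular values are chosen purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
# Build a 100 x 2 matrix A with singular values 1 and 1e-5, so kappa(A) ~ 1e5
U, _ = np.linalg.qr(rng.normal(size=(100, 2)))   # orthonormal columns
V, _ = np.linalg.qr(rng.normal(size=(2, 2)))     # orthogonal 2 x 2 matrix
A = U @ np.diag([1.0, 1e-5]) @ V.T

print(f"cond(A)     = {np.linalg.cond(A):.3e}")       # ~1e5
print(f"cond(A^T A) = {np.linalg.cond(A.T @ A):.3e}") # ~1e10, the square
```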

Both issues are resolved by using the Moore-Penrose Pseudo-inverse, denoted \(A^\dagger\).

For any matrix \(A\), the Moore-Penrose Pseudo-inverse \(A^\dagger\) is the unique matrix that satisfies the following four algebraic conditions:

  1. \(A A^\dagger A = A\)
  2. \(A^\dagger A A^\dagger = A^\dagger\)
  3. \((A A^\dagger)^T = A A^\dagger\)     (The projection onto Col(A) is symmetric)
  4. \((A^\dagger A)^T = A^\dagger A\)     (The projection onto Row(A) is symmetric)
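
As a quick sanity check (a minimal sketch with a randomly generated, rank-deficient \(A\)), the snippet below verifies that NumPy's `np.linalg.pinv` satisfies all four conditions:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2)) @ rng.normal(size=(2, 4))   # rank 2, so A^T A (4 x 4) is singular
A_dag = np.linalg.pinv(A)

print(np.allclose(A @ A_dag @ A, A))            # 1. A A† A = A
print(np.allclose(A_dag @ A @ A_dag, A_dag))    # 2. A† A A† = A†
print(np.allclose((A @ A_dag).T, A @ A_dag))    # 3. (A A†)^T = A A†
print(np.allclose((A_dag @ A).T, A_dag @ A))    # 4. (A† A)^T = A† A
```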

The Minimum-Norm Least-Squares Solution

The reason this unique matrix is so powerful is that the vector \(\hat{x} = A^\dagger b\) is always the unique minimum-norm least-squares solution to \(Ax=b\).

This means \(\hat{x}\) simultaneously satisfies two conditions:

  1. It is a least-squares solution: \(\| b - A\hat{x} \| \leq \| b - Ax \|\) for all \(x \in \mathbb{R}^n\).
  2. Among all least-squares solutions, it is the one with the smallest norm \(\| \hat{x} \|\).

In machine learning, this is incredibly powerful. For an underdetermined system (e.g., more features than data points), the minimum-norm solution is the "simplest" one, which is a form of implicit regularization.
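
A minimal sketch of that behavior, using a made-up underdetermined system (more unknowns than equations): \(A^\dagger b\) solves \(Ax = b\) exactly and has a norm no larger than any other solution.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(3, 6))     # 3 equations, 6 unknowns: infinitely many solutions
b = rng.normal(size=3)

x_min = np.linalg.pinv(A) @ b                     # minimum-norm solution A† b

# Build another solution by adding a null-space vector of A
null_vec = rng.normal(size=6)
null_vec -= np.linalg.pinv(A) @ (A @ null_vec)    # strip the row-space component
x_other = x_min + null_vec

print(np.allclose(A @ x_min, b), np.allclose(A @ x_other, b))   # both solve Ax = b
print(np.linalg.norm(x_min) <= np.linalg.norm(x_other))         # True
```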

The most stable way to compute \(A^\dagger\) is via the Singular Value Decomposition (SVD), which we will define in Part 9.
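
For a preview, here is a minimal sketch (the helper name `pinv_via_svd` and the tolerance are illustrative) of how \(A^\dagger\) is assembled from the SVD by inverting only the nonzero singular values:

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Pseudo-inverse from the SVD: A = U S V^T  =>  A† = V S† U^T."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    s_inv[s > tol] = 1.0 / s[s > tol]       # invert only the nonzero singular values
    return Vt.T @ np.diag(s_inv) @ U.T

A = np.random.default_rng(4).normal(size=(5, 3))
print(np.allclose(pinv_via_svd(A), np.linalg.pinv(A)))   # True
```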

Linear Regression

Given a set of observed data points \(\{(x_i, y_i)\}_{i = 1}^{n}\) where \(x_i \in \mathbb{R}^d, \quad y_i \in \mathbb{R}\), we assume that the given data can be explained by the linear model: \[Y = X\beta + \epsilon \] where

  * \(Y \in \mathbb{R}^n\) is the vector of observed values \(y_i\),
  * \(X\) is the \(n \times (d+1)\) design matrix whose \(i\)-th row contains a leading \(1\) (for the intercept) followed by the features of \(x_i\),
  * \(\beta \in \mathbb{R}^{d+1}\) is the vector of parameters (weights), and
  * \(\epsilon \in \mathbb{R}^n\) is the vector of residuals (errors).

The dimension \(d\) is the number of features, and \(n\) is the number of data points. The residual vector \(\epsilon\) collects the differences between each observed value \(y_i\) and its corresponding predicted value. The least-squares hyperplane is the set of predicted \(y\) values based on the estimated parameters (weights) \(\hat{\beta}\), which must satisfy the normal equations: \[X^TX\hat{\beta} = X^TY\]

Note: the linear model is linear in the parameters \(\beta\), not in \(X\). We can choose any non-linear transformation for each \(x_i\). For example, \(y = \beta_0 + \beta_1 x^2 + \beta_2 \sin(2\pi x)\) is still a linear model. Thus, \(Y\) is modeled as a linear combination of features (predictors) \(X\) with respect to the coefficients (weights) \(\beta\).
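
As an illustration of "linear in the parameters," the sketch below fits exactly that example model to synthetic data by building the design matrix from the nonlinear features \(x^2\) and \(\sin(2\pi x)\):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0.0, 1.0, size=50)
y = 1.0 + 2.0 * x**2 - 0.5 * np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=50)

# Design matrix: nonlinear in x, but linear in the parameters beta
X = np.column_stack([np.ones_like(x), x**2, np.sin(2 * np.pi * x)])

# lstsq solves the least-squares problem (internally via SVD, which is more
# stable than forming X^T X explicitly)
beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
print(beta_hat)   # close to the true coefficients [1.0, 2.0, -0.5]
```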

There are many more details to cover on this topic; I leave them for Section III - Probability & Statistics - Linear Regression.
Finally, note that \(X^TX\) is a symmetric matrix. We will learn about such special matrices in the next part.

Interactive Linear Regression Demo

Linear regression finds the best-fitting line or curve for a set of data points by minimizing the sum of squared differences between observed and predicted values. This interactive demo lets you explore how polynomial regression works with different datasets.

Try adding your own data points by clicking on the plot, or use the example datasets to see how different polynomial degrees fit various patterns. The demo visually demonstrates key concepts from least-squares theory and the normal equations.
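
For reference, here is a minimal sketch of the kind of polynomial fit the demo performs (with synthetic data and arbitrarily chosen degrees), using a Vandermonde design matrix and a least-squares solve:

```python
import numpy as np

rng = np.random.default_rng(6)
x = np.linspace(-1.0, 1.0, 30)
y = x**3 - x + 0.05 * rng.normal(size=30)

for degree in (1, 3, 9):
    X = np.vander(x, degree + 1)                 # columns x^degree, ..., x, 1
    coeffs = np.linalg.lstsq(X, y, rcond=None)[0]
    sse = np.sum((X @ coeffs - y) ** 2)          # sum of squared residuals
    print(degree, sse)                           # higher degree -> smaller residual
```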