Least-Squares


Least-Squares Problems

If \(A \in \mathbb{R}^{m \times n}\) and \(b \in \mathbb{R}^m\), a least-squares solution of \(Ax = b\) is an \(\hat{x} \in \mathbb{R}^n\) s.t. \[\forall x \in \mathbb{R}^n, \quad \| b - A\hat{x} \| \leq \| b - Ax \|. \] We call the norm \(\| b - A\hat{x} \| \) the least-squares error of the approximation.

In practice, it can happen that a solution of some problem \(Ax = b\) is required but does not exist (i.e., the system is inconsistent). In that case, we look for an \(x\) that makes \(Ax\) as close to \(b\) as possible. Let \[\hat{b} = \text{proj}_{\text{Col }A} \, b \] Since \(\hat{b} \in \text{Col }A \,\), the system \(Ax = \hat{b}\) is consistent, and \(\hat{b}\) is the closest point to \(b\) in \(\text{Col }A\). Hence there exists a least-squares solution \(\hat{x} \in \mathbb{R}^n\) s.t. \(A\hat{x} = \hat{b}\).

Since \(\hat{b}\) is the orthogonal projection of \(b\) onto \(\text{Col }A\), the residual \(b - \hat{b} = b - A\hat{x}\) is orthogonal to \(\text{Col }A\). (Note: \(b = A\hat{x} + (b - A\hat{x})\))
Let \(a_j\) be a column of \(A\), then \[ \begin{align*} a_j \cdot (b-A\hat{x}) = 0 &\Longrightarrow a_j^T(b-A\hat{x}) = 0 \quad \text{for each } j \\\\ &\Longrightarrow A^T (b-A\hat{x}) = 0 \end{align*} \] since the rows of \(A^T\) are exactly the \(a_j^T\), and we obtain \[A^TA\hat{x} = A^Tb\] So, every least-squares solution of \(Ax = b\) must satisfy the normal equations \[A^TAx = A^Tb\]
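As a quick numerical check, here is a minimal NumPy sketch (the matrix \(A\) and vector \(b\) below are made-up example data) that solves the normal equations and verifies that the residual \(b - A\hat{x}\) is orthogonal to the columns of \(A\):

```python
import numpy as np

# A small inconsistent system: 3 equations, 2 unknowns (made-up example data)
A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Normal equations: A^T A x = A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Orthogonal projection of b onto Col A and the least-squares error ||b - A x_hat||
b_hat = A @ x_hat
error = np.linalg.norm(b - b_hat)

# The residual is orthogonal to every column of A (zero up to rounding error)
print(x_hat, error, A.T @ (b - b_hat))
```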
If \(A^TA\) is invertible (equivalently, the columns of \(A\) are linearly independent), the least-squares solution is unique: \[\hat{x} = (A^TA)^{-1}A^Tb\] We also get the orthogonal projection of \(b\) onto \(\text{Col } A\): \[ \begin{align*} \hat{b} &= \text{proj}_{\text{Col }A} \, b \\\\ &= A\hat{x} = A(A^TA)^{-1}A^Tb \end{align*} \] In practice, forming \(A^TA\) explicitly can introduce relatively large errors in \(\hat{x}\) (it squares the condition number of \(A\)). Instead, the QR factorization gives a more reliable result. Let \(A = QR\), where \(Q\) has orthonormal columns (\(Q^TQ = I\)) and \(R\) is upper triangular and invertible. Then \[ \begin{align*} \hat{x} &= (R^TQ^TQR)^{-1}(R^TQ^T)b \\\\ &= (R^TR)^{-1}(R^TQ^T)b. \tag{1} \end{align*} \] On the other hand, \(A\hat{x} = \hat{b}\) and \(Q^T\hat{b} = Q^Tb\), since the residual \(b - \hat{b}\) is orthogonal to \(\text{Col }A = \text{Col }Q\). Hence \[ \begin{align*} QR\hat{x} = \hat{b} &\Longrightarrow Q^TQR\hat{x} = Q^Tb \\\\ &\Longrightarrow R\hat{x} = Q^Tb \\\\ &\Longrightarrow \hat{x} = R^{-1}Q^Tb \\\\ \end{align*} \] Comparing with (1), \((R^TR)^{-1}R^T = R^{-1}\). Thus the unique least-squares solution can be written as \(\hat{x} = R^{-1}Q^Tb\), but in practice, solving \[R\hat{x} = Q^Tb\] by back-substitution is much faster (and more stable) than computing \(R^{-1}\).
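And here is a small sketch of the QR route with the same made-up data; it uses NumPy's qr together with SciPy's solve_triangular for the back-substitution step, and cross-checks the result against np.linalg.lstsq:

```python
import numpy as np
from scipy.linalg import solve_triangular  # triangular solve = back-substitution

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0]])
b = np.array([1.0, 2.0, 2.0])

# Reduced QR factorization: A = QR with Q^T Q = I and R upper triangular
Q, R = np.linalg.qr(A)

# Solve R x = Q^T b by back-substitution instead of forming A^T A or R^{-1}
x_qr = solve_triangular(R, Q.T @ b, lower=False)

# Cross-check against the library's least-squares routine
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_qr, x_lstsq))  # True
```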

Linear Regression

Given a set of observed data points \(\{(x_i, y_i)\}_{i = 1}^{n}\), where \(x_i \in \mathbb{R}^d\) and \(y_i \in \mathbb{R}\), we assume that the data can be explained by the linear model \[Y = X\beta + \epsilon \] where \(Y \in \mathbb{R}^n\) stacks the observed values \(y_i\), \(X\) is the design matrix whose \(i\)-th row is built from \(x_i\) (typically with a leading \(1\) for the intercept term), \(\beta\) is the vector of unknown parameters, and \(\epsilon \in \mathbb{R}^n\) is the vector of residuals.

The dimension \(d\) is the number of features, and \(n\) is the number of data points. Each residual \(\epsilon_i\) is the difference between the observed value \(y_i\) and its corresponding predicted \(y\) value. The least-squares hyperplane represents the set of predicted \(y\) values based on the estimated parameters (weights) \(\hat{\beta}\), which must satisfy the normal equations: \[X^TX\hat{\beta} = X^TY\]

Note: the linear model is linear in the parameters \(\beta\), not in \(X\). We can apply any non-linear transformation to each \(x_i\); for example, \(y = \beta_0 + \beta_1 x^2 + \beta_2 \sin(2\pi x)\) is still a linear model. So \(Y\) is modeled as a linear combination of the features (predictors) in \(X\) with respect to the coefficients (weights) \(\beta\).
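As an illustration, here is a minimal sketch that fits exactly this model: the design matrix is built from the transformed features and the normal equations are solved for \(\hat{\beta}\) (the data are synthetic, generated from made-up coefficients plus noise):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data from made-up coefficients (0.5, 2.0, 1.5) plus Gaussian noise
x = np.linspace(0.0, 1.0, 50)
y = 0.5 + 2.0 * x**2 + 1.5 * np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)

# Design matrix with non-linear transformations of x; the model stays linear in beta
X = np.column_stack([np.ones_like(x), x**2, np.sin(2 * np.pi * x)])

# Normal equations: X^T X beta = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # estimates of (beta_0, beta_1, beta_2)
```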

There are many more details to cover on this topic; I leave them for Section III - Probability & Statistics - Linear Regression.
Finally, note that \(X^TX\) is a symmetric matrix. We will learn about such special matrices in the next part.

Interactive Linear Regression Demo

Linear regression finds the best-fitting line or curve for a set of data points by minimizing the sum of squared differences between observed and predicted values. This interactive demo lets you explore how polynomial regression works with different datasets.

Try adding your own data points by clicking on the plot, or use the example datasets to see how different polynomial degrees fit various patterns. The demo visually demonstrates key concepts from least-squares theory and the normal equations.
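The demo itself runs in the browser, but the computation it performs can be sketched in a few lines of NumPy (the data points and polynomial degree below are made up for illustration):

```python
import numpy as np

# Made-up data points, standing in for points clicked on the demo's plot
x = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y = np.array([0.1, 0.8, 0.9, 0.4, -0.3, -0.8, -0.9])

degree = 3  # polynomial degree, as selectable in the demo

# Vandermonde design matrix: columns 1, x, x^2, ..., x^degree
X = np.vander(x, degree + 1, increasing=True)

# Least-squares fit: minimizes the sum of squared differences ||y - X c||^2
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)

# Fitted curve evaluated on a fine grid, as drawn by the demo
grid = np.linspace(x.min(), x.max(), 200)
fitted = np.vander(grid, degree + 1, increasing=True) @ coeffs
```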