Regularized Regression

  • Ridge Regression
  • Demo: Ridge Regression
  • Cross Validation
  • Bias-Variance Tradeoff in Ridge Regression
  • Lasso Regression

Ridge Regression

Let's first review linear regression. Suppose we are given observed data points \({\{(x_i, y_i)\}_{i = 1}^n}\), where each \(x_i \in \mathbb{R}^d\) is a feature vector, and \(y_i \in \mathbb{R}\) is the corresponding target value. The linear regression model assumes a linear relationship: \[ y = Xw + \epsilon, \] where \(X \in \mathbb{R}^{n \times d}\) is the design matrix, \(w \in \mathbb{R}^d\) is the weight vector to be estimated, and \(\epsilon \sim \mathcal{N}(0, \sigma^2 I)\) is a Gaussian noise vector.

In linear regression, the maximum likelihood estimator (MLE) of \(w\) coincides with the least-squares solution: \[ \begin{align*} \widehat{w}_{LS} &= \arg \min_w \sum_{i=1}^n (y_i - {x_i}^\top w)^2 \\\\ &= \arg \min_w \|y - Xw\|_2^2 \\\\ &= \arg \min_w (y - Xw)^\top (y - Xw) \\\\ &= \arg \min_w y^\top y - y^\top Xw - w^\top X^\top y + w^\top X^\top X w \\\\ &= \arg \min_w f(w) \\\\ \end{align*} \] Differentiating with respect to \(w\) and setting the gradient to zero: \[ \begin{align*} \nabla f(w) &= -X^\top y - X^\top y + 2X^\top Xw \\\\ &= -2X^\top y + 2 X^\top X w = 0 \end{align*} \] Thus, if \((X^\top X)^{-1}\) exists, \[ \begin{align*} \widehat{w}_{LS} &= (X^\top X)^{-1}X^\top y \\\\ &= \widehat{w}_{MLE}. \end{align*} \]
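As a concrete illustration, the closed-form solution can be computed directly with NumPy. This is a minimal sketch; the synthetic data below (sizes, noise level, true weights) is an assumption for demonstration, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: n samples, d features, known true weights.
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form least squares: w_LS = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over explicitly inverting X^T X.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ls)  # should be close to w_true
```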

However, when \(d \gg n\), the matrix \(X^\top X\) becomes ill-conditioned or singular. The objective surface is then flat along some directions, making the solution non-unique and highly sensitive to noise, which manifests as overfitting.

To mitigate this issue, we add a penalty term to the objective. This technique is called regularization, and the modified version of linear regression is known as ridge regression: \[ \begin{align*} \widehat{w}_{ridge} &= \arg \min_w \sum_{i=1}^n (y_i - {x_i}^\top w)^2 +\underbrace{\lambda \|w\|_2^2}_{\text{Squared \(\ell_2\)-norm regularizer}} \\\\ &= \arg \min_w \|y - Xw\|_2^2 + \lambda w^\top w \\\\ &= \arg \min_w (Xw -y)^\top (Xw-y)+\lambda w^\top w \\\\ &= \arg \min_w y^\top y -2w^\top X^\top y +w^\top X^\top Xw + w^\top (\lambda I)w \\\\ &= \arg \min_w y^\top y -2w^\top X^\top y +w^\top (X^\top X + \lambda I)w \\\\ &= \arg \min_w J(w) \end{align*} \] Then \[ \nabla_w J(w) = -2X^\top y +2(X^\top X + \lambda I )w = 0. \] Solving this equation for \(w\), we obtain \[ \widehat{w}_{ridge} = (X^\top X + \lambda I)^{-1}X^\top y. \]
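Continuing the sketch above, the ridge solution changes the linear system only by adding \(\lambda I\). The following minimal example (again with illustrative synthetic data) sweeps \(\lambda\) and prints the norm of the estimate, making the shrinkage effect visible:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.0, 1.0, 10.0, 100.0]:
    w = ridge(X, y, lam)
    print(f"lambda = {lam:6.1f}   ||w||_2 = {np.linalg.norm(w):.4f}")
```

As \(\lambda\) grows, \(\|\widehat{w}_{ridge}\|_2\) shrinks, which previews the discussion below.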

The hyperparameter \(\lambda \geq 0\) controls the strength of the regularization:

  • Small \(\lambda\): the penalty is weak, and \(\widehat{w}_{ridge}\) approaches the least-squares solution (low bias, but high variance when \(X^\top X\) is ill-conditioned).
  • Large \(\lambda\): the weights are shrunk toward zero, giving a more stable but more biased estimate (high bias, low variance).
This illustrates the classic bias-variance tradeoff in statistical learning.

Demo: Ridge Regression

Note: A tiny regularization term \(\lambda = 0.0001\) is added to the linear regression fit for numerical stability.

After running the demo above, we observe that linear regression tends to achieve lower training error but at the cost of higher test error and wildly large coefficients, especially with high polynomial degrees. This means the model memorizes the training data too closely and fails to generalize well to new, unseen data.

In contrast, ridge regression, by constraining the weights, may have slightly higher training error but typically generalizes better to new data. This trade-off between fitting the training data and maintaining model simplicity is known as the bias-variance tradeoff. Increasing the model complexity (e.g., by raising the polynomial degree) monotonically reduces training error, but after a certain point, the test error begins to increase. This turning point highlights the danger of overfitting.

Cross Validation

To build a model that generalizes well and to select an appropriate regularization strength \(\lambda\), we employ K-fold cross-validation (CV) on the training set. The training data is partitioned into \(K\) equal-sized folds. For each fold \(k \in \{1, \cdots, K\}\), the model is trained on the other \(K - 1\) folds and validated on the \(k\)th fold. This process is repeated \(K\) times, so that each training data point is used for validation exactly once. The average validation error is used to select the optimal hyperparameter \(\lambda\). Finally, the model is evaluated on the held-out test set.

Here, we demonstrate how K-fold CV is used to tune the regularization parameter \(\lambda\) in ridge regression.
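A minimal sketch of this tuning loop, using scikit-learn's `KFold` and `Ridge` (where `alpha` plays the role of \(\lambda\)); the data and the candidate grid are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]   # candidate grid (illustrative)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

best_lam, best_err = None, np.inf
for lam in lambdas:
    fold_errs = []
    for train_idx, val_idx in kf.split(X):
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        resid = model.predict(X[val_idx]) - y[val_idx]
        fold_errs.append(np.mean(resid ** 2))   # validation MSE on fold k
    avg_err = np.mean(fold_errs)
    if avg_err < best_err:
        best_lam, best_err = lam, avg_err

print("selected lambda:", best_lam)
```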

A special case of K-fold CV is Leave-One-Out Cross-Validation (LOOCV), where the number of folds \(K\) equals the number of training data points. That is, the model is trained on all training points except one, which is used for validation, and this is repeated for each data point. Because each training set is nearly the full dataset, LOOCV tends to produce a low-bias estimate of the validation error. However, it is computationally expensive for large datasets. In contrast, smaller values of \(K\) (such as 5 to 10) offer a good tradeoff between bias, variance, and computational cost, making them a popular choice in practice.

Bias-Variance Tradeoff in Ridge Regression

Let's dive into a formal derivation of the expected prediction error in the context of ridge regression. Recall that we obtained: \[ \widehat{w}_{ridge} = (X^\top X + \lambda I)^{-1}X^\top y. \] We assume the probabilistic generative model: \[ x_i \sim P_X, \quad y = Xw + \epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2 I). \] The expected prediction error at a point with feature \(x\) is given by: \[ \begin{align*} &\mathbb{E}_{y, D \mid x} \left[ \left(y - x^\top \widehat{w}_{\text{ridge}} \right)^2 \mid x \right] \\\\ &= \mathbb{E}_{y \mid x} \left[ \left(y - \mathbb{E}[y \mid x] \right)^2 \mid x \right] + \mathbb{E}_D \left[ \left( \mathbb{E}[y \mid x] -x^\top \widehat{w}_{ridge} \right)^2 \mid x \right] \\\\ &= \underbrace{ \mathbb{E}_{y \mid x} \left[ \left(y - x^\top w \right)^2 \mid x \right] }_{\text{Irreducible error (Noise)}} + \mathbb{E}_D \left[ \left(x^\top w - x^\top \widehat{w}_{ridge} \right)^2 \mid x \right] \\\\ &= \underbrace{\sigma^2}_{\text{Noise}} + \underbrace{\left(x^\top w - \mathbb{E}_D \left[x^\top \widehat{w}_{ridge} \mid x \right] \right)^2 }_{(\text{Bias})^2} + \underbrace{ \mathbb{E}_D \left[ \left( x^\top \widehat{w}_{ridge} - \mathbb{E}_D \left[ x^\top \widehat{w}_{ridge} \mid x\right] \right)^2 \mid x \right] }_{\text{Variance}} \tag{1}\\\\ \end{align*} \] Here the cross term vanishes because the noise in \(y\) at the test point is independent of the training data \(D\); this is the classical decomposition into noise, squared bias, and variance. Now we assume \(X^\top X = n I\) for simplicity. (This corresponds to assuming that the columns of \(X\) are orthonormal and that we have exactly \(n\) samples, or in practice that we've whitened the design matrix and scaled it so \(X^\top X \approx n I\).) Then \[ \begin{align*} \widehat{w}_{ridge} &= (X^\top X + \lambda I)^{-1} X^\top (Xw + \epsilon) \\\\ &= (nI + \lambda I)^{-1}(nIw + X^\top \epsilon) \\\\ &= (n + \lambda)^{-1} I (nw + X^\top \epsilon) \\\\ &= \frac{n}{n + \lambda}w + \frac{1}{n + \lambda} X^\top \epsilon. \end{align*} \] Thus, \[ x^\top \widehat{w}_{ridge} = \underbrace{\frac{n}{n + \lambda}x^\top w}_{\text{Constant}} + \underbrace{\frac{1}{n + \lambda} (Xx)^\top \epsilon}_{\text{Random}}. \] Since \(\mathbb{E}[\epsilon] = 0\), the bias term of Expression 1 becomes: \[ \begin{align*} & \left( x^\top w -\mathbb{E}_D \left[x^\top \widehat{w}_{ridge}\right] \right)^2 \\\\ &= \left( x^\top w -\mathbb{E}_D \left[\frac{n}{n + \lambda}x^\top w + \frac{1}{n + \lambda} (Xx)^\top \epsilon \right] \right)^2 \\\\ &= \left( x^\top w - \frac{n}{n + \lambda}x^\top w \right)^2 \\\\ &= \frac{\lambda^2}{(n + \lambda)^2}(x^\top w)^2. \end{align*} \] Since the variance of a constant is zero, the variance term of Expression 1 becomes: \[ \begin{align*} &\mathbb{E}_D \left[ \left( x^\top \widehat{w}_{ridge} - \mathbb{E}_D \left[ x^\top \widehat{w}_{ridge} \mid x\right] \right)^2 \mid x \right] \\\\ &= \text{Var}\left[ \frac{1}{n + \lambda} (Xx)^\top \epsilon \right] \\\\ &= \frac{1}{(n + \lambda)^2}\text{Var}\left[(Xx)^\top \epsilon \right] \\\\ &= \frac{1}{(n + \lambda)^2}(Xx)^\top \text{Var}[\epsilon] (Xx) \\\\ &= \frac{1}{(n + \lambda)^2} x^\top X^\top (\sigma^2 I) X x \\\\ &= \frac{\sigma^2}{(n + \lambda)^2} x^\top (X^\top X) x \\\\ &= \frac{\sigma^2}{(n + \lambda)^2} x^\top (n I) x \\\\ &= \frac{\sigma^2 n}{(n + \lambda)^2} \| x \|_2^2. \end{align*} \] Therefore, putting it all together: \[ \begin{align*} &\mathbb{E}_{y, D \mid x} \left[ \left(y - x^\top \widehat{w}_{\text{ridge}} \right)^2 \mid x \right] \\\\ &= \sigma^2 + \frac{\lambda^2}{(n + \lambda)^2}(x^\top w)^2 + \frac{\sigma^2 n}{(n + \lambda)^2} \| x \|_2^2. \end{align*} \]
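The closed-form bias and variance above can be sanity-checked by simulation. The sketch below is an illustrative Monte Carlo check: it constructs a design with \(X^\top X = nI\) exactly and averages over noise draws; all sizes and constants are assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, lam = 50, 5, 1.0, 10.0

# Design with X^T X = n I exactly: orthonormal columns scaled by sqrt(n).
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = np.sqrt(n) * Q

w = rng.normal(size=d)   # true weights
x = rng.normal(size=d)   # fixed test point

# Ridge estimator as a linear map of y: w_hat = A y.
A = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)

# Monte Carlo over noise draws epsilon ~ N(0, sigma^2 I).
preds = np.array([x @ (A @ (X @ w + sigma * rng.normal(size=n)))
                  for _ in range(20000)])

bias2_mc, var_mc = (x @ w - preds.mean()) ** 2, preds.var()
bias2_th = (lam / (n + lam)) ** 2 * (x @ w) ** 2
var_th = sigma**2 * n / (n + lam) ** 2 * (x @ x)
print(f"bias^2: MC {bias2_mc:.4f} vs. theory {bias2_th:.4f}")
print(f"var:    MC {var_mc:.4f} vs. theory {var_th:.4f}")
```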

Lasso Regression

In the ridge regression objective, the squared \(\ell_2\)-norm \( \| w \|_2^2 \) is used as the regularizer. Instead, here we use the \(\ell_1\) regularizer: \[ \widehat{w}_{\text{lasso}} = \arg \min_w \sum_{i=1}^n (y_i - {x_i}^\top w)^2 + \lambda \|w\|_1. \] This formulation is known as least absolute shrinkage and selection operator (Lasso) regression.

The significance of Lasso is that it automatically performs feature selection. By penalizing the \(\ell_1\)-norm of the weight vector, Lasso often drives many \(w_j\) to exactly zero, effectively dropping those features from the model. Intuitively, this is because the \(\ell_1\) regularizer has "pointy" level sets whose corners lie on the coordinate axes, so the constrained optimum often lands where some coordinates are exactly zero; the \(\ell_2\) regularizer, in contrast, shrinks all coefficients "smoothly" toward zero but never exactly to zero.
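A minimal sketch of this sparsity effect, using scikit-learn's `Lasso` (note that its `alpha` corresponds to \(\lambda\) up to a \(1/(2n)\) scaling of the squared-error term); the data are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]          # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=n)

model = Lasso(alpha=0.1).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
# Typically recovers features 0, 1, 2 and zeros out the rest.
```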

In Lasso regression, after feature selection with the optimal \(\lambda\), we then "retrain" with \(\lambda = 0\) (i.e., solve ordinary least squares) using only the selected features. Concretely, after running Lasso, form the index set \[ S = \{j : \widehat{w}_{\text{lasso}, j} \neq 0\} \] and then solve \[ \tilde{w} = \arg \min_{w_S} \sum_{i=1}^n (y_i - x_{i, S}^\top w_S)^2, \] where \(x_{i, S}\) is the subvector of features in \(S\), and we set \(\tilde{w}_j = 0\) for \(j \notin S\).
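One sketch of this two-step procedure (Lasso for selection, then ordinary least squares on the support); the data and `alpha` are illustrative assumptions carried over from the previous example:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.normal(size=n)

# Step 1: Lasso for feature selection.
lasso = Lasso(alpha=0.1).fit(X, y)
S = np.flatnonzero(lasso.coef_)        # S = {j : w_lasso_j != 0}

# Step 2: unregularized least squares on the selected columns only.
w_tilde = np.zeros(d)
if S.size > 0:
    w_tilde[S] = LinearRegression().fit(X[:, S], y).coef_

print("support S:", S)
print("refit coefficients:", np.round(w_tilde[S], 3))
```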

Lasso vs Dimensionality Reduction:

Feature selection via Lasso sounds like dimensionality reduction (e.g., PCA) because both reduce the number of active predictors, but they are fundamentally different.

Suppose we originally have 100 variables and our final model uses 10.

  • In Lasso: the model uses 10 of the 100 original variables. If the true signal is spread across many small effects, dropping features outright can hurt performance.
  • In PCA: the model uses 10 new variables, each a linear combination of all 100 originals. PCA captures directions of maximal variance even if no single original variable is strongly predictive, and it improves numerical stability when features are highly correlated, but we lose direct feature interpretability.

For instance, when there are more features than samples, Lasso can still fit a model by zeroing out most coefficients, whereas PCA followed by regression can overfit or require careful component-selection heuristics.

In practice, you might even combine the two: run PCA down to a moderate number of components and then apply Lasso on those components, if you need both compression and sparsity.
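A sketch of such a combination, using scikit-learn's `Pipeline`; the component count and `alpha` below are illustrative choices, not recommendations:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
n, d = 200, 100
X = rng.normal(size=(n, d))
y = X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=n)

# Compress to 20 principal components, then let Lasso sparsify among them.
model = Pipeline([
    ("pca", PCA(n_components=20)),
    ("lasso", Lasso(alpha=0.05)),
]).fit(X, y)

n_active = np.count_nonzero(model.named_steps["lasso"].coef_)
print(f"active components: {n_active} of 20")
```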