Regularized Regression

  • Ridge Regression
  • Demo: Ridge Regression
  • Cross Validation
  • Bias-Variance Tradeoff in Ridge Regression
  • Lasso Regression

Ridge Regression

Let's first review linear regression. Suppose we are given observed data points \({\{(x_i, y_i)\}_{i = 1}^n}\), where each \(x_i \in \mathbb{R}^d\) is a feature vector, and \(y_i \in \mathbb{R}\) is the corresponding target value. The linear regression model assumes a linear relationship: \[ y = Xw + \epsilon, \] where \(X \in \mathbb{R}^{n \times d}\) is the design matrix, \(w \in \mathbb{R}^d\) is the weight vector to be estimated, and \(\epsilon \sim \mathcal{N}(0, \sigma^2 I)\) is a Gaussian noise vector.

In linear regression, the maximum likelihood estimator (MLE) of \(w\) coincides with the least-squares solution: \[ \begin{align*} \widehat{w}_{LS} &= \arg \min_w \sum_{i=1}^n (y_i - {x_i}^\top w)^2 \\\\ &= \arg \min_w \|y - Xw\|_2^2 \\\\ &= \arg \min_w (y - Xw)^\top (y - Xw) \\\\ &= \arg \min_w y^\top y - y^\top Xw - w^\top X^\top y + w^\top X^\top X w \\\\ &= \arg \min_w f(w) \\\\ \end{align*} \] Differentiating with respect to \(w\) and setting the gradient to zero: \[ \begin{align*} \nabla f(w) &= -X^\top y - X^\top y + 2X^\top Xw \\\\ &= -2X^\top y + 2 X^\top X w = 0 \end{align*} \] Thus, if \((X^\top X)^{-1}\) exists, \[ \begin{align*} \widehat{w}_{LS} &= (X^\top X)^{-1}X^\top y \\\\ &= \widehat{w}_{MLE}. \end{align*} \]
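As a concrete illustration, the closed-form solution can be computed directly with NumPy. This is a minimal sketch; the synthetic data below (sizes, noise level, true weights) is an assumption for demonstration, not part of the derivation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data: n samples, d features, known true weights.
n, d = 100, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form least squares: w_LS = (X^T X)^{-1} X^T y.
# Solving the linear system is preferred over explicitly inverting X^T X.
w_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(w_ls)  # should be close to w_true
```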

However, when \(d \gg n\), the matrix \(X^\top X\) becomes ill-conditioned or singular. The objective surface is then flat along some directions, making the solution non-unique and highly sensitive to noise, which manifests as overfitting.

To mitigate this issue, we add a penalty term to the objective. This technique is called regularization, and the modified version of linear regression is known as ridge regression: \[ \begin{align*} \widehat{w}_{ridge} &= \arg \min_w \sum_{i=1}^n (y_i - {x_i}^\top w)^2 +\underbrace{\lambda \|w\|_2^2}_{\text{Squared \(\ell_2\)-norm regularizer}} \\\\ &= \arg \min_w \|y - Xw\|_2^2 + \lambda w^\top w \\\\ &= \arg \min_w (Xw -y)^\top (Xw-y)+\lambda w^\top w \\\\ &= \arg \min_w y^\top y -2w^\top X^\top y +w^\top X^\top Xw + w^\top (\lambda I)w \\\\ &= \arg \min_w y^\top y -2w^\top X^\top y +w^\top (X^\top X + \lambda I)w \\\\ &= \arg \min_w J(w) \end{align*} \] Then \[ \nabla_w J(w) = -2X^\top y +2(X^\top X + \lambda I )w = 0. \] Solving this equation for \(w\), we obtain \[ \widehat{w}_{ridge} = (X^\top X + \lambda I)^{-1}X^\top y. \]
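Continuing the sketch above, the ridge solution changes the linear system only by adding \(\lambda I\). The following minimal example (again with illustrative synthetic data) sweeps \(\lambda\) and prints the norm of the estimate, making the shrinkage effect visible:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X^T X + lam I)^{-1} X^T y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.0, 1.0, 10.0, 100.0]:
    w = ridge(X, y, lam)
    print(f"lambda = {lam:6.1f}   ||w||_2 = {np.linalg.norm(w):.4f}")
```

As \(\lambda\) grows, \(\|\widehat{w}_{ridge}\|_2\) shrinks, which previews the discussion below.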

The hyperparameter \(\lambda \geq 0\) controls the strength of the regularization:

  • Small \(\lambda\): the penalty is weak, and \(\widehat{w}_{ridge}\) approaches the least-squares solution (low bias, but high variance when \(X^\top X\) is ill-conditioned).
  • Large \(\lambda\): the weights are shrunk toward zero, giving a more stable but more biased estimate (high bias, low variance).
This illustrates the classic bias-variance tradeoff in statistical learning.

Demo: Ridge Regression

Note: A tiny regularization term \(\lambda = 0.0001\) is added to the linear regression fit for numerical stability.

After running the demo above, we observe that linear regression tends to achieve lower training error but at the cost of higher test error and wildly large coefficients, especially with high polynomial degrees. This means the model memorizes the training data too closely and fails to generalize well to new, unseen data.

In contrast, ridge regression, by constraining the weights, may have slightly higher training error but typically generalizes better to new data. This trade-off between fitting the training data and maintaining model simplicity is known as the bias-variance tradeoff. Increasing the model complexity (e.g., by raising the polynomial degree) monotonically reduces training error, but after a certain point, the test error begins to increase. This turning point highlights the danger of overfitting.

Cross Validation

To build a model that generalizes well and to select an appropriate regularization strength \(\lambda\), we employ K-fold cross-validation (CV) on the training set. The training data is partitioned into \(K\) equal-sized folds. For each fold \(k \in \{1, \cdots, K\}\), the model is trained on the other \(K - 1\) folds and validated on the \(k\)th fold. This process is repeated \(K\) times, so that each training data point is used for validation exactly once. The average validation error is used to select the optimal hyperparameter \(\lambda\). Finally, the model is evaluated on the held-out test set.

Here, we demonstrate how K-fold CV is used to tune the regularization parameter \(\lambda\) in ridge regression.
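A minimal sketch of this tuning loop, using scikit-learn's `KFold` and `Ridge` (where `alpha` plays the role of \(\lambda\)); the data and the candidate grid are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]   # candidate grid (illustrative)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

best_lam, best_err = None, np.inf
for lam in lambdas:
    fold_errs = []
    for train_idx, val_idx in kf.split(X):
        model = Ridge(alpha=lam).fit(X[train_idx], y[train_idx])
        resid = model.predict(X[val_idx]) - y[val_idx]
        fold_errs.append(np.mean(resid ** 2))   # validation MSE on fold k
    avg_err = np.mean(fold_errs)
    if avg_err < best_err:
        best_lam, best_err = lam, avg_err

print("selected lambda:", best_lam)
```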

A special case of K-fold CV is Leave-One-Out Cross-Validation (LOOCV), where the number of folds \(K\) equals the number of training data points. That is, the model is trained on all training points except one, which is used for validation, and this is repeated for each data point. Because each training set is nearly the full dataset, LOOCV tends to produce a low-bias estimate of the validation error. However, it is computationally expensive for large datasets. In contrast, smaller values of \(K\) (such as 5 to 10) offer a good tradeoff between bias, variance, and computational cost, making them a popular choice in practice.

Bias-Variance Tradeoff in Ridge Regression

Let's dive into a formal derivation of the expected prediction error in the context of ridge regression. Recall that we obtained: \[ \widehat{w}_{ridge} = (X^\top X + \lambda I)^{-1}X^\top y. \] We assume the probabilistic generative model: \[ x_i \sim P_X, \quad y = Xw + \epsilon, \text{ where } \epsilon \sim \mathcal{N}(0, \sigma^2 I). \] The expected prediction error at a point with feature \(x\) is given by: \[ \begin{align*} &\mathbb{E}_{y, D \mid x} \left[ \left(y - x^\top \widehat{w}_{\text{ridge}} \right)^2 \mid x \right] \\\\ &= \mathbb{E}_{y \mid x} \left[ \left(y - \mathbb{E}[y \mid x] \right)^2 \mid x \right] + \mathbb{E}_D \left[ \left( \mathbb{E}[y \mid x] -x^\top \widehat{w}_{ridge} \right)^2 \mid x \right] \\\\ &= \underbrace{ \mathbb{E}_{y \mid x} \left[ \left(y - x^\top w \right)^2 \mid x \right] }_{\text{Irreducible error (Noise)}} + \mathbb{E}_D \left[ \left(x^\top w - x^\top \widehat{w}_{ridge} \right)^2 \mid x \right] \\\\ &= \underbrace{\sigma^2}_{\text{Noise}} + \underbrace{\left(x^\top w - \mathbb{E}_D \left[x^\top \widehat{w}_{ridge} \mid x \right] \right)^2 }_{(\text{Bias})^2} + \underbrace{ \mathbb{E}_D \left[ \left( x^\top \widehat{w}_{ridge} - \mathbb{E}_D \left[ x^\top \widehat{w}_{ridge} \mid x\right] \right)^2 \mid x \right] }_{\text{Variance}} \tag{1}\\\\ \end{align*} \] Here the cross term vanishes because the noise in \(y\) at the test point is independent of the training data \(D\); this is the classical decomposition into noise, squared bias, and variance. Now we assume \(X^\top X = n I\) for simplicity. (This corresponds to assuming that the columns of \(X\) are orthonormal and that we have exactly \(n\) samples, or in practice that we've whitened the design matrix and scaled it so \(X^\top X \approx n I\).) Then \[ \begin{align*} \widehat{w}_{ridge} &= (X^\top X + \lambda I)^{-1} X^\top (Xw + \epsilon) \\\\ &= (nI + \lambda I)^{-1}(nIw + X^\top \epsilon) \\\\ &= (n + \lambda)^{-1} I (nw + X^\top \epsilon) \\\\ &= \frac{n}{n + \lambda}w + \frac{1}{n + \lambda} X^\top \epsilon. \end{align*} \] Thus, \[ x^\top \widehat{w}_{ridge} = \underbrace{\frac{n}{n + \lambda}x^\top w}_{\text{Constant}} + \underbrace{\frac{1}{n + \lambda} (Xx)^\top \epsilon}_{\text{Random}}. \] Since \(\mathbb{E}[\epsilon] = 0\), the bias term of Expression 1 becomes: \[ \begin{align*} & \left( x^\top w -\mathbb{E}_D \left[x^\top \widehat{w}_{ridge}\right] \right)^2 \\\\ &= \left( x^\top w -\mathbb{E}_D \left[\frac{n}{n + \lambda}x^\top w + \frac{1}{n + \lambda} (Xx)^\top \epsilon \right] \right)^2 \\\\ &= \left( x^\top w - \frac{n}{n + \lambda}x^\top w \right)^2 \\\\ &= \frac{\lambda^2}{(n + \lambda)^2}(x^\top w)^2. \end{align*} \] Since the variance of a constant is zero, the variance term of Expression 1 becomes: \[ \begin{align*} &\mathbb{E}_D \left[ \left( x^\top \widehat{w}_{ridge} - \mathbb{E}_D \left[ x^\top \widehat{w}_{ridge} \mid x\right] \right)^2 \mid x \right] \\\\ &= \text{Var}\left[ \frac{1}{n + \lambda} (Xx)^\top \epsilon \right] \\\\ &= \frac{1}{(n + \lambda)^2}\text{Var}\left[(Xx)^\top \epsilon \right] \\\\ &= \frac{1}{(n + \lambda)^2}(Xx)^\top \text{Var}[\epsilon] (Xx) \\\\ &= \frac{1}{(n + \lambda)^2} x^\top X^\top (\sigma^2 I) X x \\\\ &= \frac{\sigma^2}{(n + \lambda)^2} x^\top (X^\top X) x \\\\ &= \frac{\sigma^2}{(n + \lambda)^2} x^\top (n I) x \\\\ &= \frac{\sigma^2 n}{(n + \lambda)^2} \| x \|_2^2. \end{align*} \] Therefore, putting it all together: \[ \begin{align*} &\mathbb{E}_{y, D \mid x} \left[ \left(y - x^\top \widehat{w}_{\text{ridge}} \right)^2 \mid x \right] \\\\ &= \sigma^2 + \frac{\lambda^2}{(n + \lambda)^2}(x^\top w)^2 + \frac{\sigma^2 n}{(n + \lambda)^2} \| x \|_2^2. \end{align*} \]
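The closed-form bias and variance above can be sanity-checked by simulation. The sketch below is an illustrative Monte Carlo check: it constructs a design with \(X^\top X = nI\) exactly and averages over noise draws; all sizes and constants are assumptions for demonstration:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma, lam = 50, 5, 1.0, 10.0

# Design with X^T X = n I exactly: orthonormal columns scaled by sqrt(n).
Q, _ = np.linalg.qr(rng.normal(size=(n, d)))
X = np.sqrt(n) * Q

w = rng.normal(size=d)   # true weights
x = rng.normal(size=d)   # fixed test point

# Ridge estimator as a linear map of y: w_hat = A y.
A = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T)

# Monte Carlo over noise draws epsilon ~ N(0, sigma^2 I).
preds = np.array([x @ (A @ (X @ w + sigma * rng.normal(size=n)))
                  for _ in range(20000)])

bias2_mc, var_mc = (x @ w - preds.mean()) ** 2, preds.var()
bias2_th = (lam / (n + lam)) ** 2 * (x @ w) ** 2
var_th = sigma**2 * n / (n + lam) ** 2 * (x @ x)
print(f"bias^2: MC {bias2_mc:.4f} vs. theory {bias2_th:.4f}")
print(f"var:    MC {var_mc:.4f} vs. theory {var_th:.4f}")
```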

Lasso Regression

In the ridge regression objective, the squared \(\ell_2\)-norm \( \| w \|_2^2 \) is used as the regularizer. Instead, here we use the \(\ell_1\) regularizer: \[ \widehat{w}_{\text{lasso}} = \arg \min_w \sum_{i=1}^n (y_i - {x_i}^\top w)^2 + \lambda \|w\|_1. \] This formulation is known as least absolute shrinkage and selection operator (Lasso) regression.

The significance of Lasso is that it automatically performs feature selection. By penalizing the \(\ell_1\)-norm of the weight vector, Lasso often drives many \(w_j\) to exactly zero, effectively dropping those features from the model. Intuitively, this is because the \(\ell_1\) regularizer has "pointy" level sets whose corners lie on the coordinate axes, so the constrained optimum often lands where some coordinates are exactly zero; the \(\ell_2\) regularizer, in contrast, shrinks all coefficients "smoothly" toward zero but never exactly to zero.
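A minimal sketch of this sparsity effect, using scikit-learn's `Lasso` (note that its `alpha` corresponds to \(\lambda\) up to a \(1/(2n)\) scaling of the squared-error term); the data are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]          # only 3 informative features
y = X @ w_true + 0.1 * rng.normal(size=n)

model = Lasso(alpha=0.1).fit(X, y)
print("nonzero coefficients:", np.flatnonzero(model.coef_))
# Typically recovers features 0, 1, 2 and zeros out the rest.
```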

In Lasso regression, after feature selection with the optimal \(\lambda\), we then "retrain" with \(\lambda = 0\) (i.e., solve ordinary least squares) using only the selected features. Concretely, after running Lasso, form the index set \[ S = \{j : \widehat{w}_{\text{lasso}, j} \neq 0\} \] and then solve \[ \tilde{w} = \arg \min_{w_S} \sum_{i=1}^n (y_i - x_{i, S}^\top w_S)^2, \] where \(x_{i, S}\) is the subvector of features in \(S\), and we set \(\tilde{w}_j = 0\) for \(j \notin S\).
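One sketch of this two-step procedure (Lasso for selection, then ordinary least squares on the support); the data and `alpha` are illustrative assumptions carried over from the previous example:

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(0)
n, d = 100, 20
X = rng.normal(size=(n, d))
w_true = np.zeros(d)
w_true[:3] = [2.0, -1.5, 1.0]
y = X @ w_true + 0.1 * rng.normal(size=n)

# Step 1: Lasso for feature selection.
lasso = Lasso(alpha=0.1).fit(X, y)
S = np.flatnonzero(lasso.coef_)        # S = {j : w_lasso_j != 0}

# Step 2: unregularized least squares on the selected columns only.
w_tilde = np.zeros(d)
if S.size > 0:
    w_tilde[S] = LinearRegression().fit(X[:, S], y).coef_

print("support S:", S)
print("refit coefficients:", np.round(w_tilde[S], 3))
```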

Lasso vs Dimensionality Reduction:

Feature selection via Lasso sounds like dimensionality reduction (e.g., PCA) because both reduce the number of active predictors, but they are fundamentally different.

Suppose we originally have 100 variables and our final model uses 10.

  • In Lasso: the model uses 10 of the 100 original variables. If the true signal is spread across many small effects, dropping features outright can hurt performance.
  • In PCA: the model uses 10 new variables, each a linear combination of all 100 originals. PCA captures directions of maximal variance even if no single original variable is strongly predictive, and it improves numerical stability when features are highly correlated, but we lose direct feature interpretability.

For instance, when there are more features than samples, Lasso can still fit a model by zeroing out most coefficients, whereas PCA followed by regression can overfit or require careful component-selection heuristics.

In practice, you might even combine the two: run PCA down to a moderate number of components and then apply Lasso on those components, if you need both compression and sparsity.
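A sketch of such a combination, using scikit-learn's `Pipeline`; the component count and `alpha` below are illustrative choices, not recommendations:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Lasso
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
n, d = 200, 100
X = rng.normal(size=(n, d))
y = X[:, :5].sum(axis=1) + 0.5 * rng.normal(size=n)

# Compress to 20 principal components, then let Lasso sparsify among them.
model = Pipeline([
    ("pca", PCA(n_components=20)),
    ("lasso", Lasso(alpha=0.05)),
]).fit(X, y)

n_active = np.count_nonzero(model.named_steps["lasso"].coef_)
print(f"active components: {n_active} of 20")
```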