Linear Regression

Recap from Linear Algebra

Given a set of observed data points \(\{(x_i, y_i)\}_{i = 1}^{n}\) where \(x_i \in \mathbb{R}^d, \quad y_i \in \mathbb{R}\), we assume that the data can be explained by the linear model: \[y = X\beta + \epsilon \tag{1} \] where \(X \in \mathbb{R}^{n \times d}\) is the design matrix, \(\beta \in \mathbb{R}^d\) is the parameter (weight) vector, \(y \in \mathbb{R}^n\) is the observation vector, and \(\epsilon = y - X\beta\) is the residual vector.

The dimension \(d\) is the number of features, and \(n\) is the number of data points. Each residual \(\epsilon_i\) is the difference between the observed value \(y_i\) and its corresponding predicted value. The least-squares hyperplane is the set of predicted values \(X\hat{\beta}\) based on the estimated parameters (weights) \(\hat{\beta}\), which must satisfy the normal equations: \[ X^TX\hat{\beta} = X^Ty \tag{2} \]
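
As a quick numerical sanity check (a minimal sketch with synthetic data, not part of the original derivation; the NumPy calls and parameter values are my own choices for illustration), the normal equations (2) can be solved directly, or the least-squares problem can be handed to a solver:

```python
import numpy as np

# Minimal sketch: solve the normal equations X^T X beta = X^T y
# for a small synthetic data set (values chosen only for illustration).
rng = np.random.default_rng(0)

n, d = 50, 3                      # n data points, d features
X = rng.normal(size=(n, d))       # design matrix
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)   # y = X beta + noise

# Solve X^T X beta = X^T y directly ...
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# ... or let lstsq handle the least-squares problem (numerically more stable)
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(beta_normal)
print(beta_lstsq)   # both should be close to beta_true
```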

Note: the linear model is linear in the parameters \(\beta\), not in \(X\). We can apply any non-linear transformation to each \(x_i\). For example, \(y = \beta_0 + \beta_1 x^2 + \beta_2 \sin(2\pi x)\) is still a linear model because it is linear in the coefficients. So \(y\) is modeled as a linear combination of features (predictors) \(X\) with respect to the coefficients (weights) \(\beta\), as the sketch below illustrates.
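
The following sketch (synthetic data, assumed coefficient values) shows the point above in code: the model \(y = \beta_0 + \beta_1 x^2 + \beta_2 \sin(2\pi x)\) is fit by building a design matrix whose columns are the non-linear transformations of \(x\), while the fit itself remains an ordinary linear least-squares problem.

```python
import numpy as np

# Sketch: non-linear features, but a model that is linear in the parameters.
rng = np.random.default_rng(1)

x = rng.uniform(0, 1, size=100)
y = 1.0 + 3.0 * x**2 + 0.5 * np.sin(2 * np.pi * x) + rng.normal(scale=0.05, size=100)

# Columns: intercept, x^2, sin(2*pi*x) -- the model stays linear in beta.
X = np.column_stack([np.ones_like(x), x**2, np.sin(2 * np.pi * x)])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # approximately [1.0, 3.0, 0.5]
```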

Also, check: Least-Squares Problems.

Linear Regression: A Probabilistic Perspective

We consider a probabilistic model for linear regression. We assume that \(y_i\) is the observed value of the random variable \(Y_i\), which depends on the predictor \(x_i\). In addition, the random errors \(\epsilon_i\) are i.i.d. random variables with \(\epsilon_i \sim N(0, \sigma^2)\). Then, since \(\mathbb{E}[\epsilon_i]=0\), the unknown mean of \(Y_i\) can be represented as \[ \mathbb{E}[Y_i] = \mu_i = x_i^T \beta \] where \(\beta \in \mathbb{R}^d\) represents the unknown parameters of the regression model.
This relationship defines the true regression line between \(\mathbb{E}[Y_i]\) and \(x_i\). Here, the \(Y_i\) are independent random variables with \(Y_i \sim N(\mu_i, \sigma^2)\).
In other words, the conditional p.d.f. of \(y_i\) is given by \[ p(y_i | x_i, \beta, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp \Big\{-\frac{1}{2\sigma^2}(y_i - x_i^T \beta)^2 \Big\} \] and the likelihood function of \(\beta\) (for fixed \(\sigma^2\)) is given by \[ \begin{align*} L(\beta) &= \prod_{i =1}^n \Big[ \frac{1}{\sigma \sqrt{2\pi}} \exp \Big\{-\frac{1}{2\sigma^2}(y_i - x_i^T \beta)^2 \Big\} \Big] \\\\ &= \Big(\frac{1}{\sigma \sqrt{2\pi}}\Big)^n \exp \Big\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T \beta)^2 \Big\}. \end{align*} \] The log-likelihood function is given by \[ \ln L(\beta) = n \ln \Big(\frac{1}{\sigma \sqrt{2\pi}}\Big) -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T \beta)^2. \] Setting the gradient with respect to \(\beta\) equal to zero: \[ \begin{align*} & \nabla_{\beta} \ln L(\beta) = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - x_i^T \beta) \cdot x_i = 0 \\\\ &\Longrightarrow \sum_{i=1}^n (x_i^T \beta -y_i) \cdot x_i = 0 \\\\ &\Longrightarrow \Big(\sum_{i=1}^n x_ix_i^T\Big)\beta - \sum_{i=1}^n x_i y_i = 0 \\\\ &\Longrightarrow X^TX \beta = X^Ty \end{align*} \] This is equivalent to the normal equations (2), and thus the MLE for linear regression coincides with the least-squares solution: \[ \begin{align*} \hat{\beta}_{MLE} &= \Big(\sum_{i=1}^n x_ix_i^T \Big)^{-1}\Big(\sum_{i=1}^n x_i y_i \Big) \\\\ &=(X^TX)^{-1}X^Ty \\\\ &= \hat{\beta}_{LS} \end{align*} \]
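
The equivalence can also be checked numerically. The sketch below (an illustrative assumption, not part of the original text; it assumes SciPy is available) maximizes the Gaussian log-likelihood in \(\beta\) by minimizing its negative and compares the result with the closed-form \((X^TX)^{-1}X^Ty\).

```python
import numpy as np
from scipy.optimize import minimize

# Sketch: numerically maximizing the Gaussian likelihood in beta
# recovers the least-squares closed-form solution.
rng = np.random.default_rng(2)

n, d, sigma = 200, 3, 0.3
X = rng.normal(size=(n, d))
beta_true = np.array([1.5, -2.0, 0.7])
y = X @ beta_true + rng.normal(scale=sigma, size=n)

def neg_log_likelihood(beta):
    # Up to additive constants: -ln L(beta) = (1 / (2 sigma^2)) * sum_i (y_i - x_i^T beta)^2
    r = y - X @ beta
    return 0.5 / sigma**2 * (r @ r)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)

print(np.allclose(beta_mle, beta_ls, atol=1e-4))   # True (up to optimizer tolerance)
```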

Least-Squares Solution Consider the linear model (1) where \(X \in \mathbb{R}^{n \times d}\), \(\, y \in \mathbb{R}^n\), \(\, \beta \in \mathbb{R}^d\).
To obtain the least-squares solution \(\hat{\beta}_{LS}\), we minimize the least-squares error: \[ \begin{align*} \hat{\beta}_{LS} &= \arg \min_{\beta} \| y - X \beta \|_{2}^2 \\\\ &= \arg \min_{\beta} (y - X \beta)^T (y - X \beta) \\\\ &= \arg \min_{\beta} y^T y - y^T X \beta - \beta^T X^T y + \beta^T X^T X \beta \\\\ &= \arg \min_{\beta} f(\beta) \end{align*} \] Differentiating \(f(\beta)\) and setting the gradient equal to zero: \[ \begin{align*} & \nabla_{\beta} f(\beta) = - X^Ty - X^Ty + 2X^TX\beta \\\\ &\Longrightarrow -2 X^Ty + 2X^TX\beta = 0 \\\\ &\Longrightarrow \hat{\beta}_{LS} = (X^TX)^{-1}X^Ty, \\\\ \end{align*} \] assuming \(X^TX\) is invertible, i.e., \(X\) has full column rank. Note: the gradient of \(\beta^T X^T X \beta\) is \(2X^TX\beta\) because the matrix \(X^TX\) is symmetric.
If you are not familiar with matrix calculus, see: Matrix Calculus.
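
As a small check of this derivation (an illustrative sketch with made-up data, not part of the original text), the gradient \(-2X^Ty + 2X^TX\beta\) vanishes exactly at the closed-form solution:

```python
import numpy as np

# Sketch: at beta_hat = (X^T X)^{-1} X^T y, the gradient of f(beta) is zero,
# i.e. the normal equations hold.
rng = np.random.default_rng(3)

n, d = 80, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # assumes X^T X is invertible
grad_at_hat = -2 * X.T @ y + 2 * X.T @ X @ beta_hat

print(np.allclose(grad_at_hat, 0))   # True: the gradient vanishes at beta_hat
```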

Interactive Statistical Regression Tool

This interactive tool allows you to explore linear regression from a statistical perspective. You can upload your own data or use the provided sample datasets to analyze regression models, check statistical significance, and perform diagnostics.
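
For readers who prefer to script the same kind of analysis, the sketch below is one possible way to reproduce it in Python with statsmodels (this is my own illustrative example with synthetic data, not the implementation behind the tool):

```python
import numpy as np
import statsmodels.api as sm

# Sketch: an ordinary least-squares fit with coefficient estimates,
# standard errors, t-statistics, p-values, and R^2 in one summary table.
# The synthetic data here stands in for an uploaded dataset.
rng = np.random.default_rng(4)

x = rng.normal(size=(100, 2))
y = 1.0 + 2.0 * x[:, 0] - 0.5 * x[:, 1] + rng.normal(scale=0.3, size=100)

X = sm.add_constant(x)            # add an intercept column
model = sm.OLS(y, X).fit()        # ordinary least-squares fit

print(model.summary())            # significance checks and basic diagnostics
```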