Recap from Linear Algebra
Given a set of observed data points \(\{(x_i, y_i)\}_{i = 1}^{n}\) where \(x_i \in \mathbb{R}^d, \quad y_i \in \mathbb{R}\),
we assume that the given data can be explained by the linear model:
\[y = X\beta + \epsilon \tag{1} \]
where \(X \in \mathbb{R}^{n \times d}\) is the design matrix, \(\beta \in \mathbb{R}^d\) is the parameter (weight) vector, \(y \in \mathbb{R}^n\) is the
observation vector, and \(\epsilon = y - X\beta\) is the residual vector.
The dimension \(d\) is the number of features, and \(n\) is the number of data points.
Each residual \(\epsilon_i\) is the difference between the observed value \(y_i\) and its corresponding predicted value \(x_i^T\beta\).
The least-squares hyperplane is the set of predicted \(y\) values based on the estimated parameters (weights) \(\hat{\beta}\),
and \(\hat{\beta}\) must satisfy the normal equations:
\[
X^TX\hat{\beta} = X^Ty \tag{2}
\]
Note: the linear model is linear in the parameters \(\beta\), not in \(X\). We can apply any non-linear transformation
to each \(x_i\). For example, \(y = \beta_0 + \beta_1 x^2 + \beta_2 \sin(2\pi x)\) is still a linear model. In other words, \(y\) is modeled as a
linear combination of features (predictors) \(X\) with respect to the coefficients (weights) \(\beta\).
Also, check: Least-Squares Problems.
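As a quick illustration of the point above, here is a minimal NumPy sketch (the data, feature choices, and numbers are made up for the example): the design matrix uses the non-linear features \(1, x^2, \sin(2\pi x)\), yet the parameters are still recovered by solving the normal equations (2).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D inputs and a "true" parameter vector (both invented for this sketch).
x = rng.uniform(-1.0, 1.0, size=100)
beta_true = np.array([1.0, -2.0, 0.5])

# Design matrix with non-linear transformations of x: the model stays linear in beta.
X = np.column_stack([np.ones_like(x), x**2, np.sin(2 * np.pi * x)])
y = X @ beta_true + rng.normal(scale=0.1, size=x.shape)

# Solve the normal equations X^T X beta = X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```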
Linear Regression: A Probabilistic Perspective
We consider a probabilistic model for linear regression. We assume that \(y_i\) is the observed value of
the random variable \(Y_i\), which depends on the predictor \(x_i\). In addition, the random errors \(\epsilon_i\) are
i.i.d. random variables with \(\epsilon_i \sim N(0, \sigma^2)\). Then, since \(\mathbb{E}[\epsilon_i]=0\), the unknown
mean of \(Y_i\) can be written as
\[
\mathbb{E}[Y_i] = \mu_i = x_i^T \beta
\]
where \(\beta \in \mathbb{R}^d\) represents the unknown parameters of the regression model.
This relationship defines the true regression line between \(\mathbb{E}[Y_i]\) and \(x_i\). Here, the \(Y_i\) are
independent random variables with \(Y_i \sim N(\mu_i, \sigma^2)\).
In other words, the conditional p.d.f. of \(y_i\) is given by
\[
p(y_i | x_i, \beta, \sigma^2) = \frac{1}{\sigma \sqrt{2\pi}} \exp \Big\{-\frac{1}{2\sigma^2}(y_i - x_i^T \beta)^2 \Big\}
\]
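This is simply a normal density with mean \(x_i^T\beta\) and variance \(\sigma^2\). A minimal sanity-check sketch (the numbers and the use of scipy.stats.norm are my own choices for illustration, not from the text):

```python
import numpy as np
from scipy.stats import norm

x_i = np.array([1.0, 0.3, -0.7])   # one feature vector (made up)
beta = np.array([0.5, 2.0, -1.0])  # some parameter vector (made up)
sigma = 0.4
y_i = 1.2

mu_i = x_i @ beta
# Density written out exactly as in the formula above ...
p_manual = np.exp(-(y_i - mu_i) ** 2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
# ... versus the same normal density via scipy.
p_scipy = norm.pdf(y_i, loc=mu_i, scale=sigma)
print(p_manual, p_scipy)  # identical up to floating-point error
```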
The likelihood function of \(\beta\) for fixed \(\sigma^2\) is then given by
\[
\begin{align*}
L(\beta) &= \prod_{i =1}^n \Big[ \frac{1}{\sigma \sqrt{2\pi}} \exp \Big\{-\frac{1}{2\sigma^2}(y_i - x_i^T \beta)^2 \Big\} \Big] \\\\
&= \Big(\frac{1}{\sigma \sqrt{2\pi}}\Big)^n \exp \Big\{-\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T \beta)^2 \Big\}.
\end{align*}
\]
The log-likelihood function is given by
\[
\ln L(\beta) = n \ln \Big(\frac{1}{\sigma \sqrt{2\pi}}\Big) -\frac{1}{2\sigma^2} \sum_{i=1}^n (y_i - x_i^T \beta)^2
\]
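The log-likelihood is easy to evaluate on data. A minimal sketch, assuming synthetic data and a helper named log_likelihood (both invented for illustration):

```python
import numpy as np

def log_likelihood(beta, X, y, sigma):
    """ln L(beta) for fixed sigma, written exactly as in the formula above."""
    n = len(y)
    resid = y - X @ beta
    return n * np.log(1.0 / (sigma * np.sqrt(2 * np.pi))) - resid @ resid / (2 * sigma**2)

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
beta_true = np.array([2.0, -1.0, 0.5])
sigma = 0.3
y = X @ beta_true + rng.normal(scale=sigma, size=50)

# The log-likelihood is higher at the true parameters than at a perturbed vector.
print(log_likelihood(beta_true, X, y, sigma))
print(log_likelihood(beta_true + 1.0, X, y, sigma))
```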
Setting the derivative with respect to \(\beta\) equal to zero:
\[
\begin{align*}
& \nabla_{\beta} \ln L(\beta) = \frac{1}{\sigma^2} \sum_{i=1}^n (y_i - x_i^T \beta) \cdot x_i = 0 \\\\
&\Longrightarrow \sum_{i=1}^n (x_i^T \beta - y_i) \cdot x_i = 0 \\\\
&\Longrightarrow \Big(\sum_{i=1}^n x_ix_i^T\Big)\beta - \sum_{i=1}^n x_i y_i = 0 \\\\
&\Longrightarrow X^TX \beta = X^Ty
\end{align*}
\]
These are exactly the normal equations (2); thus, the MLE for linear regression coincides with
the least-squares solution (assuming \(X^TX\) is invertible):
\[
\begin{align*}
\hat{\beta}_{MLE} &= \Big(\sum_{i=1}^n x_ix_i^T \Big)^{-1}\Big(\sum_{i=1}^n x_i y_i \Big) \\\\
&=(X^TX)^{-1}X^Ty \\\\
&= \hat{\beta}_{LS}
\end{align*}
\]
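One way to see this equivalence numerically is to minimize the negative log-likelihood with a generic optimizer and compare the result with the closed-form solution. A sketch under made-up data, using scipy.optimize.minimize as one possible optimizer:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
beta_true = np.array([1.5, -0.5, 2.0])
sigma = 0.2
y = X @ beta_true + rng.normal(scale=sigma, size=100)

# Negative log-likelihood as a function of beta (terms constant in beta dropped).
def neg_log_likelihood(beta):
    resid = y - X @ beta
    return resid @ resid / (2 * sigma**2)

beta_mle = minimize(neg_log_likelihood, x0=np.zeros(3)).x
beta_ls = np.linalg.solve(X.T @ X, X.T @ y)
print(np.allclose(beta_mle, beta_ls, atol=1e-4))  # True: numerical MLE matches least squares
```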
Least-Squares Solution
Consider the linear model (1) where \(X \in \mathbb{R}^{n \times d}\), \(\, y \in \mathbb{R}^n\),
\(\, \beta \in \mathbb{R}^d\).
To obtain the least-squares solution \(\hat{\beta}_{LS}\), we minimize the
least-squares error:
\[
\begin{align*}
\hat{\beta}_{LS} &= \arg \min_{\beta} \| y - X \beta \|_{2}^2 \\\\
&= \arg \min_{\beta} (y - X \beta)^T (y - X \beta) \\\\
&= \arg \min_{\beta} y^T y - y^T X \beta - \beta^T X^T y + \beta^T X^T X \beta \\\\
&= \arg \min_{\beta} f(\beta)
\end{align*}
\]
Differentiating \(f(\beta)\) and setting the gradient equal to zero:
\[
\begin{align*}
& \nabla_{\beta} f(\beta) = - X^Ty - X^Ty + 2X^TX\beta \\\\
&\Longrightarrow -2 X^Ty + 2X^TX\beta = 0 \\\\
&\Longrightarrow \hat{\beta}_{LS} = (X^TX)^{-1}X^Ty \\\\
\end{align*}
\]
Note: The matrix \(X^TX\) is symmetric and positive semi-definite; it is invertible whenever \(X\) has full column rank.
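In code, the closed-form expression and a dedicated least-squares solver give the same answer; in practice, forming \((X^TX)^{-1}\) explicitly is usually avoided for numerical reasons. A minimal sketch with made-up data, using np.linalg.lstsq as one possible alternative:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 4))
beta_true = np.array([0.5, -1.0, 2.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.1, size=200)

# Textbook formula: beta_hat = (X^T X)^{-1} X^T y.
beta_inv = np.linalg.inv(X.T @ X) @ X.T @ y
# Preferred in practice: a least-squares solver (no explicit inverse).
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_inv, beta_lstsq))  # True
```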
If you are not familiar with matrix calculus, see:
Matrix Calculus.