In machine learning, parameter estimation, also known as model fitting, requires solving an
optimization problem:
\[
\theta^* \in \arg \min_{\theta \in \Theta} \mathcal{L}(\theta), \qquad \Theta \subset \mathbb{R}^d
\]
where \(\mathcal{L}(\theta)\) is a loss function (also called an objective function), \(\Theta\) is the parameter space, and
\(d\) is the number of variables being optimized over.
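For example, in least-squares linear regression with a dataset \(\{(x_i, y_i)\}_{i=1}^N\), the problem takes the form
\[
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} (x_i^T \theta - y_i)^2, \qquad \Theta = \mathbb{R}^d.
\]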
Since finding a global optimum is often too expensive or even impossible in practice, our target will be a local optimum.
A point \(\theta^*\) is a local minimum if:
\[
\exists \, \delta > 0, \, \forall \, \theta \in \Theta \text{ s.t. } \| \theta - \theta^* \| < \delta, \, \mathcal{L}(\theta^*) \leq \mathcal{L}(\theta).
\]
For a twice continuously differentiable function \(\mathcal{L}(\theta)\), to confirm that \(\theta^*\) is a local
minimum, the following two conditions must be satisfied: the first-order condition \(\nabla \mathcal{L}(\theta^*) = 0\),
and the second-order condition that the Hessian \(\nabla^2 \mathcal{L}(\theta^*)\) is positive semidefinite
(positive definiteness is sufficient for a strict local minimum).
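As a quick check of these conditions, consider the one-dimensional objective \(\mathcal{L}(\theta) = (\theta - 1)^2\):
\[
\nabla \mathcal{L}(\theta) = 2(\theta - 1) = 0 \;\Longrightarrow\; \theta^* = 1, \qquad \nabla^2 \mathcal{L}(\theta^*) = 2 > 0,
\]
so \(\theta^* = 1\) is a strict local minimum (and here, in fact, the global one).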
Usually we design models so that their training objectives are convex because, in a convex optimization problem,
any local minimum is in fact a global minimum.
A set \(\mathcal{S}\) is convex if for any \(x, x' \in \mathcal{S}\),
\[
\lambda x + (1 - \lambda) x' \in \mathcal{S}, \, \forall \lambda \in [0, 1].
\]
A function \(f\) is said to be a convex function if it is defined on a convex set \(\mathcal{S}\) and if for any \(x, y \in \mathcal{S}\) and for any
\(0 \leq \lambda \leq 1\),
\[
f(\lambda x + (1 - \lambda)y) \leq \lambda f(x) + (1 - \lambda) f(y).
\]
Theorem 1:
Let \(f: \mathbb{R} \to \mathbb{R}\) be a twice differentiable function.
Then \(f\) is a convex function if and only if \(f''(x) \geq 0\) for all \(x \in \mathbb{R}\).
Proof:
First, suppose that \(f: \mathbb{R} \to \mathbb{R}\) is twice differentiable with
\(f''(x) \geq 0\) for all \(x \in \mathbb{R}\).
By Taylor's theorem,
\(\forall z, w \in \mathbb{R}\), there exists \(c\) between \(z\) and \(w\) such that
\[
f(w) = f(z) + f'(z)(w -z) + f''(c) \frac{(w-z)^2}{2}.
\]
Since \(f''(c) \geq 0\), this implies
\[
f(w) \geq f(z) + f'(z)(w-z). \tag{1}
\]
Here, consider two points \(a, b \in \mathbb{R}\), and let \(z = \lambda a + (1-\lambda)b\) for \(\lambda \in [0, 1]\). Applying inequality (1)
at each point, we have
\[
\begin{align*}
&f(a) \geq f(z) + f'(z)(a-z) \\\\
&\Longrightarrow \lambda f(a) \geq \lambda f(z) + \lambda f'(z)(a-z) \tag{2}
\end{align*}
\]
and
\[
\begin{align*}
&f(b) \geq f(z) + f'(z)(b-z) \\\\
&\Longrightarrow (1-\lambda )f(b) \geq (1- \lambda )f(z) + (1-\lambda )f'(z)(b-z) \tag{3}
\end{align*}
\]
By (2) + (3), we obtain:
\[
\begin{align*}
\lambda f(a) + (1-\lambda )f(b) &\geq \lambda f(z) + \lambda f'(z)(a-z) + (1-\lambda )f(z) + (1-\lambda )f'(z)(b-z) \\\\
&= f(z) + f'(z)\bigl(\lambda a + (1-\lambda )b - z\bigr) \\\\
&= f(z) \qquad \text{(since } z = \lambda a + (1-\lambda)b \text{)}
\end{align*}
\]
Thus,
\[
\lambda f(a) + (1-\lambda )f(b) \geq f(\lambda a + (1-\lambda)b).
\]
By the definition of a convex function, \(f\) is convex.
Conversely, assume that \(f\) is convex. Since \(f\) is differentiable, its graph lies above each of its tangent lines
(the first-order characterization of convexity, obtained by rearranging the definition and letting \(\lambda \to 0^+\)).
Hence, for \(a < b\) and all \(x\),
\[
f(x) \geq f(a) + f'(a)(x - a).
\]
Substituting \(x = b\),
\[
\begin{align*}
&f(b) \geq f(a) + f'(a)(b -a) \\\\
&\Longrightarrow f'(a) \leq \frac{f(b) - f(a)}{b - a} \tag{4}
\end{align*}
\]
and similarly,
\[
f(x) \geq f(b) + f'(b)(x -b)
\]
Substituting \(x = a\),
\[
\begin{align*}
&f(a) \geq f(b) + f'(b)(a -b) \\\\
&\Longrightarrow f'(b) \geq \frac{f(b) - f(a)}{b - a} \tag{5}
\end{align*}
\]
(Note: since \(a < b\), we have \(a - b < 0\), so dividing by \(a - b\) reverses the inequality.)
From (4) and (5), we get
\[
f'(a) \leq \frac{f(b) - f(a)}{b - a} \leq f'(b).
\]
Thus, \(f'\) is monotonically increasing, and therefore \(f''(x) \geq 0\) for all \(x\).
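As an immediate application of Theorem 1, \(f(x) = e^x\) is convex, since
\[
f''(x) = e^x > 0 \quad \forall x \in \mathbb{R}.
\]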
You can see convex functions everywhere in machine learning. For example,
the cross-entropy loss in classification is convex (in the parameters of a linear model such as logistic regression), and the
ReLU (Rectified Linear Unit), a popular activation function in neural networks, is a convex function of its input.
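As a quick numerical illustration (a sanity check, not a proof), we can test the convexity inequality for the ReLU at many randomly sampled points:

import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Convexity requires f(lam*x + (1-lam)*y) <= lam*f(x) + (1-lam)*f(y)
# for all x, y and lam in [0, 1]; we spot-check this at random points.
rng = np.random.default_rng(0)
x = rng.uniform(-10, 10, size=100000)
y = rng.uniform(-10, 10, size=100000)
lam = rng.uniform(0, 1, size=100000)

lhs = relu(lam * x + (1 - lam) * y)
rhs = lam * relu(x) + (1 - lam) * relu(y)
print("Inequality holds at all sampled points:", np.all(lhs <= rhs + 1e-12))  # True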
For the \(n\)-dimensional case, the following theorem is fundamental in optimization.
Theorem 2:
Suppose \(f: \mathbb{R}^n \to \mathbb{R}\) is \(C^2\). Then \(f\) is convex if and only if the Hessian matrix
\(H = \nabla^2 f(x)\) is positive semidefinite for all \(x \in dom(f)\). Furthermore, \(f\) is strictly convex if \(H\)
is positive definite.
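To see Theorem 2 in action numerically (a minimal sketch; the quadratic below is an arbitrary illustrative example), we can test positive semidefiniteness of a Hessian via its eigenvalues:

import numpy as np

# f(x) = 0.5 * x^T A x has constant Hessian A, so f is convex iff A is PSD.
rng = np.random.default_rng(0)
M = rng.standard_normal((3, 3))
A = M.T @ M  # a Gram matrix is symmetric positive semidefinite by construction

eigenvalues = np.linalg.eigvalsh(A)  # eigenvalues of the (symmetric) Hessian
print("Hessian eigenvalues:", eigenvalues)
print("Positive semidefinite, hence f is convex:", np.all(eigenvalues >= -1e-10))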
Proof:
For any \(a, b \in \mathbb{R}^n\), we define \(h: [0, 1] \to \mathbb{R}\)
as follows:
\[
h(\lambda) = f(\lambda a + (1-\lambda)b).
\]
For all \(z, w, p \in [0, 1]\), let \(\lambda = pz + (1-p)w\). Noting that
\(\lambda a + (1-\lambda)b = p\bigl(za + (1-z)b\bigr) + (1-p)\bigl(wa + (1-w)b\bigr)\), the convexity of \(f\) gives
\[
\begin{align*}
h(\lambda) &= f\bigl(p(za + (1-z)b) + (1-p)(wa + (1-w)b)\bigr) \\\\
&\leq p f(za + (1-z)b) + (1-p) f(wa + (1-w)b) \\\\
&= ph(z) + (1-p)h(w).
\end{align*}
\]
Thus, \(h\) is a convex function.
Rewriting \(h\), we have
\[
h(\lambda) = f(b + \lambda(a-b)).
\]
Differentiating twice with respect to \(\lambda\), we obtain:
\[
\begin{align*}
&\frac{dh}{d\lambda} = (\nabla f(b + \lambda (a - b)))^T (a - b) \\\\
&\frac{d^2h}{d\lambda^2} = (a - b)^T (\nabla^2 f(b + \lambda(a - b))) (a - b)
\end{align*}
\]
Here, \(\nabla^2 f(x)\) is the Hessian matrix of \(f\), and \(\frac{d^2h}{d\lambda^2}\) is a quadratic form in the direction \(a - b\).
For \(h\) to be a convex function, we must have:
\[
\frac{d^2h}{d\lambda^2} \geq 0 \quad \forall \lambda \in [0, 1].
\]
Since \(a\) and \(b\) are arbitrary, this implies that the Hessian matrix is positive semidefinite for all \(x\).
The converse direction follows by reversing these steps; for the sake of space, we omit the full proof.
The relationship between the Hessian matrix and convexity is fundamental to many optimization algorithms and
is widely applied in both theoretical and practical optimization problems.
Gradient Descent
To find a local minimum of an objective function, we update the current point by moving in the direction of
the negative gradient, \(- \nabla f\). This is because the gradient \(\nabla f\) points in
the direction of steepest increase of the function \(f\); by contrast, the negative gradient points in the
direction of steepest descent of \(f\). Thus, iteratively moving in the direction
of \(- \nabla f\) reduces the value of \(f\), and under suitable conditions the iterates converge to a local minimum.
Algorithm 1: GRADIENT_DESCENT
Input: objective function \(f\), tolerance \(\epsilon\), learning rate \(\eta\);
Output: stationary point \(\theta^*\);
begin
\(k \leftarrow 0\);
Choose an initial point \(\theta^{(0)}\);
repeat:
Compute gradient: \(d^{(k)} = - \nabla f(\theta^{(k)});\)
Update parameters: \(\theta^{(k+1)} = \theta^{(k)} + \eta d^{(k)};\)
\( k \leftarrow k + 1 ;\)
until \(\| \nabla f(\theta^{(k)}) \| < \epsilon\);
Output \(\theta^{(k)}\);
end
Note: Gradient descent is called a first-order method since it requires only the gradient (no second-order information).
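A minimal NumPy sketch of Algorithm 1 follows; the quadratic objective at the bottom is an arbitrary example chosen only for illustration:

import numpy as np

def gradient_descent(grad_f, theta0, eta=0.1, eps=1e-6, max_iter=10000):
    # Repeat theta <- theta - eta * grad_f(theta) until the gradient is small.
    theta = theta0.astype(float)
    for _ in range(max_iter):
        g = grad_f(theta)
        if np.linalg.norm(g) < eps:  # stopping criterion from Algorithm 1
            break
        theta = theta - eta * g
    return theta

# Example: f(theta) = ||theta - c||^2 has gradient 2 * (theta - c),
# so the iterates should converge to the minimizer c.
c = np.array([1.0, -2.0, 3.0])
theta_star = gradient_descent(lambda th: 2 * (th - c), theta0=np.zeros(3))
print(theta_star)  # approximately [1, -2, 3]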
Stochastic Gradient Descent
In stochastic optimization, we minimize the expected value of an objective function:
\[
\mathcal{L}(\theta) = \mathbb{E}_{q(z)}[\mathcal{L}(\theta, z)]
\]
where \(z\) is a random input to the objective function; \(z\) can be a training example or just a random noise term.
At each iteration, we update:
\[
\theta^{(k+1)} = \theta^{(k)} - \eta^{(k)} \nabla \mathcal{L}(\theta^{(k)}, z^{(k)}).
\]
This method is called stochastic gradient descent (SGD).
Note: We assume the distribution \(q(z)\) is independent of the parameters we want to optimize.
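A single SGD iteration, sketched in NumPy (here \(q(z)\) is uniform over training examples, and grad_l is a hypothetical per-example gradient function supplied by the caller):

import numpy as np

def sgd_step(theta, X, y, grad_l, eta):
    # Draw z^{(k)}: one training example sampled uniformly at random
    i = np.random.randint(X.shape[0])
    # Step along the negative per-example gradient
    return theta - eta * grad_l(theta, X[i], y[i])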
With plain GD, a critical issue is that each iteration requires computing the gradient with respect to
all data points, which becomes too expensive when the dataset is huge; single-sample SGD avoids this but produces noisy updates.
As a compromise, we introduce mini-batches. A typical mini-batch size is \(B = 32, 64, 128, \cdots\),
depending on the data size \(N\), and we compute an "approximate" gradient using only a small mini-batch (a subset of the data). Intuitively,
we quickly find a "good enough" direction to improve the objective.
Algorithm 2: MINI_BATCH_SGD
Input: dataset \(X\), objective function \(f\), tolerance \(\epsilon\), batch size \(B\), learning rate \(\eta\), max_epoch;
Output: stationary point \(\theta^*\);
begin
Set \(k \leftarrow 0\);
Choose an initial point \(\theta^{(0)}\);
repeat:
Shuffle the dataset \(X\) randomly;
Divide \(X\) into mini-batches \(\{B_1, B_2, \cdots, B_m\}\) each of size \(B\);
for each mini-batch \(B_i\):
Compute gradient: \(d^{(k)} = - \frac{1}{|B_i|} \sum_{x \in B_i} \nabla f(x; \theta^{(k)});\)
Update parameters: \(\theta^{(k+1)} = \theta^{(k)} + \eta d^{(k)};\)
end for
Set \( k \leftarrow k + 1;\)
until \(\| d^{(k)} \| < \epsilon\) or \(k \geq \) max_epoch;
Output \(\theta^{(k)}\);
end
Note: After each training epoch, we reshuffle the dataset to generate different mini-batches.
Sample code for mini-batch SGD applied to linear regression follows:
import numpy as np

# Fixed parameters for mini-batch SGD
MAX_EPOCH = 10000
BATCH_SIZE = 64
TOLERANCE = 1e-3
LEARNING_RATE = 0.01

# Sample data dimensions
N_SAMPLES = 10000
N_FEATURES = 3

# Mini-batch Stochastic Gradient Descent
def mini_batch_sgd(X, y):
    n_samples = X.shape[0]
    theta = np.random.randn(X.shape[1]) * 0.01  # initial theta
    for i in range(MAX_EPOCH):
        # Shuffle the data to generate different mini-batches each epoch
        indices = np.random.permutation(n_samples)
        x_shuffled = X[indices]
        y_shuffled = y[indices]
        for j in range(0, n_samples, BATCH_SIZE):
            end = j + BATCH_SIZE
            x_batch = x_shuffled[j:end]
            y_batch = y_shuffled[j:end]
            # Approximate gradient from the current mini-batch
            g = grad_f(theta, x_batch, y_batch)
            # Update parameters
            theta -= LEARNING_RATE * g
        # Check convergence using the last mini-batch gradient of the epoch
        if np.linalg.norm(g) < TOLERANCE:
            print(f"Converged in {i + 1} epochs.")
            break
    return theta

# Gradient of the objective for linear regression
# (gradient of the Mean Squared Error (MSE) over a mini-batch)
def grad_f(theta, x_batch, y_batch):
    grad = (1 / x_batch.shape[0]) * x_batch.T @ ((x_batch @ theta) - y_batch)
    return grad

# Main
if __name__ == "__main__":
    # Generate sample data and objective
    X = np.random.rand(N_SAMPLES, N_FEATURES)
    true_theta = np.array([5, -1, -9])
    y = X @ true_theta + np.random.randn(X.shape[0]) * 0.1  # Add noise

    # Run Mini-Batch SGD
    estimated_theta = mini_batch_sgd(X, y)

    # Calculate relative error
    relative_error = np.linalg.norm(estimated_theta - true_theta) / np.linalg.norm(true_theta)
    print("True Parameters: ", true_theta)
    print("Estimated Parameters: ", estimated_theta)
    print(f"Relative Error: {relative_error:.6f}")
Sub-gradient Descent
Even if an objective function is non-smooth (not differentiable everywhere), we can still use a generalization of the gradient.
In particular, for a convex function \(f: \mathbb{R}^n \to \mathbb{R}\), a vector \(g \in \mathbb{R}^n\) is called a subgradient of \(f\) at
\(x \in dom(f)\) if
\[
f(z) \geq f(x) + g^T (z - x) \tag{6}
\]
for all \(z \in dom(f)\).
A function \(f\) is subdifferentiable at \(x\) if there exists at least one subgradient \(g \in \mathbb{R}^n\) at
\(x\); the set of all subgradients at \(x\) is called the subdifferential \(\partial f(x)\).
For example, \(f(x) = |x|\) is not differentiable at \(x = 0\), but it is subdifferentiable everywhere:
\[
\partial f(x) = \begin{cases}
\{-1\} &\text{ if } x < 0 \\
[-1, 1] &\text{ if } x = 0 \\
\{1\} &\text{ if } x > 0
\end{cases}.
\]
In sub-gradient descent, we simply replace the gradient with any subgradient satisfying inequality (6). However, it is typically slower
than GD on a smooth convex function, because the subgradient norm need not shrink as we approach the global minimum.
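A minimal sketch of sub-gradient descent on \(f(x) = |x|\), using the subdifferential above with a decaying step size \(\eta^{(k)} = \eta / \sqrt{k+1}\) (a common choice, since a fixed step size need not converge for non-smooth objectives):

import numpy as np

def subgradient_abs(x):
    # np.sign(x) is a valid subgradient of |x|: it equals the derivative
    # for x != 0, and returns 0 at x = 0, which lies in [-1, 1].
    return np.sign(x)

x = 5.0
eta = 1.0
for k in range(1000):
    g = subgradient_abs(x)
    x -= (eta / np.sqrt(k + 1)) * g  # decaying step size
print(x)  # close to the minimizer x* = 0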