Gradient Descent: First-order Optimization Techniques

Contents: Introduction to Optimization, Convexity, Gradient Descent, Stochastic Gradient Descent, Sub-gradient Descent

Introduction to Optimization

In machine learning, parameter estimation, also known as model fitting, requires solving an optimization problem: \[ \theta^* \in \arg \min_{\theta \in \Theta} \mathcal{L}(\theta), \qquad \Theta \subset \mathbb{R}^d \] where \(\mathcal{L}(\theta)\) is the loss function (objective function), \(\Theta\) is the parameter space, and \(d\) is the number of variables being optimized over.

Since finding a global optimum is often too expensive or even impossible in practice, our target will be a local optimum.
A point \(\theta^*\) is a local minimum if:
\[ \exists \, \delta > 0, \, \forall \, \theta \in \Theta \text{ s.t. } \| \theta - \theta^* \| < \delta, \, \mathcal{L}(\theta^*) \leq \mathcal{L}(\theta). \] For a twice continuously differentiable function \(\mathcal{L}(\theta)\), the following two conditions are sufficient to confirm that \(\theta^*\) is a local minimum:

  1. The gradient vector is equal to zero: \[ g(\theta^*) = \nabla \mathcal{L}(\theta^*) = 0 \]
  2. The Hessian matrix is positive definite: \[ H(\theta^*) = \nabla^2 \mathcal{L}(\theta^*) \succ 0 \]
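For example, consider \(\mathcal{L}(\theta) = \theta_1^2 + \theta_2^2\). Then \[ \nabla \mathcal{L}(\theta) = (2\theta_1, 2\theta_2)^T = 0 \iff \theta^* = (0, 0)^T, \qquad H(\theta^*) = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix} \succ 0, \] so \(\theta^* = (0, 0)^T\) satisfies both conditions and is a local (in fact global) minimum.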

Convexity

Usually we design models so that their training objectives are convex, because in a convex optimization problem any local minimum is also a global minimum.

A set \(\mathcal{S}\) is convex if for any \(x, x' \in \mathcal{S}\), \[ \lambda x + (1 - \lambda) x' \in \mathcal{S}, \, \forall \lambda \in [0, 1]. \] A function \(f(x)\) is said to be a convex function if it is defined on a convex set \(\mathcal{S}\) and if for any \(x, y \in \mathcal{S}\) and any \(0 \leq \lambda \leq 1\), \[ f(\lambda x + (1 - \lambda)y) \leq \lambda f(x) + (1 - \lambda) f(y). \]
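For example, \(f(x) = |x|\) is convex on \(\mathbb{R}\): for any \(x, y \in \mathbb{R}\) and \(\lambda \in [0, 1]\), the triangle inequality gives \[ |\lambda x + (1-\lambda)y| \leq \lambda |x| + (1-\lambda)|y|. \]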

Theorem 1: Let \(f: \mathbb{R} \to \mathbb{R}\) be a twice differentiable function. Then \(f\) is a convex function if and only if \(f''(x) \geq 0\) for all \(x \in \mathbb{R}\).
Proof: First, suppose that \(f: \mathbb{R} \to \mathbb{R}\) is twice differentiable and \(f''(x) \geq 0\) for all \(x \in \mathbb{R}\).
By Taylor's theorem, \(\forall z, w \in \mathbb{R}, \, \exists c\) between \(z\) and \(w\) such that \[ f(w) = f(z) + f'(z)(w -z) + f''(c) \frac{(w-z)^2}{2}. \] Since \(f''(c) \geq 0\), this implies \[ f(w) \geq f(z) + f'(z)(w-z). \tag{1} \] Now consider two points \(a, b \in \mathbb{R}\), and let \(z = \lambda a + (1-\lambda)b, \, \lambda \in [0, 1]\). By inequality (1), for each point we have \[ \begin{align*} &f(a) \geq f(z) + f'(z)(a-z) \\\\ &\Longrightarrow \lambda f(a) \geq \lambda f(z) + \lambda f'(z)(a-z) \tag{2} \end{align*} \] and \[ \begin{align*} &f(b) \geq f(z) + f'(z)(b-z) \\\\ &\Longrightarrow (1-\lambda )f(b) \geq (1- \lambda )f(z) + (1-\lambda )f'(z)(b-z) \tag{3} \end{align*} \] Adding (2) and (3), we obtain: \[ \begin{align*} \lambda f(a) + (1-\lambda )f(b) &\geq \lambda f(z) + \lambda f'(z)(a-z) + (1-\lambda )f(z) + (1-\lambda )f'(z)(b-z) \\\\ &= f(z) + f'(z)\big(\lambda (a-z) + (1-\lambda )(b-z)\big) \\\\ &= f(z) + f'(z)\big(\lambda a + (1-\lambda ) b - z\big) \\\\ &= f(z). \end{align*} \] Thus, \[ \lambda f(a) + (1-\lambda )f(b) \geq f(\lambda a + (1-\lambda)b). \] By the definition of a convex function, \(f\) is convex.

Next, assume that \(f\) is convex. Since \(f\) is differentiable and convex, its graph lies above each of its tangent lines (the first-order condition), so for any point \(a\) and all \(x\), \[ f(x) \geq f(a) + f'(a)(x -a). \] Let \(a < b\). Substituting \(x = b\), \[ \begin{align*} &f(b) \geq f(a) + f'(a)(b -a) \\\\ &\Longrightarrow f'(a) \leq \frac{f(b) - f(a)}{b - a} \tag{4} \end{align*} \] and similarly, taking the tangent at \(b\), \[ f(x) \geq f(b) + f'(b)(x -b). \] Substituting \(x = a\), \[ \begin{align*} &f(a) \geq f(b) + f'(b)(a -b) \\\\ &\Longrightarrow f'(b) \geq \frac{f(b) - f(a)}{b - a} \tag{5} \end{align*} \] (Note: since \(a < b\), we have \(a - b < 0\), so dividing by it reverses the inequality.)
From (4) and (5), we get \[ f'(a) \leq \frac{f(b) - f(a)}{b - a} \leq f'(b). \] Thus, \(f'\) is monotonically increasing and therefore, \(f'' \geq 0\).

Convex functions appear everywhere in machine learning. For example, the cross-entropy loss of logistic regression is convex in the model parameters, and the ReLU (Rectified Linear Unit), a popular activation function in neural networks, is a convex function of its input.

For the \(n\)-dimensional case, the following theorem is a fundamental result in optimization.
Theorem 2: Suppose \(f: \mathbb{R}^n \to \mathbb{R}\) is \(C^2\). Then \(f\) is convex if and only if the Hessian matrix \(H = \nabla^2 f(x)\) is positive semidefinite for all \(x \in dom(f)\). Furthermore, \(f\) is strictly convex if \(H\) is positive definite.
Proof: We prove the forward direction (convexity implies a positive semidefinite Hessian). For any \(a, b \in \mathbb{R}^n\), define \(h: [0, 1] \to \mathbb{R}\) as follows: \[ h(\lambda) = f(\lambda a + (1-\lambda)b). \] For all \(z, w, p \in [0, 1]\), let \(\lambda = pz + (1-p)w\). Then, by the convexity of \(f\), \[ \begin{align*} h(\lambda) &= f((pz + (1-p)w) a + (1-(pz + (1-p)w))b) \\\\ &\leq ph(z) + (1-p)h(w). \end{align*} \] Thus, \(h\) is a convex function.
Rewriting \(h\), we have \[ h(\lambda) = f(b + \lambda(a-b)). \] Differentiating twice with respect to \(\lambda\), we obtain: \[ \begin{align*} &\frac{dh}{d\lambda} = (\nabla f(b + \lambda (a - b)))^T (a - b) \\\\ &\frac{d^2h}{d\lambda^2} = (a - b)^T (\nabla^2 f(b + \lambda(a - b))) (a - b) \end{align*} \] Here, \(\nabla^2 f(x)\) is the Hessian matrix of \(f\), and \(\frac{d^2h}{d\lambda^2}\) is a quadratic form in the direction \(a - b\).
Since \(h\) is convex, Theorem 1 gives \[ \frac{d^2h}{d\lambda^2} \geq 0 \quad \forall \lambda \in [0, 1]. \] Because \(a\) and \(b\) are arbitrary, the direction \(a - b\) is arbitrary as well, so the Hessian matrix is positive semidefinite for all \(x \in dom(f)\).

The converse direction follows by applying Theorem 1 to \(h\) along each line segment; for the sake of space, we omit the full proof.
The relationship between the Hessian matrix and convexity is fundamental to many optimization algorithms and is widely applied in both theoretical and practical optimization problems.
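As a small numerical illustration of Theorem 2, here is a sketch (assuming NumPy is available; the quadratic objective \(f(x) = \frac{1}{2} x^T A x\), whose Hessian is the constant matrix \(A\), is an arbitrary illustrative choice) that checks definiteness by inspecting the eigenvalues of the Hessian:

    import numpy as np

    # Quadratic objective f(x) = 0.5 * x^T A x has the constant Hessian A
    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])

    eigenvalues = np.linalg.eigvalsh(A)  # eigenvalues of the symmetric Hessian
    print("Hessian eigenvalues:", eigenvalues)
    print("Positive semidefinite (f convex):", bool(np.all(eigenvalues >= 0)))
    print("Positive definite (f strictly convex):", bool(np.all(eigenvalues > 0)))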

Gradient Descent

To find a local minimum of an objective function, we update the current point by moving in the direction of the negative gradient, \(- \nabla f\). This is because the gradient \(\nabla f\) points in the direction of steepest increase of the function \(f\); by contrast, the negative gradient points in the direction of steepest descent of \(f\). Thus, iteratively moving in the direction of \(- \nabla f\) reduces the value of \(f\), and with a suitable step size the iterates converge to a local minimum.

Algorithm 1: GRADIENT_DESCENT
Input: objective function \(f\), tolerance \(\epsilon\), learning rate \(\eta\);
Output: stationary point \(\theta^*\);
begin
  \(k \leftarrow 0\);
  Choose an initial point \(\theta^{(0)}\);
  repeat:
    Compute gradient: \(d^{(k)} = - \nabla f(\theta^{(k)});\)
    Update parameters: \(\theta^{(k+1)} = \theta^{(k)} + \eta d^{(k)};\)
    \(k \leftarrow k + 1;\)
  until \(\| \nabla f(\theta^{(k)}) \| < \epsilon\);
  Output \(\theta^{(k)}\);
end
Note: Gradient descent is called a first-order method since it requires only the gradient (first-order derivatives) of the objective.
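A minimal NumPy sketch of Algorithm 1 follows; the quadratic objective and its gradient function grad_example below are illustrative choices, not part of the algorithm:

    import numpy as np

    def gradient_descent(grad_f, theta0, eta=0.1, tol=1e-6, max_iter=10000):
        """Plain gradient descent: repeatedly step along the negative gradient."""
        theta = np.asarray(theta0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(theta)
            if np.linalg.norm(g) < tol:  # stopping criterion of Algorithm 1
                break
            theta = theta - eta * g      # move in the direction of -grad f
        return theta

    # Example: f(theta) = (theta_1 - 3)^2 + (theta_2 + 1)^2
    def grad_example(theta):
        return np.array([2 * (theta[0] - 3), 2 * (theta[1] + 1)])

    print(gradient_descent(grad_example, theta0=[0.0, 0.0]))  # approaches [3, -1]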

Stochastic Gradient Descent

In stochastic optimization, we minimize the expected value of an objective function: \[ \mathcal{L}(\theta) = \mathbb{E}_{q(z)}[\mathcal{L}(\theta, z)] \] where \(z\) is a random input to the objective function; \(z\) can be a training example or just a random noise term. At each iteration, we update: \[ \theta^{(k+1)} = \theta^{(k)} - \eta^{(k)} \nabla \mathcal{L}(\theta^{(k)}, z^{(k)}). \] This method is called stochastic gradient descent (SGD).
Note: We assume the distribution \(q(z)\) is independent of the parameters we want to optimize.
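For example, in linear regression the random input is a training pair \(z = (x, y)\) and the per-example loss is the squared error, \[ \mathcal{L}(\theta, z) = \tfrac{1}{2}(x^T \theta - y)^2, \qquad \nabla_\theta \mathcal{L}(\theta, z) = (x^T \theta - y)\, x, \] so each SGD step uses this single-example gradient in place of the gradient over the full dataset.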

A critical issue with full-batch GD is that at each iteration we have to compute the gradient with respect to all data points; if the given data set is huge, the algorithm becomes too expensive. Pure SGD, which uses a single example per update, is cheap but noisy. As a compromise, we introduce the mini-batch: typical mini-batch sizes are \(B = 32, 64, 128, \cdots\), depending on the data size \(N\), and we compute an "approximate" gradient using only a small mini-batch (a subset of the data). Intuitively, we quickly find a "good enough" direction to improve the objective.

Algorithm 2: MINI_BATCH_SGD
Input: dataset \(X\), objective function \(f\), tolerance \(\epsilon\), batch size \(B\), learning rate \(\eta\), max_epoch;
Output: stationary point \(\theta^*\);
begin
  Set \(k \leftarrow 0\);
  Choose an initial point \(\theta^{(0)}\);
  repeat:
    Shuffle the dataset \(X\) randomly;
    Divide \(X\) into mini-batches \(\{B_1, B_2, \cdots, B_m\}\), each of size \(B\);
    for each mini-batch \(B_i\):
      Compute gradient: \(d^{(k)} = - \frac{1}{|B_i|} \sum_{x \in B_i} \nabla f(x; \theta^{(k)});\)
      Update parameters: \(\theta^{(k+1)} = \theta^{(k)} + \eta d^{(k)};\)
    end for
    Set \(k \leftarrow k + 1;\)
  until \(\| d^{(k)} \| < \epsilon\) or \(k \geq\) max_epoch;
  Output \(\theta^{(k)}\);
end
Note: After each training epoch, we re-shuffle the dataset to generate different mini-batches.
Sample code for mini-batch SGD applied to linear regression is as follows:
    import numpy as np

    # Fixed parameters for mini-batch SGD
    MAX_EPOCH = 10000
    BATCH_SIZE = 64
    TOLERANCE = 1e-3
    LEARNING_RATE = 0.01

    # Sample data dimensions
    N_SAMPLES = 10000
    N_FEATURES = 3

    # Mini-batch Stochastic Gradient Descent
    def mini_batch_sgd(X, y):
        n_samples = X.shape[0]
        theta = np.random.randn(X.shape[1]) * 0.01  # initial theta

        for i in range(MAX_EPOCH):
            # Shuffle the data at the start of each epoch
            indices = np.random.permutation(n_samples)
            x_shuffled = X[indices]
            y_shuffled = y[indices]

            for j in range(0, n_samples, BATCH_SIZE):
                end = j + BATCH_SIZE
                x_batch = x_shuffled[j:end]
                y_batch = y_shuffled[j:end]

                # Compute the approximate gradient from the current mini-batch
                g = grad_f(theta, x_batch, y_batch)

                # Update parameters
                theta -= LEARNING_RATE * g

            # Check convergence using the gradient of the last mini-batch
            if np.linalg.norm(g) < TOLERANCE:
                print(f"Converged in {i + 1} epochs.")
                break
        return theta

    # Gradient of the objective for linear regression (gradient of the Mean Squared Error (MSE))
    def grad_f(theta, x_batch, y_batch):
        grad = (1 / x_batch.shape[0]) * x_batch.T @ ((x_batch @ theta) - y_batch)
        return grad

    # Main
    if __name__ == "__main__":
        # Generate sample data and objective
        X = np.random.rand(N_SAMPLES, N_FEATURES)
        true_theta = np.array([5, -1, -9])
        y = X @ true_theta + np.random.randn(X.shape[0]) * 0.1  # Add noise

        # Run mini-batch SGD
        estimated_theta = mini_batch_sgd(X, y)

        # Calculate relative error
        relative_error = np.linalg.norm(estimated_theta - true_theta) / np.linalg.norm(true_theta)

        print("True Parameters: ", true_theta)
        print("Estimated Parameters: ", estimated_theta)
        print(f"Relative Error: {relative_error:.6f}")

Sub-gradient Descent

Even if an objective function is non-smooth (not differentiable at some points), we can still generalize the notion of a gradient. In particular, for a convex function \(f: \mathbb{R}^n \to \mathbb{R}\), a vector \(g \in \mathbb{R}^n\) is called a subgradient of \(f\) at \(x \in dom(f)\) if \[ f(z) \geq f(x) + g^T (z - x) \tag{6} \] for all \(z \in dom(f)\).
A function \(f\) is subdifferentiable at \(x\) if there exists at least one subgradient \(g \in \mathbb{R}^n\) at \(x\); the set of all subgradients at \(x\) is called the subdifferential \(\partial f(x)\). For example, \(f(x) = |x|\) is not differentiable at \(x = 0\), but it is subdifferentiable everywhere: \[ \partial f(x) = \begin{cases} \{-1\} &\text{ if } x < 0 \\ [-1, 1] &\text{ if } x = 0 \\ \{1\} &\text{ if } x > 0 \end{cases}. \] In sub-gradient descent, we step along the negative of any subgradient satisfying inequality (6). However, it converges more slowly than GD on a smooth convex function, because the subgradient does not necessarily shrink near the global minimum, so a decaying step size is usually required.
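A minimal sketch of sub-gradient descent, assuming we minimize the non-smooth convex function \(f(x) = |x - 2|\) (an illustrative choice) with a decaying step size \(\eta^{(k)} = \eta_0 / \sqrt{k+1}\):

    import numpy as np

    def subgradient_descent(subgrad_f, x0, eta0=1.0, max_iter=1000):
        """Sub-gradient descent with decaying step size eta0 / sqrt(k + 1)."""
        x = float(x0)
        for k in range(max_iter):
            g = subgrad_f(x)  # any subgradient satisfying inequality (6)
            x = x - (eta0 / np.sqrt(k + 1)) * g
        return x

    # f(x) = |x - 2|: sign(x - 2) is a valid subgradient (0 at the kink x = 2)
    def subgrad_abs(x):
        return np.sign(x - 2)

    print(subgradient_descent(subgrad_abs, x0=-5.0))  # approaches 2

Because the subgradient of \(|x - 2|\) has magnitude 1 everywhere away from the minimum, the decaying step size is what forces the iterates to settle near \(x = 2\).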