In the previous chapter, we used a "fixed" learning rate (or step size), which was chosen
based on experimental results. However, the "optimal" step size can be determined by line search:
\[
\eta_t = \arg \min _{\eta > 0} \mathcal{L} (\theta_t + \eta d_t) \in \mathbb{R}.
\]
Consider a quadratic loss:
\[
\mathcal{L}(\theta + \eta \, d) = \frac{1}{2}(\theta + \eta \, d)^\top A (\theta + \eta \, d) + b^\top (\theta + \eta \, d) + c,
\]
where \(\theta, d, b \in \mathbb{R}^n\), \(A \in \mathbb{R}^{n \times n}\) is symmetric, and \(\eta, c \in \mathbb{R}\).
Taking the derivative with respect to \(\eta\):
\[
\begin{align*}
\frac{d \mathcal{L}(\theta + \eta d)}{d\eta} &= \frac{1}{2} d^\top 2A(\theta + \eta d) + d^\top b \\\\
&= d^\top (A\theta + b) + \eta d^\top A d
\end{align*}
\]
Setting the derivative equal to zero, we have:
\[
\eta = - \frac{d^\top (A\theta + b)}{d^\top A d}.
\]
In practice, we don't need the "exact" line search, because solving this sub-optimization problem "at each step" can be expensive.
For example, we can start with some initial step size and then shrink it by a factor \(\rho \in (0, 1)\) at each
iteration until the following condition, with a constant \(c \in (0, 1)\), is satisfied:
\[
\mathcal{L}(\theta_t + \eta \, d_t) \leq \mathcal{L}(\theta_t) + c \, \eta \, d_t^\top \nabla \mathcal{L}(\theta_t) \tag{1}
\]
This is called the Armijo condition (sufficient decrease condition); it ensures a sufficient
decrease of the objective function.
Note: Usually, we set \(c = 10^{-4}\).
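A minimal backtracking sketch of this procedure (not the exact routine used later in this chapter) might look as follows; here rho is the shrink factor and c the Armijo constant described above, and f, grad_f, theta, d are assumed to be supplied by the caller.
import numpy as np

# A backtracking line search: shrink eta by rho until the Armijo condition holds.
def backtracking_line_search(f, grad_f, theta, d, eta0=1.0, rho=0.5, c=1e-4, max_iter=50):
    eta = eta0
    f0 = f(theta)
    slope = np.dot(grad_f(theta), d)  # directional derivative; negative for a descent direction
    for _ in range(max_iter):
        if f(theta + eta * d) <= f0 + c * eta * slope:
            return eta  # sufficient decrease achieved
        eta *= rho  # otherwise shrink the step size
    return eta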
Newton's Method
Even though first-order methods such as gradient descent are computationally cheap, they do not take the
curvature into account, which often leads to slower convergence. A classic second-order method, Newton's method,
is as follows:
Algorithm 1: NEWTONS_METHOD
Input: objective function \(\mathcal{L}\), tolerance \(\epsilon\), initial step size \(\eta_0\);
Output: stationary point \(\theta^*\);
begin
\(t \leftarrow 0\);
Choose an initial point \(\theta_0\);
repeat:
Compute gradient: \(g_t = \nabla \mathcal{L}(\theta_t);\)
Compute Hessian: \(H_t = \nabla^2 \mathcal{L}(\theta_t);\)
Solve \(H_t d_t = -g_t\) for the search direction \(d_t\) (equivalently, \(d_t = -H_t^{-1} g_t\));
Do LINE_SEARCH to get step size \(\eta_t\) along \(d_t\);
Update parameters: \(\theta_{t+1} = \theta_t + \eta_t d_t;\)
\( t \leftarrow t + 1 ;\)
until \(\| g_t \| < \epsilon\);
Output \(\theta_t\);
end
A critical issue with Newton's method is that it requires solving the linear system \(H_t d_t = -g_t\) (or, equivalently, computing \(H_t^{-1}\)) at
each step \(t\), which costs \(O(n^3)\) time in general, in addition to \(O(n^2)\) memory for the Hessian itself.
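As an illustration, here is a minimal NumPy sketch of Algorithm 1. It assumes the caller supplies the gradient grad_f and the Hessian hess_f, and it takes the full Newton step \(\eta = 1\) instead of performing a line search, to keep the example short.
import numpy as np

# Minimal Newton iteration: solve H_t d_t = -g_t and take the full step (eta = 1).
def newtons_method(grad_f, hess_f, theta0, tol=1e-8, max_iter=100):
    theta = theta0.copy()
    for _ in range(max_iter):
        g = grad_f(theta)
        if np.linalg.norm(g) < tol:
            break
        H = hess_f(theta)
        d = np.linalg.solve(H, -g)  # solve the linear system; avoids forming H^{-1} explicitly
        theta = theta + d  # a line search along d could be inserted here
    return theta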
BFGS Method
In Quasi-Newton methods, we approximate the Hessian using curvature information obtained from the gradients at each step.
The most common Quasi-Newton method is the BFGS (Broyden, Fletcher, Goldfarb, and Shanno) method.
Approximation of Hessian in BFGS
\[
\begin{align*}
H_{t+1} &\approx B_{t+1} \\\\
&= B_{t} + \frac{y_t y_t^\top}{y_t^\top s_t} - \frac{(B_t s_t)(B_t s_t)^\top }{s_t^\top B_t s_t}
\end{align*}
\]
where \(s_t = \theta_{t+1} - \theta_{t}\) is the step in the parameter space and \(y_t = g_{t+1} - g_{t}\) is the change in the gradient
of the objective \(\mathcal{L}\).
Alternatively, we can iteratively update an approximation to the inverse Hessian:
\[
\begin{align*}
H_{t+1}^{-1} &\approx C_{t+1} \\\\
&= \Big(I - \frac{s_t y_t^\top}{y_t^\top s_t} \Big) C_t \Big( I - \frac{y_t s_t^\top}{y_t^\top s_t}\Big) + \frac{s_t s_t^\top}{y_t^\top s_t}.
\end{align*}
\]
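Both updates translate almost line-for-line into NumPy. The sketch below is a direct transcription under the assumption that the curvature condition \(y_t^\top s_t > 0\) holds; the function names are my own, not part of any library.
import numpy as np

# Rank-2 BFGS update of the Hessian approximation B_t (assumes y @ s > 0).
def bfgs_update_B(B, s, y):
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

# Corresponding update of the inverse-Hessian approximation C_t.
def bfgs_update_C(C, s, y):
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ C @ V.T + rho * np.outer(s, s)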
By the Taylor expansion of the gradient \(g(\theta) = \nabla \mathcal{L}(\theta)\) around \(\theta_t\), neglecting higher-order terms:
\[
\begin{align*}
&g(\theta_{t+1}) \approx g(\theta_t) + H(\theta_t) (\theta_{t+1}-\theta_t) \\\\
&\Longrightarrow y_t = H_t s_t,
\end{align*}
\]
where
\[
\begin{align*}
&y_t = g(\theta_{t+1}) - g(\theta_t) \\\\
&s_t = \theta_{t+1}-\theta_t
\end{align*}
\]
To efficiently approximate the Hessian \(H_t\), we compute the updated Hessian approximation \(B_{t+1}\) at the end of step \(t\).
It must satisfy the secant condition:
\[
B_{t+1} s_t = y_t.
\]
This condition ensures that the updated Hessian approximation \(B_{t+1}\) reflects the curvature
relationship between the step \(s_t\) and the gradient change \(y_t\). (Note: we also require \(y_t^\top s_t > 0\), which the Wolfe conditions introduced below guarantee.)
Assuming \(B_{t+1}\) is a positive definite rank-2 update of \(B_t\):
\[
B_{t+1} = B_t + U,
\]
the secant condition becomes:
\[
(B_t + U) s_t = y_t \Longrightarrow U s_t = y_t - B_t s_t.
\]
Here, the correction matrix \(U\) must account for the discrepancy \(y_t - B_t s_t\).
We assume \(U\) takes the form:
\[
U = \frac{y_t y_t^\top}{y_t^\top s_t} - \frac{(B_t s_t)(B_t s_t)^\top }{s_t^\top B_t s_t}.
\]
The first term adds curvature in the direction of \(y_t\), normalized by \(y_t^\top s_t\), ensuring positive definiteness.
The second term removes excess curvature in the direction of \(s_t\), ensuring the secant condition is satisfied without over-correction.
Verification:
\[
\begin{align*}
&B_{t+1} = B_t + \frac{y_t y_t^\top}{y_t^\top s_t} - \frac{(B_t s_t)(B_t s_t)^\top }{s_t^\top B_t s_t} \\\\
&\Longrightarrow B_{t+1} s_t = B_t s_t + \Big\{\frac{y_t y_t^\top}{y_t^\top s_t} - \frac{(B_t s_t)(B_t s_t)^\top }{s_t^\top B_t s_t}\Big\} s_t\\\\
&\Longrightarrow B_{t+1} s_t = B_t s_t + y_t \cdot \frac{y_t^\top s_t}{y_t^\top s_t} - B_t s_t \cdot \frac{(s_t^\top B_t s_t)}{s_t^\top B_t s_t}\\\\
&\Longrightarrow B_{t+1} s_t = B_t s_t + y_t - B_t s_t \\\\
&\Longrightarrow B_{t+1} s_t = y_t
\end{align*}
\]
Therefore, \(B_{t+1}\) satisfies the secant condition, maintaining positive definiteness while approximating
the curvature of the objective function.
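The derivation can also be verified numerically. The following snippet builds an arbitrary symmetric positive definite \(B_t\) and a pair \((s_t, y_t)\) with \(y_t^\top s_t > 0\), applies the rank-2 update, and checks that \(B_{t+1} s_t = y_t\) and that positive definiteness is preserved (all values are made up for illustration).
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)  # arbitrary symmetric positive definite B_t
s = rng.standard_normal(n)  # step s_t
y = B @ s + 0.1 * rng.standard_normal(n)  # gradient change y_t
assert y @ s > 0  # curvature condition
Bs = B @ s
B_next = B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
print(np.allclose(B_next @ s, y))  # True: secant condition B_{t+1} s_t = y_t
print(np.all(np.linalg.eigvalsh(B_next) > 0))  # True: positive definiteness is preserved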
If the initial \(B_0\) is positive definite (usually \(B_0 = I\)), and the step size \(\eta\) is found via Condition (1) together with the following
curvature condition:
\[
d_t^\top \nabla \mathcal{L}(\theta_t + \eta \, d_t) \geq c_2 \, d_t^\top \nabla \mathcal{L}(\theta_t) \tag{2}
\]
where \(c_2 \in (c, 1)\),
then \(B_{t+1}\) remains positive definite. Conditions (1) and (2) together are called the Wolfe conditions.
In practice, we cannot store the whole Hessian approximation for large-scale problems, since it requires \(O(n^2)\) memory.
In limited-memory BFGS (L-BFGS), we only keep the \(m\) most recent \((s_t, y_t)\) pairs and do not store the Hessian approximation
explicitly. Instead, we approximate \(H_t^{-1} g_t\) by computing a sequence of inner products involving these vectors. The memory requirement
is then \(O(mn)\), where \(m\) is typically 5 to 20.
Algorithm 2: L_BFGS
Input: objective function \(\mathcal{L}\), tolerance \(\epsilon\), initial step size \(\eta_0\);
Output: stationary point \(\theta^*\);
begin
Choose an initial point \(\theta_0\);
Initialize \( (s, y) \) storage for \( m \) past updates;
\(t \leftarrow 0\);
repeat:
Compute gradient \( g_t = \nabla \mathcal{L}(\theta_t) \);
Set \( q_t \leftarrow g_t \);
Loop backward (from latest stored \( (s, y) \) pairs):
Compute \( \alpha_i = \frac{s_i^\top q_t}{y_i^\top s_i} \);
Update \( q_t \leftarrow q_t - \alpha_i y_i \);
Set \( r_t \leftarrow H_t^0 q_t \);
Loop forward (from oldest stored \( (s, y) \) pairs):
Compute \( \beta_i = \frac{y_i^\top r_t}{y_i^\top s_i} \);
Update \( r_t \leftarrow r_t + s_i (\alpha_i - \beta_i) \);
Set search direction \( p_t \leftarrow -r_t \);
Do LINE_SEARCH: find step size \( \eta_t \) along \(p_t\) s.t. \( \mathcal{L}(\theta_t + \eta_t p_t) < \mathcal{L}(\theta_t) \);
Update parameters: \( \theta_{t+1} \leftarrow \theta_t + \eta_t p_t \);
Update gradient \( g_{t+1} \leftarrow \nabla \mathcal{L}(\theta_{t+1}) \);
Store \( (s_t\leftarrow \theta_{t+1} - \theta_t,\, y_t \leftarrow g_{t+1} - g_t) \);
Maintain memory size: if the number of stored \( (s, y) \) pairs exceeds \( m \), discard the oldest pair;
\( t \leftarrow t + 1\);
until \( \|g_t\| < \epsilon \);
Output \(\theta_t\);
end
Note: For computational efficiency and numerical stability, instead of recomputing \(\frac{1}{s^\top y}\) multiple times, a variable
\(\rho = \frac{1}{s^\top y}\) is usually introduced and stored alongside each \((s, y)\) pair.
Python Implementation
A sample code for L-BFGS is as follows:
import numpy as np
# Line search with Wolfe conditions
def line_search(f, grad_f, theta, p, c1 = 1e-4, c2 = 0.9, max_iter = 100):
eta = 1.0
eta_low = 0.0
eta_high = None
phi_0 = f(theta)
grad_phi_0 = np.dot(grad_f(theta), p)
for _ in range(max_iter):
phi_eta = f(theta + eta * p)
# Check Armijo condition
if phi_eta > phi_0 + c1 * eta * grad_phi_0:
eta_high = eta
else:
# Check Curvature condition
grad_phi_eta = np.dot(grad_f(theta + eta * p), p)
if grad_phi_eta < c2 * grad_phi_0:
eta_low = eta
else:
return eta
# Update step size using bisection method
if eta_high is not None:
eta = (eta_low + eta_high) / 2.0
else:
eta *= 2.0
return eta
# Limited memory BFGS
def limited_bfgs(f, grad_f, theta0, m = 10, tol = 1e-6, max_iter = 2000):
theta = theta0.copy()
g = grad_f(theta)
s_list = []
y_list = []
rho_list = [] # We introduce rho = 1/s^T y instead of directly using 1/s^T y for efficiency & stability.
for _ in range(max_iter):
if np.linalg.norm(g) < tol:
break
q = g.copy()
alpha_list = [] # Need this for the second loop to avoid computing alpha again
# Loop backward through stored (s, y) pairs
for s, y, rho in reversed(list(zip(s_list, y_list, rho_list))):
alpha = rho * np.dot(s, q)
alpha_list.append(alpha)
q -= alpha * y
# Initial Hessian approximation is identity: H0 = I, so H0 * q = q
        r = q.copy()  # copy so that the in-place updates to r below do not alias q
# Loop forward through stored (s, y) pairs
for (s, y, rho), alpha in zip(zip(s_list, y_list, rho_list), reversed(alpha_list)):
beta = rho * np.dot(y, r)
r += s * (alpha - beta)
# Search direction
p = -r
# Compute the step size by Line search satisfying Wolfe conditions
eta = line_search(f, grad_f, theta, p)
# Update parameters
theta += eta * p
grad_next = grad_f(theta)
# Update memory for (s, y) pairs
s = eta * p
y = grad_next - g
if np.dot(s, y) > 1e-6:
if len(s_list) == m:
s_list.pop(0)
y_list.pop(0)
rho_list.pop(0)
s_list.append(s)
y_list.append(y)
rho_list.append(1.0 / np.dot(y, s))
# Update gradient
g = grad_next
return theta
# Objective function and its gradient:
# The Rosenbrock function is commonly used for testing optimization algorithms.
def rosenbrock(x):
return np.sum(100 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)
def grad_rosenbrock(x):
grad = np.zeros_like(x)
    grad[:-1] = -400 * x[:-1] * (x[1:] - x[:-1]**2) - 2 * (1 - x[:-1])  # terms for x_1 to x_{n-1}
grad[1:] += 200 * (x[1:] - x[:-1]**2) # x_2 to x_n
return grad
# Finite difference gradient to check grad_rosenbrock().
def finite_difference_gradient(f, x, epsilon=1e-6):
grad = np.zeros_like(x)
for i in range(len(x)):
x_forward = x.copy()
x_backward = x.copy()
x_forward[i] += epsilon
x_backward[i] -= epsilon
grad[i] = (f(x_forward) - f(x_backward)) / (2 * epsilon)
return grad
if __name__ == "__main__":
n = 50 # Dimensionality
# Randomly generate an initial point x0:
    # You could try: x0 = np.ones(n) + 0.3 * np.random.randn(n), which adds a small
    # perturbation around the global minimum x* = [1, ... , 1]^T
x0 = np.random.randn(n)
numeric_opt = limited_bfgs(rosenbrock, grad_rosenbrock, x0)
print("Initial point: \n", x0.tolist())
print("\n Numerical optimum: \n", numeric_opt.tolist())
'''
    # If you are not sure about the gradient of the objective, you can always compare it with a finite-difference approximation.
grad_analytic = grad_rosenbrock(x0)
grad_numeric = finite_difference_gradient(rosenbrock, x0)
relative_error_grad = np.linalg.norm(grad_analytic - grad_numeric) / np.linalg.norm(grad_analytic)
print("Relative error:", relative_error_grad)
'''
# Relative error between the numerical optimum and the global optimum
global_opt = np.ones(n) # The actual global optimum of Rosenbrock function is x* = [1, ... , 1]^T
relative_error_optimal = np.linalg.norm(numeric_opt - global_opt) / np.linalg.norm(global_opt)
print(f"\n Relative error to the global minimum: {relative_error_optimal*100:.8f}%")
Note: In the above code, we used the Rosenbrock function, which is defined as
\[
f(x) = \sum_{i = 1}^{n-1} \Big[ a (x_{i+1} - x_i^2)^2 + (b - x_i)^2 \Big]
\]
where \(x \in \mathbb{R}^n\), and usually we set the coefficients \(a = 100\) and \(b = 1\).
The global minimum is attained at \(x = [1, 1, \cdots, 1]\), but in numerical optimization, converging to
it is difficult for several reasons:
Broad, nearly flat regions and local minima: optimization algorithms can stall or get trapped there.
Ill-conditioning: the curvature differs greatly across directions, so small steps in one variable can require large compensating changes in another.
A narrow, curved valley containing the global minimum: this is hard for optimization algorithms to navigate, since the
steep sides of the valley can cause them to overshoot or converge very slowly.
For these reasons, the Rosenbrock function is a common benchmark for testing the robustness and efficiency of optimization algorithms.