In the previous chapter, we used a "fixed" learning rate (or step size), which was chosen
based on experimental results. However, the "optimal" step size can be determined by line search:
\[
\eta_t = \arg \min _{\eta > 0} \mathcal{L} (\theta_t + \eta d_t) \in \mathbb{R}.
\]
Consider a quadratic loss:
\[
\mathcal{L}(\theta + \eta \, d) = \frac{1}{2}(\theta + \eta \, d)^\top A (\theta + \eta \, d) + b^\top (\theta + \eta \, d) + c,
\]
where \(\theta, d, b \in \mathbb{R}^n\), \(A \in \mathbb{R}^{n \times n}\) is symmetric, and \(\eta, c \in \mathbb{R}\).
Taking the derivative with respect to \(\eta\):
\[
\begin{align*}
\frac{d \mathcal{L}(\theta + \eta d)}{d\eta} &= \frac{1}{2} d^\top 2A(\theta + \eta d) + d^\top b \\\\
&= d^\top (A\theta + b) + \eta d^\top A d
\end{align*}
\]
Setting the derivative equal to zero, we have:
\[
\eta = - \frac{d^\top (A\theta + b)}{d^\top A d}.
\]
In practice, we don't need the "exact" line search, because solving this sub-optimization problem "at each step" can be expensive.
For example, we can start with some initial step size and then shrink it by a factor \(\rho \in (0, 1)\) at each
iteration until the following condition, with a constant \(c \in (0, 1)\), is satisfied:
\[
\mathcal{L}(\theta_t + \eta \, d_t) \leq \mathcal{L}(\theta_t) + c \, \eta \, d_t^\top \nabla \mathcal{L}(\theta_t) \tag{1}
\]
This is called the Armijo condition (sufficient decrease condition); it ensures a sufficient
decrease of the objective function.
Note: Usually, we set \(c = 10^{-4}\).
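A minimal backtracking sketch of this procedure (not the exact routine used later in this chapter) might look as follows; here rho is the shrink factor and c the Armijo constant described above, and f, grad_f, theta, d are assumed to be supplied by the caller.
import numpy as np

# A backtracking line search: shrink eta by rho until the Armijo condition holds.
def backtracking_line_search(f, grad_f, theta, d, eta0=1.0, rho=0.5, c=1e-4, max_iter=50):
    eta = eta0
    f0 = f(theta)
    slope = np.dot(grad_f(theta), d)  # directional derivative; negative for a descent direction
    for _ in range(max_iter):
        if f(theta + eta * d) <= f0 + c * eta * slope:
            return eta  # sufficient decrease achieved
        eta *= rho  # otherwise shrink the step size
    return eta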
Newton's Method
Even though first-order methods such as gradient descent are computationally cheap, they do not take the
curvature into account, which often leads to slower convergence. A classic second-order method, Newton's method,
is as follows:
Algorithm 1: NEWTONS_METHOD
Input: objective function \(\mathcal{L}\), tolerance \(\epsilon\), initial step size \(\eta_0\);
Output: stationary point \(\theta^*\);
begin
\(t \leftarrow 0\);
Choose an initial point \(\theta_0\);
repeat:
Compute gradient: \(g_t = \nabla \mathcal{L}(\theta_t);\)
Compute Hessian: \(H_t = \nabla^2 \mathcal{L}(\theta_t);\)
Solve \(H_t d_t = -g_t\) for the search direction \(d_t\) (equivalently, \(d_t = -H_t^{-1} g_t\));
Do LINE_SEARCH to get step size \(\eta_t\) along \(d_t\);
Update parameters: \(\theta_{t+1} = \theta_t + \eta_t d_t;\)
\( t \leftarrow t + 1 ;\)
until \(\| g_t \| < \epsilon\);
Output \(\theta_t\);
end
A critical issue with Newton's method is that it requires solving the linear system \(H_t d_t = -g_t\) (or, equivalently, computing \(H_t^{-1}\)) at
each step \(t\), which costs \(O(n^3)\) time in general, in addition to \(O(n^2)\) memory for the Hessian itself.
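As an illustration, here is a minimal NumPy sketch of Algorithm 1. It assumes the caller supplies the gradient grad_f and the Hessian hess_f, and it takes the full Newton step \(\eta = 1\) instead of performing a line search, to keep the example short.
import numpy as np

# Minimal Newton iteration: solve H_t d_t = -g_t and take the full step (eta = 1).
def newtons_method(grad_f, hess_f, theta0, tol=1e-8, max_iter=100):
    theta = theta0.copy()
    for _ in range(max_iter):
        g = grad_f(theta)
        if np.linalg.norm(g) < tol:
            break
        H = hess_f(theta)
        d = np.linalg.solve(H, -g)  # solve the linear system; avoids forming H^{-1} explicitly
        theta = theta + d  # a line search along d could be inserted here
    return theta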
BFGS Method
In Quasi-Newton methods, we approximate the Hessian using curvature information obtained from the gradients at each step.
The most common Quasi-Newton method is the BFGS (Broyden, Fletcher, Goldfarb, and Shanno) method.
Approximation of Hessian in BFGS
\[
\begin{align*}
H_{t+1} &\approx B_{t+1} \\\\
&= B_{t} + \frac{y_t y_t^\top}{y_t^\top s_t} - \frac{(B_t s_t)(B_t s_t)^\top }{s_t^\top B_t s_t}
\end{align*}
\]
where \(s_t = \theta_{t+1} - \theta_{t}\) is the step in the parameter space and \(y_t = g_{t+1} - g_{t}\) is the change in the gradient
of the objective \(\mathcal{L}\).
Alternatively, we can iteratively update an approximation to the inverse Hessian:
\[
\begin{align*}
H_{t+1}^{-1} &\approx C_{t+1} \\\\
&= \Big(I - \frac{s_t y_t^\top}{y_t^\top s_t} \Big) C_t \Big( I - \frac{y_t s_t^\top}{y_t^\top s_t}\Big) + \frac{s_t s_t^\top}{y_t^\top s_t}.
\end{align*}
\]
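Both updates translate almost line-for-line into NumPy. The sketch below is a direct transcription under the assumption that the curvature condition \(y_t^\top s_t > 0\) holds; the function names are my own, not part of any library.
import numpy as np

# Rank-2 BFGS update of the Hessian approximation B_t (assumes y @ s > 0).
def bfgs_update_B(B, s, y):
    Bs = B @ s
    return B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)

# Corresponding update of the inverse-Hessian approximation C_t.
def bfgs_update_C(C, s, y):
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ C @ V.T + rho * np.outer(s, s)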
By the Taylor expansion of the gradient \(g(\theta) = \nabla \mathcal{L}(\theta)\) around \(\theta_t\), neglecting higher-order terms:
\[
\begin{align*}
&g(\theta_{t+1}) \approx g(\theta_t) + H(\theta_t) (\theta_{t+1}-\theta_t) \\\\
&\Longrightarrow y_t = H_t s_t,
\end{align*}
\]
where
\[
\begin{align*}
&y_t = g(\theta_{t+1}) - g(\theta_t) \\\\
&s_t = \theta_{t+1}-\theta_t
\end{align*}
\]
To efficiently approximate the Hessian \(H_t\), we compute the updated Hessian approximation \(B_{t+1}\) at the end of step \(t\).
It must satisfy the secant condition:
\[
B_{t+1} s_t = y_t.
\]
This condition ensures that the updated Hessian approximation \(B_{t+1}\) reflects the curvature
relationship between the step \(s_t\) and the gradient change \(y_t\). (Note: we also require \(y_t^\top s_t > 0\), which the Wolfe conditions introduced below guarantee.)
Assuming \(B_{t+1}\) is a positive definite rank-2 update of \(B_t\):
\[
B_{t+1} = B_t + U,
\]
the secant condition becomes:
\[
(B_t + U) s_t = y_t \Longrightarrow U s_t = y_t - B_t s_t.
\]
Here, the correction matrix \(U\) must account for the discrepancy \(y_t - B_t s_t\).
We assume \(U\) takes the form:
\[
U = \frac{y_t y_t^\top}{y_t^\top s_t} - \frac{(B_t s_t)(B_t s_t)^\top }{s_t^\top B_t s_t}.
\]
The first term adds curvature in the direction of \(y_t\), normalized by \(y_t^\top s_t\), ensuring positive definiteness.
The second term removes excess curvature in the direction of \(s_t\), ensuring the secant condition is satisfied without over-correction.
Verification:
\[
\begin{align*}
&B_{t+1} = B_t + \frac{y_t y_t^\top}{y_t^\top s_t} - \frac{(B_t s_t)(B_t s_t)^\top }{s_t^\top B_t s_t} \\\\
&\Longrightarrow B_{t+1} s_t = B_t s_t + \Big\{\frac{y_t y_t^\top}{y_t^\top s_t} - \frac{(B_t s_t)(B_t s_t)^\top }{s_t^\top B_t s_t}\Big\} s_t\\\\
&\Longrightarrow B_{t+1} s_t = B_t s_t + y_t \cdot \frac{y_t^\top s_t}{y_t^\top s_t} - B_t s_t \cdot \frac{(s_t^\top B_t s_t)}{s_t^\top B_t s_t}\\\\
&\Longrightarrow B_{t+1} s_t = B_t s_t + y_t - B_t s_t \\\\
&\Longrightarrow B_{t+1} s_t = y_t
\end{align*}
\]
Therefore, \(B_{t+1}\) satisfies the secant condition, maintaining positive definiteness while approximating
the curvature of the objective function.
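The derivation can also be verified numerically. The following snippet builds an arbitrary symmetric positive definite \(B_t\) and a pair \((s_t, y_t)\) with \(y_t^\top s_t > 0\), applies the rank-2 update, and checks that \(B_{t+1} s_t = y_t\) and that positive definiteness is preserved (all values are made up for illustration).
import numpy as np

rng = np.random.default_rng(0)
n = 5
M = rng.standard_normal((n, n))
B = M @ M.T + n * np.eye(n)  # arbitrary symmetric positive definite B_t
s = rng.standard_normal(n)  # step s_t
y = B @ s + 0.1 * rng.standard_normal(n)  # gradient change y_t
assert y @ s > 0  # curvature condition
Bs = B @ s
B_next = B + np.outer(y, y) / (y @ s) - np.outer(Bs, Bs) / (s @ Bs)
print(np.allclose(B_next @ s, y))  # True: secant condition B_{t+1} s_t = y_t
print(np.all(np.linalg.eigvalsh(B_next) > 0))  # True: positive definiteness is preserved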
If the initial \(B_0\) is positive definite (usually \(B_0 = I\)), and the step size \(\eta\) is found via Condition (1) together with the following
curvature condition:
\[
d_t^\top \nabla \mathcal{L}(\theta_t + \eta \, d_t) \geq c_2 \, d_t^\top \nabla \mathcal{L}(\theta_t) \tag{2}
\]
where \(c_2 \in (c, 1)\),
then \(B_{t+1}\) remains positive definite. Conditions (1) and (2) together are called the Wolfe conditions.
In practice, we cannot store the whole Hessian approximation for large-scale problems, since it requires \(O(n^2)\) memory.
In limited-memory BFGS (L-BFGS), we only keep the \(m\) most recent \((s_t, y_t)\) pairs and do not store the Hessian approximation
explicitly. Instead, we approximate \(H_t^{-1} g_t\) by computing a sequence of inner products involving these vectors. The memory requirement
is then \(O(mn)\), where \(m\) is typically 5 to 20.
Algorithm 2: L_BFGS
Input: objective function \(\mathcal{L}\), tolerance \(\epsilon\), initial step size \(\eta_0\);
Output: stationary point \(\theta^*\);
begin
Choose an initial point \(\theta_0\);
Initialize \( (s, y) \) storage for \( m \) past updates;
\(t \leftarrow 0\);
repeat:
Compute gradient \( g_t = \nabla \mathcal{L}(\theta_t) \);
Set \( q_t \leftarrow g_t \);
Loop backward (from latest stored \( (s, y) \) pairs):
Compute \( \alpha_i = \frac{s_i^\top q_t}{y_i^\top s_i} \);
Update \( q_t \leftarrow q_t - \alpha_i y_i \);
Set \( r_t \leftarrow H_t^0 q_t \);
Loop forward (from oldest stored \( (s, y) \) pairs):
Compute \( \beta_i = \frac{y_i^\top r_t}{y_i^\top s_i} \);
Update \( r_t \leftarrow r_t + s_i (\alpha_i - \beta_i) \);
Set search direction \( p_t \leftarrow -r_t \);
Do LINE_SEARCH: find step size \( \eta_t \) along \(p_t\) s.t. \( \mathcal{L}(\theta_t + \eta_t p_t) < \mathcal{L}(\theta_t) \);
Update parameters: \( \theta_{t+1} \leftarrow \theta_t + \eta_t p_t \);
Update gradient \( g_{t+1} \leftarrow \nabla \mathcal{L}(\theta_{t+1}) \);
Store \( (s_t\leftarrow \theta_{t+1} - \theta_t,\, y_t \leftarrow g_{t+1} - g_t) \);
Maintain memory size: if the number of stored \( (s, y) \) pairs exceeds \( m \), discard the oldest pair;
\( t \leftarrow t + 1\);
until \( \|g_t\| < \epsilon \);
Output \(\theta_t\);
end
Note: For computational efficiency and numerical stability, instead of recomputing \(\frac{1}{s^\top y}\) multiple times, a variable
\(\rho = \frac{1}{s^\top y}\) is usually introduced and stored alongside each \((s, y)\) pair.
Python Implementation
A sample code for L-BFGS is as follows:
import numpy as np
# Line search with Wolfe conditions
def line_search(f, grad_f, theta, p, c1 = 1e-4, c2 = 0.9, max_iter = 100):
eta = 1.0
eta_low = 0.0
eta_high = None
phi_0 = f(theta)
grad_phi_0 = np.dot(grad_f(theta), p)
for _ in range(max_iter):
phi_eta = f(theta + eta * p)
# Check Armijo condition
if phi_eta > phi_0 + c1 * eta * grad_phi_0:
eta_high = eta
else:
# Check Curvature condition
grad_phi_eta = np.dot(grad_f(theta + eta * p), p)
if grad_phi_eta < c2 * grad_phi_0:
eta_low = eta
else:
return eta
# Update step size using bisection method
if eta_high is not None:
eta = (eta_low + eta_high) / 2.0
else:
eta *= 2.0
return eta
# Limited memory BFGS
def limited_bfgs(f, grad_f, theta0, m = 10, tol = 1e-6, max_iter = 2000):
theta = theta0.copy()
g = grad_f(theta)
s_list = []
y_list = []
rho_list = [] # We introduce rho = 1/s^T y instead of directly using 1/s^T y for efficiency & stability.
for _ in range(max_iter):
if np.linalg.norm(g) < tol:
break
q = g.copy()
alpha_list = [] # Need this for the second loop to avoid computing alpha again
# Loop backward through stored (s, y) pairs
for s, y, rho in reversed(list(zip(s_list, y_list, rho_list))):
alpha = rho * np.dot(s, q)
alpha_list.append(alpha)
q -= alpha * y
# Initial Hessian approximation is identity: H0 = I, so H0 * q = q
        r = q.copy()  # copy so that the in-place updates to r below do not alias q
# Loop forward through stored (s, y) pairs
for (s, y, rho), alpha in zip(zip(s_list, y_list, rho_list), reversed(alpha_list)):
beta = rho * np.dot(y, r)
r += s * (alpha - beta)
# Search direction
p = -r
# Compute the step size by Line search satisfying Wolfe conditions
eta = line_search(f, grad_f, theta, p)
# Update parameters
theta += eta * p
grad_next = grad_f(theta)
# Update memory for (s, y) pairs
s = eta * p
y = grad_next - g
if np.dot(s, y) > 1e-6:
if len(s_list) == m:
s_list.pop(0)
y_list.pop(0)
rho_list.pop(0)
s_list.append(s)
y_list.append(y)
rho_list.append(1.0 / np.dot(y, s))
# Update gradient
g = grad_next
return theta
# Objective function and its gradient:
# The Rosenbrock function is commonly used for testing optimization algorithms.
def rosenbrock(x):
return np.sum(100 * (x[1:] - x[:-1]**2)**2 + (1 - x[:-1])**2)
def grad_rosenbrock(x):
grad = np.zeros_like(x)
    grad[:-1] = -400 * x[:-1] * (x[1:] - x[:-1]**2) - 2 * (1 - x[:-1])  # terms for x_1 to x_{n-1}
grad[1:] += 200 * (x[1:] - x[:-1]**2) # x_2 to x_n
return grad
# Finite difference gradient to check grad_rosenbrock().
def finite_difference_gradient(f, x, epsilon=1e-6):
grad = np.zeros_like(x)
for i in range(len(x)):
x_forward = x.copy()
x_backward = x.copy()
x_forward[i] += epsilon
x_backward[i] -= epsilon
grad[i] = (f(x_forward) - f(x_backward)) / (2 * epsilon)
return grad
if __name__ == "__main__":
n = 50 # Dimensionality
# Randomly generate an initial point x0:
    # You could try: x0 = np.ones(n) + 0.3 * np.random.randn(n), which adds a small
    # perturbation around the global minimum x* = [1, ... , 1]^T
x0 = np.random.randn(n)
numeric_opt = limited_bfgs(rosenbrock, grad_rosenbrock, x0)
print("Initial point: \n", x0.tolist())
print("\n Numerical optimum: \n", numeric_opt.tolist())
'''
    # If you are not sure about the gradient of the objective, you can always compare it with a finite-difference approximation.
grad_analytic = grad_rosenbrock(x0)
grad_numeric = finite_difference_gradient(rosenbrock, x0)
relative_error_grad = np.linalg.norm(grad_analytic - grad_numeric) / np.linalg.norm(grad_analytic)
print("Relative error:", relative_error_grad)
'''
# Relative error between the numerical optimum and the global optimum
global_opt = np.ones(n) # The actual global optimum of Rosenbrock function is x* = [1, ... , 1]^T
relative_error_optimal = np.linalg.norm(numeric_opt - global_opt) / np.linalg.norm(global_opt)
print(f"\n Relative error to the global minimum: {relative_error_optimal*100:.8f}%")
Note: In the above code, we used the Rosenbrock function, which is defined as
\[
f(x) = \sum_{i = 1}^{n-1} \Big[ a (x_{i+1} - x_i^2)^2 + (b - x_i)^2 \Big]
\]
where \(x \in \mathbb{R}^n\), and usually we set the coefficients \(a = 100\) and \(b = 1\).
The global minimum is attained at \(x = [1, 1, \cdots, 1]\), but in numerical optimization, converging to
it is difficult for several reasons:
Broad, nearly flat regions and local minima: optimization algorithms can stall or get trapped there.
Ill-conditioning: the curvature differs greatly across directions, so small steps in one variable can require large compensating changes in another.
A narrow, curved valley containing the global minimum: this is hard for optimization algorithms to navigate, since the
steep sides of the valley can cause them to overshoot or converge very slowly.
For these reasons, the Rosenbrock function is a common benchmark for testing the robustness and efficiency of optimization algorithms.