Duality
Duality is a fundamental concept in optimization, arising in many settings, including machine learning,
convex analysis, and game theory. It allows us to transform a complex primal problem into an often
simpler dual problem, providing insights into optimality, feasibility, and algorithmic efficiency.
In constrained optimization, consider a primal problem
\[
\min_x f(x) \, \text{ subject to } g_i (x) \leq 0, \, h_j (x) = 0
\]
where \(f(x)\) is the objective function, \(g_i(x)\) are the inequality constraints, and \(h_j(x)\) are the equality
constraints. We define the Lagrangian:
\[
\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)
\]
where \(\lambda_i \geq 0\) are the Lagrange multipliers (also called dual variables for inequality constraints) and
\(\nu_j\) are dual variables for equality constraints.
The dual problem is then:
\[
\max_{\lambda \geq 0, \nu} \, \inf_x \mathcal{L}( x, \lambda, \nu).
\]
The optimal value of the dual problem, \(d^*\), provides a lower bound on the optimal value of the primal
problem, \(p^*\):
\[
d^* \leq p^*.
\]
This fundamental result is known as weak duality; it holds for any optimization problem for which a
dual can be formulated.
The duality gap is the difference between the primal and dual optimal values:
\[
p^* - d^*.
\]
If the duality gap is zero (\( p^* = d^* \)), we have strong duality, meaning the dual problem attains
exactly the primal optimal value.
Strong duality does NOT always hold; in particular, it can fail for non-convex problems.
For convex problems, Slater's condition guarantees strong duality if there exists a
strictly feasible point \(x\) such that
\[
\forall i, \quad g_i(x) < 0 \quad \text{and} \quad \forall j, \quad h_j(x) = 0;
\]
in other words, at least one feasible point strictly satisfies the inequality constraints
(the equality constraints are assumed affine in the standard statement).
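To see weak and strong duality numerically, here is a minimal sketch (assuming NumPy and SciPy are available; the data \(c\), \(A\), \(b\) are made up for illustration) that solves a small linear program and its dual with scipy.optimize.linprog. A feasible, bounded LP enjoys strong duality, so the two optimal values coincide:

```python
import numpy as np
from scipy.optimize import linprog

# Primal LP: min c^T x  subject to  A x >= b, x >= 0  (made-up data).
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0],
              [2.0, 1.0]])
b = np.array([4.0, 5.0])

# linprog minimizes subject to A_ub x <= b_ub, so negate A x >= b.
primal = linprog(c, A_ub=-A, b_ub=-b)   # bounds default to x >= 0

# Dual LP: max b^T y  subject to  A^T y <= c, y >= 0.
# linprog only minimizes, so minimize -b^T y instead.
dual = linprog(-b, A_ub=A.T, b_ub=c)

print(primal.fun)   # p* = 9.0  at x = (1, 3)
print(-dual.fun)    # d* = 9.0  at y = (1, 1): p* = d*, strong duality
```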
In constrained optimization, duality is closely related to the KKT conditions.
Indeed, the dual problem's optimal solution corresponds to the KKT multipliers (Lagrange multipliers), which characterize the
constraints' influence on the optimal solution. Consequently, for a differentiable convex problem in which strong duality holds,
the KKT conditions are necessary and sufficient for optimality.
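As a minimal worked illustration (a made-up one-dimensional problem, not drawn from the text above), consider minimizing \(x^2\) subject to \(1 - x \leq 0\), with Lagrangian
\[
\mathcal{L}(x, \lambda) = x^2 + \lambda (1 - x).
\]
Stationarity gives \(2x = \lambda\), and complementary slackness \(\lambda(1 - x) = 0\) rules out \(\lambda = 0\) (which would give the infeasible \(x = 0\)), so \(x^* = 1\) and \(\lambda^* = 2\). The dual function \(\inf_x \mathcal{L}(x, \lambda) = \lambda - \lambda^2/4\) is maximized at \(\lambda = 2\), giving \(d^* = 1 = p^*\): strong duality holds, and the KKT point is the optimum.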
Note: Duality is widely used in machine learning and statistical learning theory:
- Support Vector Machines (SVMs):
The dual form of SVM optimization is often solved instead of the primal because it allows for kernel tricks in high-dimensional spaces.
- Regularization (Lasso, Ridge, Elastic Net):
Dual formulations help derive generalization bounds and computationally efficient optimization methods.
- Convex Optimization in Deep Learning:
Many training techniques, such as batch normalization and adversarial training, can be analyzed through duality principles.
Lipschitz Continuity
In convex optimization and machine learning, strong duality and efficient optimization often require certain regularity
conditions, such as Lipschitz continuity (smoothness condition) of the objective function or its
gradient. These conditions impact the stability and convergence of optimization algorithms.
Lipschitz Continuity:
Let \((X, d)\) and \((Y, e)\) be
metric spaces and \(f: X \to Y\).
If there exists \(L \in \mathbb{R}^+\) such that
\[
\forall a, b \in X, \, e(f(a), f(b)) \leq L d(a, b)
\]
then \(f\) is said to be
Lipschitz continuous on \(X\) with
Lipschitz constant \(L\).
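As a quick numerical illustration of this definition, the following sketch (assuming NumPy; \(f = \sin\) is chosen because \(|\cos| \leq 1\) makes \(L = 1\) its smallest Lipschitz constant) probes the Lipschitz inequality on random pairs:

```python
import numpy as np

# Sketch: probe the Lipschitz condition for f = sin on R, whose smallest
# Lipschitz constant is L = 1 (since |f'| = |cos| <= 1).
rng = np.random.default_rng(0)
a, b = rng.uniform(-10, 10, size=(2, 100_000))
mask = a != b  # avoid division by zero on coincident pairs
ratios = np.abs(np.sin(a[mask]) - np.sin(b[mask])) / np.abs(a[mask] - b[mask])
print(ratios.max() <= 1.0)   # True: e(f(a), f(b)) <= 1 * d(a, b)
```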
While Lipschitz continuity has broad applications in mathematical analysis, its significance in optimization lies in the boundedness it provides.
A key advantage of Lipschitz continuity is that it ensures function values or gradients do not grow uncontrollably. In other words, Lipschitz
continuity provides an upper bound on the rate of change of a function. This boundedness property helps prevent numerical instability, guarantees
uniform convergence rates, and ensures that optimization trajectories remain well-defined.
In many optimization problems, smoothness is an important property that ensures well-behaved gradients.
Specifically, a continuously differentiable function \(f: \mathbb{R}^n \to \mathbb{R}\) is said to be
\(L\)-smooth if its gradient is Lipschitz continuous, meaning
\[
\| \nabla f(x) - \nabla f(y) \| \leq L \|x - y\|, \, \forall x, y \in \mathbb{R}^n.
\]
Intuitively, this condition ensures that \(f\) does not change too abruptly, allowing gradient-based optimization methods
to converge predictably.
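For a concrete check (a sketch assuming NumPy, with a randomly generated positive definite \(A\)): the gradient of \(f(x) = \frac{1}{2}x^T A x\) is \(Ax\), so \(\|\nabla f(x) - \nabla f(y)\| = \|A(x - y)\| \leq \|A\|_2 \|x - y\|\), and the smallest valid \(L\) is the largest eigenvalue of \(A\):

```python
import numpy as np

# Sketch: for f(x) = 0.5 x^T A x with symmetric positive definite A,
# grad f(x) = A x, so the tightest Lipschitz constant of the gradient
# is the spectral norm of A, i.e. lambda_max(A).
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)           # symmetric positive definite
L = np.linalg.eigvalsh(A).max()   # lambda_max(A)

x, y = rng.standard_normal((2, 5))
lhs = np.linalg.norm(A @ x - A @ y)
print(lhs <= L * np.linalg.norm(x - y))   # True
```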
If the gradient is Lipschitz continuous and the function is strongly convex (as for the quadratic example below),
steepest descent converges at a linear rate. That is, there exists \( 0 < \mu < 1 \) such that:
\[
|f(x_{t+1}) - f(x^*)| \leq \mu \, |f(x_t) - f(x^*)|.
\]
Here, \(\mu\) is called the convergence rate.
For example, consider a quadratic loss function:
\[
f(x) = \frac{1}{2}x^T A x + b^T x + c \tag{1}
\]
where \(A \in \mathbb{R}^{n \times n}\) is symmetric positive definite, \(b \in \mathbb{R}^n\) and \(c \in \mathbb{R}\) is a scalar constant.
(Remember that we have a unique optimal point \(x^* = -A^{-1}b\).)
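This closed form is easy to verify; the sketch below (assuming NumPy, with randomly generated \(A\) and \(b\)) checks that the gradient \(Ax + b\) vanishes at \(x^* = -A^{-1}b\):

```python
import numpy as np

# Sketch: verify x* = -A^{-1} b by checking that the gradient A x + b
# vanishes there.
rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)             # symmetric positive definite
b = rng.standard_normal(4)

x_star = np.linalg.solve(A, -b)     # prefer solve() over forming A^{-1}
print(np.allclose(A @ x_star + b, 0.0))   # True: gradient is zero at x*
```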
To analyze the convergence of steepest descent, we first introduce the concept of a contraction mapping, which plays a key role
in linear convergence analysis.
Contraction Mapping:
Suppose \((M, d)\) is a metric space. A map \(f: M \to M\) is said to be a contraction on \(M\)
if and only if there exists a constant \(0 \leq k < 1\) such that:
\[
\forall x, y \in M, \, d(f(x), f(y)) \leq k \, d(x, y).
\]
This constant \(k\) is often called the contraction factor. The contraction mapping \(f\) is a Lipschitz function
with Lipschitz constant less than 1 and is uniformly continuous on its domain. If an iterative method satisfies this
contraction property, it ensures that the error shrinks at every step, leading to convergence.
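A classic illustration (a sketch using only the standard library; the map \(\cos\) is a standard textbook example, not drawn from the text above) is the iteration \(x \mapsto \cos(x)\) on \([0, 1]\), which is a contraction with factor \(k = \sin(1) \approx 0.84\) and converges to its unique fixed point:

```python
import math

# Sketch: x -> cos(x) maps [0, 1] into itself and is a contraction there,
# since |d/dx cos(x)| = |sin(x)| <= sin(1) ~ 0.84 < 1 on [0, 1].
# Iteration converges to the unique fixed point x = cos(x) ~ 0.739085.
x = 0.5
for _ in range(100):
    x = math.cos(x)
print(x)   # ~0.7390851332151607
```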
Now, suppose we apply steepest descent with a fixed step size \(\alpha\).
To analyze the iterates relative to the optimum \(x^*\), Function (1) can, without loss of generality (up to an additive constant, which affects neither the gradient nor the iterates), be rewritten as
\[
f(x) = \frac{1}{2}(x - x^*)^T A (x - x^*)
\]
with gradient:
\[
\nabla f(x) = A(x-x^*).
\]
Then the steepest descent iteration:
\[
x_{k+1} = x_k - \alpha \nabla f(x_k)
\]
can be rewritten as:
\[
x_{k+1} - x^* = (I - \alpha A)(x_k - x^*).
\]
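This identity can be checked in a few lines (a sketch assuming NumPy, with made-up \(A\), \(x^*\), and step size):

```python
import numpy as np

# Sketch: one steepest-descent step on the quadratic satisfies
# x1 - x* = (I - alpha A)(x0 - x*) exactly.
rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)             # symmetric positive definite
x_star = rng.standard_normal(3)
alpha = 0.1

x0 = rng.standard_normal(3)
x1 = x0 - alpha * A @ (x0 - x_star)             # gradient step
pred = (np.eye(3) - alpha * A) @ (x0 - x_star)  # predicted error
print(np.allclose(x1 - x_star, pred))           # True
```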
For the iteration to be a contraction (i.e., to ensure the error shrinks), we require:
\[
| 1 - \alpha \lambda_i | < 1, \, \forall i \tag{2}
\]
where \(\lambda_i\) are the eigenvalues of \(A\). This ensures that every eigenvalue of \(I - \alpha A\) has
absolute value strictly less than 1.
Let \(\lambda_{\min}\) and \(\lambda_{\max}\) be the smallest and largest eigenvalues of \(A\), respectively.
From Inequality (2), we get
\[
0 < \alpha \lambda_i < 2 \Longrightarrow 0 < \alpha < \frac{2}{\lambda_{\max}}.
\]
Thus, for \(0 < \alpha < \frac{2}{\lambda_{\max}}\), the matrix \(I -\alpha A\) is a contraction with respect to
the Euclidean norm. This condition is critical because it guarantees that the error diminishes at each iteration.
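The threshold is easy to observe empirically; in the sketch below (assuming NumPy, with a made-up diagonal \(A\) so that \(\lambda_{\max} = 10\) and the threshold is \(0.2\)), a step size just under the bound converges while one just over it diverges:

```python
import numpy as np

# Sketch: the error contracts for alpha < 2 / lambda_max and grows beyond it.
A = np.diag([1.0, 10.0])      # lambda_max = 10, so the threshold is 0.2
x_star = np.zeros(2)          # f(x) = 0.5 (x - x*)^T A (x - x*) with x* = 0

def run(alpha, steps=50):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - alpha * A @ x     # gradient step
    return np.linalg.norm(x - x_star)

print(run(0.19))   # tiny: both |1 - alpha lambda_i| < 1, error shrinks
print(run(0.21))   # large: |1 - 0.21 * 10| = 1.1 > 1, error grows
```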
To achieve the fastest convergence rate (i.e., to minimize the worst-case contraction factor), we choose \(\alpha^*\)
such that the two extreme eigenvalues of \(I -\alpha A\) have the same absolute deviation from 1:
\[
|1 - \alpha^* \lambda_{\min}| = |1 - \alpha^* \lambda_{\max}|.
\]
Assuming \(1 - \alpha^* \lambda_{\min} \geq 0\) and \(1 - \alpha^* \lambda_{\max} \leq 0\), we solve:
\[
1 - \alpha^* \lambda_{\min} = \alpha^* \lambda_{\max} - 1,
\]
which gives:
\[
\alpha^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}.
\]
With this step size, the contraction factor is:
\[
k = 1 - \alpha^*\lambda_{\min} = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} = \frac{\kappa-1}{\kappa+1},
\]
where \(\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}\) is the condition number.
Since the function error is quadratic in the iterate error (here \(f(x) - f(x^*) = \frac{1}{2}(x - x^*)^T A (x - x^*)\)), a per-step contraction factor \(k\) for the iterates translates (roughly) into a factor \(k^2\) for function values:
\[
\mu = k^2 = \left(\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}\right)^2 = \left(\frac{\kappa-1}{\kappa+1}\right)^2.
\]
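The rate can be observed directly; in this sketch (assuming NumPy, with a made-up diagonal \(A\) giving \(\kappa = 9\)), the starting point excites both extreme eigenvalues, so the per-step ratio of function errors equals \(\mu = 0.64\) exactly:

```python
import numpy as np

# Sketch: with alpha* = 2 / (lambda_min + lambda_max), the worst-case
# function error contracts by mu = ((kappa - 1) / (kappa + 1))^2 per step.
A = np.diag([1.0, 9.0])                  # kappa = 9
alpha = 2.0 / (1.0 + 9.0)                # alpha* = 0.2
mu = ((9.0 - 1.0) / (9.0 + 1.0)) ** 2    # 0.64

f = lambda x: 0.5 * x @ A @ x            # x* = 0, so f(x*) = 0
x = np.array([1.0, 1.0])                 # excites both extreme eigenvalues
for _ in range(5):
    x_new = x - alpha * A @ x
    print(f(x_new) / f(x))               # 0.64 each step, matching mu
    x = x_new
```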
Note: In general, problems with low condition numbers (i.e., when \(\kappa\) is close to 1) converge (or train) faster.
On the other hand, a high condition number (ill-conditioned matrix \(A\)) leads to slower convergence.
Interactive Duality Visualization
This interactive visualization demonstrates the relationship between a primal problem
and its dual in linear programming. By adjusting the parameters, you can
observe how both problems change and how concepts like strong duality and complementary slackness work.