Duality
Duality is a fundamental concept in optimization, arising in many settings, including machine learning,
convex analysis, and game theory. It allows us to transform a complex primal problem into an often
simpler dual problem, providing insights into optimality, feasibility, and algorithmic efficiency.
In constrained optimization, consider a primal problem
\[
\min_x f(x) \, \text{ subject to } g_i (x) \leq 0, \, h_j (x) = 0
\]
where \(f(x)\) is the objective function, \(g_i(x)\) are the inequality constraints, and \(h_j(x)\) are the equality
constraints. We define the Lagrangian:
\[
\mathcal{L}(x, \lambda, \nu) = f(x) + \sum_i \lambda_i g_i(x) + \sum_j \nu_j h_j(x)
\]
where \(\lambda_i \geq 0\) are the Lagrange multipliers (also called dual variables for inequality constraints) and
\(\nu_j\) are dual variables for equality constraints.
The dual problem is then:
\[
\max_{\lambda \geq 0, \nu} \, \inf_x \mathcal{L}( x, \lambda, \nu).
\]
The optimal value of the dual problem, \(d^*\), provides a lower bound on the optimal value of the primal
problem, \(p^*\):
\[
d^* \leq p^*.
\]
This fundamental result is known as weak duality; it holds for any optimization problem for which a
dual can be formulated.
The duality gap is the difference between the primal and dual optimal values:
\[
p^* - d^*.
\]
If the duality gap is zero (\( p^* = d^* \)), we have strong duality, meaning the dual problem attains
exactly the primal optimal value.
Strong duality does NOT always hold; in particular, it can fail for non-convex problems.
For convex problems, Slater's condition guarantees strong duality if there exists a
strictly feasible point \(x\) such that
\[
\forall i, \quad g_i(x) < 0 \quad \text{and} \quad \forall j, \quad h_j(x) = 0;
\]
in other words, at least one feasible point strictly satisfies the inequality constraints
(the equality constraints are assumed affine in the standard statement).
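To see weak and strong duality numerically, here is a minimal sketch (assuming NumPy and SciPy are available; the data \(c\), \(A\), \(b\) are made up for illustration) that solves a small linear program and its dual with scipy.optimize.linprog. A feasible, bounded LP enjoys strong duality, so the two optimal values coincide:

```python
import numpy as np
from scipy.optimize import linprog

# Primal LP: min c^T x  subject to  A x >= b, x >= 0  (made-up data).
c = np.array([3.0, 2.0])
A = np.array([[1.0, 1.0],
              [2.0, 1.0]])
b = np.array([4.0, 5.0])

# linprog minimizes subject to A_ub x <= b_ub, so negate A x >= b.
primal = linprog(c, A_ub=-A, b_ub=-b)   # bounds default to x >= 0

# Dual LP: max b^T y  subject to  A^T y <= c, y >= 0.
# linprog only minimizes, so minimize -b^T y instead.
dual = linprog(-b, A_ub=A.T, b_ub=c)

print(primal.fun)   # p* = 9.0  at x = (1, 3)
print(-dual.fun)    # d* = 9.0  at y = (1, 1): p* = d*, strong duality
```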
In constrained optimization, duality is closely related to the KKT conditions.
Indeed, the dual problem's optimal solution corresponds to the KKT multipliers (Lagrange multipliers), which characterize the
constraints' influence on the optimal solution. Consequently, for a differentiable convex problem in which strong duality holds,
the KKT conditions are necessary and sufficient for optimality.
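As a minimal worked illustration (a made-up one-dimensional problem, not drawn from the text above), consider minimizing \(x^2\) subject to \(1 - x \leq 0\), with Lagrangian
\[
\mathcal{L}(x, \lambda) = x^2 + \lambda (1 - x).
\]
Stationarity gives \(2x = \lambda\), and complementary slackness \(\lambda(1 - x) = 0\) rules out \(\lambda = 0\) (which would give the infeasible \(x = 0\)), so \(x^* = 1\) and \(\lambda^* = 2\). The dual function \(\inf_x \mathcal{L}(x, \lambda) = \lambda - \lambda^2/4\) is maximized at \(\lambda = 2\), giving \(d^* = 1 = p^*\): strong duality holds, and the KKT point is the optimum.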
Note: Duality is widely used in machine learning and statistical learning theory:
- Support Vector Machines (SVMs):
The dual form of SVM optimization is often solved instead of the primal because it allows for kernel tricks in high-dimensional spaces.
- Regularization (Lasso, Ridge, Elastic Net):
Dual formulations help derive generalization bounds and computationally efficient optimization methods.
- Convex Optimization in Deep Learning:
Many training techniques, such as batch normalization and adversarial training, can be analyzed through duality principles.
Lipschitz Continuity
In convex optimization and machine learning, strong duality and efficient optimization often require certain regularity
conditions, such as Lipschitz continuity (smoothness condition) of the objective function or its
gradient. These conditions impact the stability and convergence of optimization algorithms.
Lipschitz Continuity:
Let \((X, d)\) and \((Y, e)\) be
metric spaces and \(f: X \to Y\).
If there exists \(L \in \mathbb{R}^+\) such that
\[
\forall a, b \in X, \, e(f(a), f(b)) \leq L d(a, b)
\]
then \(f\) is said to be
Lipschitz continuous on \(X\) with
Lipschitz constant \(L\).
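As a quick numerical illustration of this definition, the following sketch (assuming NumPy; \(f = \sin\) is chosen because \(|\cos| \leq 1\) makes \(L = 1\) its smallest Lipschitz constant) probes the Lipschitz inequality on random pairs:

```python
import numpy as np

# Sketch: probe the Lipschitz condition for f = sin on R, whose smallest
# Lipschitz constant is L = 1 (since |f'| = |cos| <= 1).
rng = np.random.default_rng(0)
a, b = rng.uniform(-10, 10, size=(2, 100_000))
mask = a != b  # avoid division by zero on coincident pairs
ratios = np.abs(np.sin(a[mask]) - np.sin(b[mask])) / np.abs(a[mask] - b[mask])
print(ratios.max() <= 1.0)   # True: e(f(a), f(b)) <= 1 * d(a, b)
```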
While Lipschitz continuity has broad applications in mathematical analysis, its significance in optimization lies in the boundedness it provides.
A key advantage of Lipschitz continuity is that it ensures function values or gradients do not grow uncontrollably. In other words, Lipschitz
continuity provides an upper bound on the rate of change of a function. This boundedness property helps prevent numerical instability, guarantees
uniform convergence rates, and ensures that optimization trajectories remain well-defined.
In many optimization problems, smoothness is an important property that ensures well-behaved gradients.
Specifically, a continuously differentiable function \(f: \mathbb{R}^n \to \mathbb{R}\) is said to be
\(L\)-smooth if its gradient is Lipschitz continuous, meaning
\[
\| \nabla f(x) - \nabla f(y) \| \leq L \|x - y\|, \, \forall x, y \in \mathbb{R}^n.
\]
Intuitively, this condition ensures that \(f\) does not change too abruptly, allowing gradient-based optimization methods
to converge predictably.
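For a concrete check (a sketch assuming NumPy, with a randomly generated positive definite \(A\)): the gradient of \(f(x) = \frac{1}{2}x^T A x\) is \(Ax\), so \(\|\nabla f(x) - \nabla f(y)\| = \|A(x - y)\| \leq \|A\|_2 \|x - y\|\), and the smallest valid \(L\) is the largest eigenvalue of \(A\):

```python
import numpy as np

# Sketch: for f(x) = 0.5 x^T A x with symmetric positive definite A,
# grad f(x) = A x, so the tightest Lipschitz constant of the gradient
# is the spectral norm of A, i.e. lambda_max(A).
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5))
A = M @ M.T + np.eye(5)           # symmetric positive definite
L = np.linalg.eigvalsh(A).max()   # lambda_max(A)

x, y = rng.standard_normal((2, 5))
lhs = np.linalg.norm(A @ x - A @ y)
print(lhs <= L * np.linalg.norm(x - y))   # True
```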
If the gradient is Lipschitz continuous and the function is strongly convex (as for the quadratic example below),
steepest descent converges at a linear rate. That is, there exists \( 0 < \mu < 1 \) such that:
\[
|f(x_{t+1}) - f(x^*)| \leq \mu \, |f(x_t) - f(x^*)|.
\]
Here, \(\mu\) is called the convergence rate.
For example, consider a quadratic loss function:
\[
f(x) = \frac{1}{2}x^T A x + b^T x + c \tag{1}
\]
where \(A \in \mathbb{R}^{n \times n}\) is symmetric positive definite, \(b \in \mathbb{R}^n\) and \(c \in \mathbb{R}\) is a scalar constant.
(Remember that we have a unique optimal point \(x^* = -A^{-1}b\).)
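This closed form is easy to verify; the sketch below (assuming NumPy, with randomly generated \(A\) and \(b\)) checks that the gradient \(Ax + b\) vanishes at \(x^* = -A^{-1}b\):

```python
import numpy as np

# Sketch: verify x* = -A^{-1} b by checking that the gradient A x + b
# vanishes there.
rng = np.random.default_rng(2)
M = rng.standard_normal((4, 4))
A = M @ M.T + np.eye(4)             # symmetric positive definite
b = rng.standard_normal(4)

x_star = np.linalg.solve(A, -b)     # prefer solve() over forming A^{-1}
print(np.allclose(A @ x_star + b, 0.0))   # True: gradient is zero at x*
```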
To analyze the convergence of steepest descent, we first introduce the concept of a contraction mapping, which plays a key role
in linear convergence analysis.
Contraction Mapping:
Suppose \((M, d)\) is a metric space. A map \(f: M \to M\) is said to be a contraction on \(M\)
if and only if there exists a constant \(0 \leq k < 1\) such that:
\[
\forall x, y \in M, \, d(f(x), f(y)) \leq k \, d(x, y).
\]
This constant \(k\) is often called the contraction factor. The contraction mapping \(f\) is a Lipschitz function
with Lipschitz constant less than 1 and is uniformly continuous on its domain. If an iterative method satisfies this
contraction property, it ensures that the error shrinks at every step, leading to convergence.
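A classic illustration (a sketch using only the standard library; the map \(\cos\) is a standard textbook example, not drawn from the text above) is the iteration \(x \mapsto \cos(x)\) on \([0, 1]\), which is a contraction with factor \(k = \sin(1) \approx 0.84\) and converges to its unique fixed point:

```python
import math

# Sketch: x -> cos(x) maps [0, 1] into itself and is a contraction there,
# since |d/dx cos(x)| = |sin(x)| <= sin(1) ~ 0.84 < 1 on [0, 1].
# Iteration converges to the unique fixed point x = cos(x) ~ 0.739085.
x = 0.5
for _ in range(100):
    x = math.cos(x)
print(x)   # ~0.7390851332151607
```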
Now, suppose we apply steepest descent with a fixed step size \(\alpha\).
To analyze the iterates relative to the optimum \(x^*\), Function (1) can, without loss of generality (up to an additive constant, which affects neither the gradient nor the iterates), be rewritten as
\[
f(x) = \frac{1}{2}(x - x^*)^T A (x - x^*)
\]
with gradient:
\[
\nabla f(x) = A(x-x^*).
\]
Then the steepest descent iteration:
\[
x_{k+1} = x_k - \alpha \nabla f(x_k)
\]
can be rewritten as:
\[
x_{k+1} - x^* = (I - \alpha A)(x_k - x^*).
\]
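This identity can be checked in a few lines (a sketch assuming NumPy, with made-up \(A\), \(x^*\), and step size):

```python
import numpy as np

# Sketch: one steepest-descent step on the quadratic satisfies
# x1 - x* = (I - alpha A)(x0 - x*) exactly.
rng = np.random.default_rng(3)
M = rng.standard_normal((3, 3))
A = M @ M.T + np.eye(3)             # symmetric positive definite
x_star = rng.standard_normal(3)
alpha = 0.1

x0 = rng.standard_normal(3)
x1 = x0 - alpha * A @ (x0 - x_star)             # gradient step
pred = (np.eye(3) - alpha * A) @ (x0 - x_star)  # predicted error
print(np.allclose(x1 - x_star, pred))           # True
```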
For the iteration to be a contraction (i.e., to ensure the error shrinks), we require:
\[
| 1 - \alpha \lambda_i | < 1, \, \forall i \tag{2}
\]
where \(\lambda_i\) are the eigenvalues of \(A\). This ensures that every eigenvalue of \(I - \alpha A\) has
absolute value strictly less than 1.
Let \(\lambda_{\min}\) and \(\lambda_{\max}\) be the smallest and largest eigenvalues of \(A\), respectively.
From Inequality (2), we get
\[
0 < \alpha \lambda_i < 2 \Longrightarrow 0 < \alpha < \frac{2}{\lambda_{\max}}.
\]
Thus, for \(0 < \alpha < \frac{2}{\lambda_{\max}}\), the matrix \(I -\alpha A\) is a contraction with respect to
the Euclidean norm. This condition is critical because it guarantees that the error diminishes at each iteration.
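The threshold is easy to observe empirically; in the sketch below (assuming NumPy, with a made-up diagonal \(A\) so that \(\lambda_{\max} = 10\) and the threshold is \(0.2\)), a step size just under the bound converges while one just over it diverges:

```python
import numpy as np

# Sketch: the error contracts for alpha < 2 / lambda_max and grows beyond it.
A = np.diag([1.0, 10.0])      # lambda_max = 10, so the threshold is 0.2
x_star = np.zeros(2)          # f(x) = 0.5 (x - x*)^T A (x - x*) with x* = 0

def run(alpha, steps=50):
    x = np.array([1.0, 1.0])
    for _ in range(steps):
        x = x - alpha * A @ x     # gradient step
    return np.linalg.norm(x - x_star)

print(run(0.19))   # tiny: both |1 - alpha lambda_i| < 1, error shrinks
print(run(0.21))   # large: |1 - 0.21 * 10| = 1.1 > 1, error grows
```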
To achieve the fastest convergence rate (i.e., to minimize the worst-case contraction factor), we choose \(\alpha^*\)
such that the two extreme eigenvalues of \(I -\alpha A\) have the same absolute deviation from 1:
\[
|1 - \alpha^* \lambda_{\min}| = |1 - \alpha^* \lambda_{\max}|.
\]
Assuming \(1 - \alpha^* \lambda_{\min} \geq 0\) and \(1 - \alpha^* \lambda_{\max} \leq 0\), we solve:
\[
1 - \alpha^* \lambda_{\min} = \alpha^* \lambda_{\max} - 1,
\]
which gives:
\[
\alpha^* = \frac{2}{\lambda_{\min} + \lambda_{\max}}.
\]
With this step size, the contraction factor is:
\[
k = 1 - \alpha^*\lambda_{\min} = \frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}} = \frac{\kappa-1}{\kappa+1},
\]
where \(\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}\) is the condition number.
Since the function error is quadratic in the iterate error (here \(f(x) - f(x^*) = \frac{1}{2}(x - x^*)^T A (x - x^*)\)), a per-step contraction factor \(k\) for the iterates translates (roughly) into a factor \(k^2\) for function values:
\[
\mu = k^2 = \left(\frac{\lambda_{\max} - \lambda_{\min}}{\lambda_{\max} + \lambda_{\min}}\right)^2 = \left(\frac{\kappa-1}{\kappa+1}\right)^2.
\]
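The rate can be observed directly; in this sketch (assuming NumPy, with a made-up diagonal \(A\) giving \(\kappa = 9\)), the starting point excites both extreme eigenvalues, so the per-step ratio of function errors equals \(\mu = 0.64\) exactly:

```python
import numpy as np

# Sketch: with alpha* = 2 / (lambda_min + lambda_max), the worst-case
# function error contracts by mu = ((kappa - 1) / (kappa + 1))^2 per step.
A = np.diag([1.0, 9.0])                  # kappa = 9
alpha = 2.0 / (1.0 + 9.0)                # alpha* = 0.2
mu = ((9.0 - 1.0) / (9.0 + 1.0)) ** 2    # 0.64

f = lambda x: 0.5 * x @ A @ x            # x* = 0, so f(x*) = 0
x = np.array([1.0, 1.0])                 # excites both extreme eigenvalues
for _ in range(5):
    x_new = x - alpha * A @ x
    print(f(x_new) / f(x))               # 0.64 each step, matching mu
    x = x_new
```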
Note: In general, problems with low condition numbers (i.e., when \(\kappa\) is close to 1) converge (or train) faster.
On the other hand, a high condition number (ill-conditioned matrix \(A\)) leads to slower convergence.
Interactive Duality Visualization
This interactive visualization demonstrates the relationship between a primal problem
and its dual in linear programming. By adjusting the parameters, you can
observe how both problems change and how concepts like strong duality and complementary slackness work.