Constrained Optimization

Constrained Optimization Problem Lagrange multipliers The KKT Conditions Go: Duality

Constrained Optimization Problem

The constrained optimization problem is given by \[ \theta^* \in \arg \min_{\theta \in \mathcal{C}} \mathcal{L}(\theta): \mathbb{R}^D \to \mathbb{R}. \] Here, \(\mathcal{C}\) is the feasible set as the subset of the parameter space \(\Theta \in \mathbb{R}^D\) that satisfies a set of constraints: \[ \mathcal{C} = \{\theta \in \mathbb{R}^D : g_i (\theta) \leq 0, i \in \mathcal{I}, \, h_j (\theta) = 0, j \in \mathcal{E} \}. \] where \(\mathcal{I}\) is a set of inequality constraints and \(\mathcal{E}\) is a set of equality constraints.

Often, We convert this constrained optimization problem into an unconstrained optimization problem by introducing penalty terms that measure how much we violate each constraint and adding them to the objective function. For example, anpenalty-modified objective function can be written as: \[ \mathcal{L_p}(\theta) = \mathcal{L}(\theta) + \rho \sum_{i\in \mathcal{I}} \max(0, g_i(\theta))^2 + \rho \sum_{j\in \mathcal{E}} h_j(\theta)^2 \] where \(\rho >0\) is a penalty parameter controlling the strength of the penalty.
However, when constraints must be strictly satisfied, or computational resources are not a major concern, following method will provide higher precision results.
Note: For large-scale problems, hybrid approaches like Augmented Lagrangian methods can be better.

Lagrange multipliers

First, we consider a constrained optimization problem without inequality constraints.

Assume we have only one equality constraint \(h(\theta) = 0\). So, the constraint surface is given by \[ \mathcal{C} = \{\theta \in \mathbb{R}^D : h(\theta) = 0\} \] where \(h: \mathbb{R}^m \to \mathbb{R}\) is a differentiable function.
Consider another point, \(\theta + \epsilon \in \mathcal{C}\) with infinitesimal displacement vectors \(\epsilon \in \mathbb{R}^D\). By the firtst-order Taylor expansion around \(\theta\), \[ h(\theta + \epsilon) \approx h(\theta) + \epsilon^T \nabla h(\theta). \] Since both points are on the same constraint surface, \(h(\theta) = h(\theta + \epsilon)\) and thus \[\epsilon^T \nabla h(\theta) \approx 0. \] Since \(\epsilon\) is parallel to the constraint surface, \(\nabla h(\theta)\) must be orthogonal to it. Therefore, for any point \(\theta \in \mathcal{C}\), \(\nabla h(\theta)\) is orthogonal to the constraint surface \(\mathcal{C}\).
Note: More formally, the tangent space of \(\mathcal{C}\) at \(\theta\) consists of all vectors \(\epsilon\) that satisify \(h(\theta + \epsilon) = 0\). By the Taylor expansion, this is equivalent to the set: \[ T_{\theta} \mathcal{C} = \{\epsilon \in \mathbb{R}^D : \epsilon^T \nabla h(\theta) =0\}. \] Since \(\nabla h(\theta)\) satisfied \(\epsilon^T \nabla h(\theta) =0\) for all \(\epsilon \in T_{\theta} \mathcal{C}\), it follows that \(\nabla h(\theta)\) must be orthogonal to the tangent space of \(\mathcal{C}\).

We are looking for a point \(\theta^* \in \mathcal{C}\) such that the objective \(\mathcal{L}(\theta)\) is minimized. As mentioned above, such a point must satisfy the condition that \(\nabla h(\theta^*)\) is orthogonal to the constraint surface \(\mathcal{C}\). In addition, to minimize \(\mathcal{L}(\theta)\) on \(\mathcal{C}\), \(\,\nabla \mathcal{L}(\theta^*)\) also must be orthogonal to the surface. Otherwise, any movement along the surface would decrease the objective \(\mathcal{L}(\theta)\), contradicting the assumption that \(\theta^*\) is a local minimum.
Since both \(\nabla h(\theta)\) and \(\nabla \mathcal{L}(\theta)\) are orthogonal to the constraint surface at \(\theta^*\), they must be parallel(or anti-parallel) to each other. Thus, there exists a constant(Lagrange multiplier), \(\lambda \in \mathbb{R}\) such that \[ \nabla \mathcal{L}(\theta^*) = \lambda \nabla h(\theta^*) \tag{1} \] and an objective(or, Lagrangian) is given by \[ L(\theta, \lambda) = \mathcal{L}(\theta) + \lambda h(\theta). \]
At a stationary point of the Lagrangian, \[ \nabla_{\theta, \lambda} L(\theta, \lambda) = 0 \] which means \[ \lambda \nabla_{\theta} h(\theta) = \nabla \mathcal{L}(\theta), \qquad h(\theta) = 0. \] Such a point is called a critical point, which statisfies \(h(\theta) = 0\) and Equation (1).
For \(m\) equality constraints: \[ L(\theta, \lambda) = \mathcal{L}(\theta) + \sum_{j=1}^p \lambda_j h_j(\theta). \] and \[ \nabla \mathcal{L}(\theta^*) = \lambda^T \nabla h(\theta^*), \quad \lambda \in \mathbb{R}^p \]

The KKT Conditions

Next, consider the case where we have a single inequality constraint \(g(\theta) \leq 0\). We create the lower bound \(\mu g(\theta)\), where \(\mu \geq 0\). So, the Lagrangian can be written as \[ L(\theta, \mu) = \mathcal{L}(\theta) + \mu g(\theta). \] For multiple inequality constraints, \[ L(\theta, \mu) = \mathcal{L}(\theta) + \sum_{i}^m \mu_i g_i(\theta). \] If \(g_i (\theta) = 0\), then the inequality constraint \(g_i (\theta) \leq 0\) is said to be active, and we can build an active set \(\mathcal{A}(\theta) = \{i \in \{1, \cdots, m\}: g_i(\theta) = 0\}\). Technically, we only need the \(i\)th inequality constraint in the active set.

In general, for multiple inequality constraints and equality constraints, we obtain the Lagrangian: \[ L(\theta, \mu, \lambda) = \mathcal{L}(\theta) + \sum_{i \in \mathcal{A}} \mu_i g_i(\theta) + \sum_{j}^p \lambda_j h_j(\theta) \] and our optimization problem becomes \[ \min_{\theta} \max_{\mu \geq 0, \lambda} L(\theta, \mu, \lambda). \tag{P} \]

Theorem 1: Karush-Kuhn-Tucker (KKT) conditions If the objective function \(\mathcal{L}\) and constraint functions \(g\) are convex, then all critical points of the optimization problem (P) satisfy the following conditions:
  1. Primal Feasibility:
  2. \[ g(\theta^*) \leq 0, \quad h(\theta^*) = 0 \] All original constraints must be satisfied at \(\theta^*\).
  3. Stationarity:
  4. \[ \nabla \mathcal{L}(\theta^*) + \sum_i^m \mu_i \nabla g_i (\theta^*) + \sum_j^p \lambda_j \nabla h_j (\theta^*) = 0 \] or equivalently, \[ \nabla \mathcal{L}(\theta^*) + \mu^T \nabla g (\theta^*) + \lambda^T \nabla h (\theta^*) = 0. \] The gradient of the Lagrangian is zero at \(\theta^*\).
  5. Dual feasibility
  6. \[ \mu \geq 0 \] When constraints are violated \(g(x) > 0\), by increasing the objective , pushing the solution toward the feasible region, where the inequality constraints are satisfied \(g(x) \leq 0\).
  7. Complementary slackness
  8. \[ \mu \odot g = 0 \] For each inequality constraint \(g_i(\theta^*) \leq 0, \quad i \in \{1, \cdots, m\}\), either
    • \(g_i(\theta^*) = 0, \, \mu_i \neq 0\) (The local oplimal \(\theta^*\) lies "on" the constraint boundary. The constraint is active.)
    • or
    • \(g_i(\theta^*) < 0, \, \mu_i = 0\) (The constraint is inactive.)

    Note: "\(\odot\)" represents the element-wise product(also known as Hadamard product). Only the constraints in the active set \(\mathcal{A}(\theta^*) = \{i: g_i(\theta^*)=0 \}\)contribute to the optimization.

Note: In addition, we need some regularity conditions for (P). Usually, we assume that the "active" \(\nabla g_i (\theta^*)\) and \(\nabla h_j (\theta^*)\) are linearly independent. This is called the linear independence constraint qualification(LICQ).

This is critical in some machine learning applications such as the support vector machine(SVM) in classification, but here, we introduce a simple optimization problem that can be solved by hand.

Example: Objective: \[ \min f(x) = (x_1-2)^2 + x_2^2 + x_3^2 \] Constraints: \[ \begin{align*} &x_1 + x_2 \leq 1 \Longrightarrow g_1(x) = x_1 + x_2 -1 \leq 0 \tag{g1} \\\\ &x_1 + x_3 \leq 1 \Longrightarrow g_2(x) = x_1 + x_3 -1 \leq 0 \tag{g2} \\\\ &x_1 = x_2 \Longrightarrow h_1(x) = x_1 - x_2 = 0 \tag{h1} \\\\ \end{align*} \] The Lagrangian can be written as: \[ \begin{align*} \mathcal{L}(x, \mu, \lambda) &= f(x) + \mu_1 g_1(x) + \mu_2 g_2(x) + \lambda h(x) \\\\ &= (x_1-2)^2 + x_2^2 + x_3^2 + \mu_1 (x_1 + x_2 -1) + \mu_2 (x_1 + x_3 -1) + \lambda (x_1 - x_2). \end{align*} \] By the stationarity condition, \[ \begin{align*} &\frac{\partial \mathcal{L}}{\partial x_1} = 2x_1 -4 + \mu_1 + \mu_2 + \lambda = 0 \tag{1} \\\\ &\frac{\partial \mathcal{L}}{\partial x_2} = 2x_2 + \mu_1 - \lambda = 0 \tag{2} \\\\ &\frac{\partial \mathcal{L}}{\partial x_3} = 2x_3 + \mu_2 = 0. \tag{3} \\\\ \end{align*} \] Note: Here, if LICQ is not satisfied, you cannot use KKT conditions.
By the dual feasibility condition, \[ \mu_1 \geq 0, \quad \mu_2 \geq 0. \tag{4} \] By the complementary slackness condition, \[ \begin{align*} &\mu_1 (x_1 + x_2 -1) = 0 \tag{5} \\\\ &\mu_2 (x_1 + x_3 -1) = 0 \tag{6} \\\\ \end{align*} \] The optimal \(x^*\) must satisfy all conditions: (g1), (g2), (h1), and (1) ~ (6).
On Equation (5) and (6):
If \(\mu_1 > 0\) and \(\mu_2 >0\), then \(x_1 = x_2 = x_3 = \frac{1}{2}\). However, if \(x_3 = \frac{1}{2}\), Equation (3) gives \(\mu_2 = -1\). This violates Condition (4).
Instead, if \(\mu_1 > 0\) and \(\mu_2 = 0\), then \(x_1 = x_2 = \frac{1}{2}, x_3 = 0\). Using Equation (1) and (2), we obtain \(\mu_1 = 1\) and \(\lambda = 2\). In this case, all conditions are satisfied.
Therefore, the optimal solution is \[ x^* = \begin{bmatrix} \frac{1}{2} \\ \frac{1}{2} \\ 0 \end{bmatrix} \] and \[ f(x^*) = (\frac{1}{2} - 2)^2 + (\frac{1}{2})^2 + 0 = \frac{5}{2}. \]

A sample code for optimization problem with KKT conditions and a simple penalty method is as follows:

                            import numpy as np

                            # Objective function and its gradient
                            def objective(x):
                                return (x[0] - 2) ** 2 + x[1] ** 2 + x[2] ** 2

                            def grad_objective(x):
                                return np.array([2 * (x[0] - 2), 2 * x[1], 2 * x[2]])

                            # Constraints and their gradients
                            def inequality_constraint_1(x):
                                return x[0] + x[1] - 1  # g1

                            def inequality_constraint_2(x):
                                return x[0] + x[2] - 1  # g2

                            def equality_constraint(x):
                                return x[0] - x[1]  # h

                            def grad_inequality_constraint_1():
                                return np.array([1, 1, 0])  # Dg1

                            def grad_inequality_constraint_2():
                                return np.array([1, 0, 1])  # Dg2

                            def grad_equality_constraint():
                                return np.array([1, -1, 0])  # Dh

                            # Simple Penalty Method
                            def penalty_method(rho=0.01, lr=0.05, tol=1e-6, max_iter=2000, rho_growth=1.2, clip_grad=1.0):
                                
                                x = np.array([0.5, 0.5, 0.5])  # Initial point
                                
                                for i in range(max_iter):
                                    # Evaluate constraints with clipping to avoid invalid values
                                    g1 = max(0, inequality_constraint_1(x))  # g1(x)
                                    g2 = max(0, inequality_constraint_2(x))  # g2(x)
                                    h = equality_constraint(x)              # h(x)

                                    # Compute gradients of penalty terms
                                    grad_penalty = (
                                        rho * g1 * grad_inequality_constraint_1() +
                                        rho * g2 * grad_inequality_constraint_2() +
                                        rho * h * grad_equality_constraint()
                                    )

                                    # Gradient of the penalized objective function
                                    grad = grad_objective(x) + grad_penalty

                                    # Gradient clipping to prevent instability
                                    grad = np.clip(grad, -clip_grad, clip_grad)

                                    # Gradient descent step: Learning rate diminishes over iterations
                                    lr_t = lr / (1 + i / 10)  
                                    x -= lr_t * grad
                                
                                    # Check convergence
                                    constraint_violation = max(abs(g1), abs(g2), abs(h))
                                    if np.linalg.norm(grad) < tol and constraint_violation < tol:
                                        print(f"Converged after {i + 1} iterations.")
                                        break

                                    # Increase penalty parameter
                                    if abs(h) > tol or g1 > tol or g2 > tol:
                                        rho = min(rho * rho_growth, 1e6)  # Cap rho at a reasonable level

                                    # Debugging output for monitoring progress
                                    #print(f"Iteration {iteration + 1}: x = {x}, f(x) = {objective(x):.6f}, rho = {rho}")

                                return x, objective(x)

                            # KKT method (with Newton's method)
                            def kkt_method(tol = 1e-8, max_iter = 10):
                                # Initial values of parameters: x1, x2, x3, mu1, mu2, s1, s2, lambda
                                # Note: s1 and s2 are slack variables.
                                parameters = np.array([2.0, 0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 0.5]) 

                                for i in range(max_iter):
                                    x = parameters[:3] / np.array([1.0, 1.0, 1.0]) # Scaling to avoid singular Jacobian or ill-conditioned
                                    mu1, mu2, s1, s2, lambd = parameters[3:] 

                                    # Gradients
                                    grad_f = grad_objective(x)
                                    grad_g1 = grad_inequality_constraint_1()
                                    grad_g2 = grad_inequality_constraint_2()
                                    grad_h = grad_equality_constraint()

                                    # Residuals (KKT conditions)
                                    r_stationarity = grad_f + mu1 * grad_g1 + mu2 * grad_g2 + lambd * grad_h
                                    r_primal_feasibility = np.array([
                                            inequality_constraint_1(x) + s1,  # g1(x) + s1 == 0
                                            inequality_constraint_2(x) + s2,  # g2(x) + s2 == 0
                                            equality_constraint(x)            # h(x) == 0
                                    ])
                                    r_complementarity = np.array([
                                            mu1 * s1,  # mu1*s1 == 0
                                            mu2 * s2   # mu2*s2 == 0
                                    ])

                                    # Residual vector
                                    residuals = np.concatenate([r_stationarity, r_primal_feasibility, r_complementarity])

                                    # Jacobian matrix is 8x8 (8 equations and 8 parameters)
                                    # The initial Jacobian matrix: 
                                    #  x1   x2    x3   mu1  mu2  s1   s2  lambda
                                    # [ 2.   0.   0.   1.   1.   0.   0.   1. ] Stationarity 1
                                    # [ 0.   2.   0.   1.   0.   0.   0.  -1. ] Stationarity 2
                                    # [ 0.   0.   2.   0.   1.   0.   0.   0. ] Stationarity 3
                                    # [ 1.   1.   0.   0.   0.   1.   0.   0. ] Primal feasibility g1 + s1
                                    # [ 1.   0.   1.   0.   0.   0.   1.   0. ] Primal feasibility g2 + s2
                                    # [ 1.  -1.   0.   0.   0.   0.   0.   0. ] Primal feasibility h
                                    # [ 0.   0.   0.   0.5  0.   0.5  0.   0. ] Complementarity 1 mu1 & s1
                                    # [ 0.   0.   0.   0.   0.5  0.   0.5  0. ] Complementarity 2 mu2 & s2
                                    jacobian = np.zeros((8, 8))

                                    # Stationarity (Df + μ1*Dg1 + μ2*Dg2 + λ*Dh)
                                    jacobian[:3, :3] = 2 * np.eye(3)  # Df(x)
                                    jacobian[:3, 3] = grad_g1  # Dg1(x) 
                                    jacobian[:3, 4] = grad_g2  # Dg2(x) 
                                    jacobian[:3, 7] = grad_h   # Dh(x) 

                                    # Primal feasibility (g1(x) + s1, g2(x) + s2, h(x))
                                    jacobian[3, :3] = grad_g1  # Dg1(x) 
                                    jacobian[4, :3] = grad_g2  # Dg2(x) 
                                    jacobian[5, :3] = grad_h   # Dh(x) 
                                    jacobian[3, 5] = 1         # Slack variable s1
                                    jacobian[4, 6] = 1         # Slack variable s2

                                    # Complementarity  (mu1*s1, mu2*s2)
                                    jacobian[6, 3] = s1     # mu1*s1 
                                    jacobian[6, 5] = mu1    # mu1 
                                    jacobian[7, 4] = s2     # mu2*s2 
                                    jacobian[7, 6] = mu2    # mu2 

                                    print(f"Iteration {i+1}:")
                                    print(f" x = {parameters[:3]}, f(x) = {objective(x):.6f}")
                                    print(f"Multipliers: mu1 = {parameters[3]}, mu2 = {parameters[4]}, lambda = {parameters[7]}\n")
                                    
                                    # Solve the system
                                    try:
                                        delta = np.linalg.solve(jacobian, -residuals)
                                    except np.linalg.LinAlgError:
                                        print("Jacobian is singular OR ill-conditioned.")
                                        break

                                    # Update parameters
                                    parameters += delta

                                    # Enforce non-negativity of inequality multipliers and slack variables
                                    parameters[3] = max(0, parameters[3])   # mu1 >= 0
                                    parameters[4] = max(0, parameters[4])   # mu2 >= 0
                                    parameters[5] = max(0, parameters[5])   # s1 >= 0
                                    parameters[6] = max(0, parameters[6])   # s2 >= 0

                                    # Check Convergence
                                    if np.linalg.norm(residuals) < tol:
                                        break

                                x = parameters[:3]
                                return x, parameters[3], parameters[4], parameters[7], objective(x)

                            if __name__ == "__main__":
                                x_opt, mu1, mu2, lambd, f_opt = kkt_method()
                                print(f"Optimal by KKT multipliers: x = {x_opt}, f(x) = {f_opt}")
                                print(f"Multipliers: mu1 = {mu1}, mu2 = {mu2}, lambda = {lambd}\n")
                                
                                x_penalty, f_penalty = penalty_method()
                                x_relative_error  = np.linalg.norm(np.array(x_penalty) - np.array(x_opt)) / np.linalg.norm(np.array(x_opt))
                                f_relative_error  = np.linalg.norm(np.array(f_penalty) - np.array(f_opt)) / np.linalg.norm(np.array(f_opt))
                                print(f"Optimal by Penalty Method: x = {x_penalty}, f(x) = {f_penalty}")
                                print(f"x_Relative_Error: {x_relative_error :.6f}, f_Relative_Error: {f_relative_error :.6f}")
                            

Note: Here, we introduced slack variables(s1 and s2) instead of using the active set. Both the active set and slack variables are essential for ensuring complementary slackness in coding because numerical solvers cannot directly handle inequalities or the discontinuity caused by the \(\mu_i g_i(x) = 0\) condition.

Using slack variables, we can transform inequalities into equalities: \[ g_i(x) \leq 0 \to g_i(x) + s_i = 0, \quad s_i \geq0 \]
On the other hand, the active set method directly identifies and enforces active constraints, potentially reducing computational cost, but requiring more sophisticated logic to update the active set dynamically and check feasibility.