[SI152 Notes] Part 4: Unconstrained Nonlinear Programming

SI152: Numerical Optimization

Lec 9: Nonlinear Programming, Line Search Method

Fundamentals of unconstrained nonlinear optimization

Theorem 1 (Mean Value Theorem)
Given \(f \in C^1\), \(x \in\mathbb{R}^n\), and \(d\in\mathbb{R}^n\), there exists \(\alpha\in(0, 1)\) such that $$f(x + d) = f(x) + \nabla f(x + \alpha d)^T d$$

Theorem 2 (Taylor’s Theorem)
Given \(f \in C^2\), \(x \in\mathbb{R}^n\), and \(d\in\mathbb{R}^n\), there exists \(\alpha\in(0, 1)\) such that $$f(x + d) = f(x) + \nabla f(x)^T d + \frac{1}{2}d^T \nabla^2 f(x + \alpha d) d$$

Definition 3 (Convex function)
A function \(f : \mathbb{R}^n \to \mathbb{R}\) is convex if for all \(\{x_1, x_2\} \subset \mathbb{R}^n\) and \(\alpha \in [0, 1]\) we have

\[f(αx_1 + (1 − α)x_2) \leq αf(x_1) + (1 − α)f(x_2) \]

  • \(f\) is concave if \(−f\) is convex.
  • strictly convex: if for \(x_1 \neq x_2\) and \(\alpha \in (0, 1)\), the above inequality holds strictly.
  • If \(f : \mathbb{R}^n \to \mathbb{R}\) is convex, then it is continuous.
  • Addition, pointwise maximization, and composition with an affine map (or with a nondecreasing convex outer function) preserve convexity.
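
For instance, the pointwise maximum of affine functions, \(f(x) = \max_{1 \leq i \leq m} (a_i^T x + b_i)\), is convex, since each affine piece is convex and maximization preserves convexity.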

Definition 4
The epigraph of \(f\) is \(\mathsf{epi}(f) := \{(x, z) : x \in X , z \in \mathbb{R}, \text{and } f(x) \leq z\}\).
\(\mathsf{dom}(f) := \{x | x\in X \text{ and } f(x) < \infty \}\).

Theorem 5
Let \(\mathcal{X}\) be a nonempty convex subset of \(\mathbb{R}^n\) and let \(f : \mathbb{R}^n \to \mathbb{R}\) be differentiable over an open set containing \(\mathcal{X}\). Then, the following hold:

  • (a) \(f\) is convex over \(\mathcal{X}\) if and only if, for all \(\{x_1, x_2\} \subset \mathcal{X}\), we have

\[f(x_2) \geq f(x_1) + \nabla f(x_1)^T(x_2 − x_1) \]

  • (b) \(f\) is strictly convex over \(\mathcal{X}\) if and only if the above inequality is strict whenever \(x_1 \neq x_2\).

Theorem 6
Let \(\mathcal{X}\) be a nonempty convex subset of \(\mathbb{R}^n\) and let \(f : \mathbb{R}^n \to \mathbb{R}\) be twice continuously differentiable over an open set containing \(\mathcal{X}\). Then, the following hold:

  • (a) If \(\nabla^2f(x)\) is positive semi-definite for all \(x\in\mathcal{X}\), then \(f\) is convex over \(\mathcal{X}\).
  • (b) If \(\nabla^2f(x)\) is positive definite for all \(x\in\mathcal{X}\), then \(f\) is strictly convex over \(\mathcal{X}\).
  • (c) If \(\mathcal{X}\) is open and \(f\) is convex over \(\mathcal{X}\), then \(\nabla^2f(x)\) is positive semi-definite for all \(x\in\mathcal{X}\).
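
As an example, for a quadratic \(f(x) = \frac{1}{2} x^T Q x + b^T x\) with symmetric \(Q\), the Hessian is \(\nabla^2 f(x) = Q\) at every \(x\), so \(f\) is convex on \(\mathbb{R}^n\) whenever \(Q \succeq 0\) and strictly convex whenever \(Q \succ 0\).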

Definition 7 (Subgradient and Subdifferential)
A vector \(g \in \mathbb{R}^n\) is a subgradient of a proper convex function \(f\) at \(x \in \mathsf{dom}(f)\) if

\[f(\bar{x}) ≥ f(x) + g^T(\bar{x} − x) ,\forall \bar{x}\in \mathbb{R}^n . \]

The set of all subgradients of \(f\) at \(x\), denoted \(∂f(x)\), is the subdifferential of \(f\) at \(x\).

  • Let \(f : \mathbb{R}^n\to \bar{\mathbb{R}}\) be proper and convex.
    • If \(x \in \mathsf{dom}(f)\), then \(g \in \partial f(x)\) if and only if

    \[f'(d; x) \geq g^T d , \forall d \in\mathbb{R}^n \]

    • If \(x \in \mathsf{int\,dom}(f)\), then \(\partial f(x)\) is nonempty, convex, and compact, and

    \[f'(d; x) = \max_{g\in\partial f(x)} g^T d , \forall d \in\mathbb{R}^n \]
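
A standard one-dimensional example: for \(f(x) = |x|\),

\[\partial f(x) = \begin{cases} \{-1\}, & x < 0, \\ [-1, 1], & x = 0, \\ \{1\}, & x > 0, \end{cases}\]

since at \(x = 0\) every \(g \in [-1, 1]\) satisfies \(|\bar{x}| \geq g \bar{x}\) for all \(\bar{x}\), which is exactly the subgradient inequality.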

Descent directions
At a point \(x \in \mathbb{R}^n\), a descent direction \(d\) is one for which we have \(∇f(x)^T d = f'(d; x) < 0\).
We can decrease \(f\) by moving (a small distance) along such a direction \(d\).
The steepest descent direction is \(d = -\nabla f(x)\).

Optimality conditions

Definition 10 (Global minimum)
A vector \(x^∗\) is a global minimum of \(f\) if \(f(x^∗) ≤ f(x) ,\forall x \in\mathbb{R}^n\).

Definition 11 (Local minimum)
A vector \(x^∗\) is a local minimum of \(f\) if there exists \(\epsilon > 0\) such that \(f(x^∗) ≤ f(x) ,\forall x \in B(x^∗, \epsilon) := \{x \in\mathbb{R}^n | \lVert x − x^∗ \rVert _2 ≤ \epsilon \}\).

Theorem 12
If \(f\) is convex, then a local minimum of \(f\) is a global minimum of \(f\). If \(f\) is strictly convex, then there exists at most one global minimum of \(f\).

Theorem 13 (First-order necessary condition)
If \(f \in C^1\) and \(x^∗\) is a local minimizer of \(f\), then \(\nabla f(x^∗) = 0\).

Definition 14
A point \(x \in \mathbb{R}^n\) is a stationary point for \(f \in C^1\) if \(\nabla f(x) = 0\).

Theorem 16 (Second-order necessary condition)
If \(f \in C^2\) and \(x^∗\) is a local minimizer of \(f\), then \(\nabla f(x^∗) = 0\) and \(\nabla^2f(x^∗) \succeq 0\).

Theorem 17 (Second-order sufficient condition)
If \(f \in C^2\), \(\nabla f(x^∗) = 0\) and \(\nabla^2f(x^∗) \succ 0\), then \(x^*\) is a strict local minimizer.
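
A quick worked example: for \(f(x) = \frac{1}{2} x^T Q x - b^T x\) with \(Q \succ 0\), stationarity \(\nabla f(x^*) = Q x^* - b = 0\) gives \(x^* = Q^{-1} b\), and \(\nabla^2 f(x^*) = Q \succ 0\), so by Theorem 17 \(x^*\) is a strict local minimizer (in fact the unique global minimizer, since \(f\) is strictly convex; cf. Theorem 12).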

Line search method

Line search philosophy: compute \(d_k\) and then compute \(α_k > 0\) so that \(x_{k+1} \gets x_k + α_k d_k\) is “better” than \(x_k\) in some way.

Common choices for \(d_k\):

  • the steepest descent direction (gradient descent): \(d_k = -\nabla f(x_k)\).
  • the Newton direction: \(d_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k)\).
  • some approximation of the Newton direction (quasi-Newton): \(d_k = -H_k^{-1} \nabla f(x_k)\).

Choice of \(\alpha_k\):

  • At least ensure that for \(x_{k+1} \gets x_k + α_k d_k\), we have \(f(x_{k+1}) < f(x_k)\).
  • At most solve the one-dimensional (nonlinear) minimization problem: \(\min_{\alpha\geq 0} f(x_k + \alpha d_k)\).

Sufficient decrease condition

the Armijo condition:

\[f(x_k + \alpha_k d_k) \leq f(x_k) + c_1 \alpha_k \nabla f(x_k)^T d_k \]

where \(c_1 \in (0, 1)\) is a user-specified constant:

  • \(c_1 = 0\) is too loose a requirement (it only asks that \(f\) not increase);
  • \(c_1 = 1\) is too strict, and may not be satisfiable if the curvature is strictly positive.

Algorithmically (backtracking), choose the largest value in the set \(\{ \gamma^0,\gamma^1,\gamma^2,\dots \}\), where \(\gamma\in(0, 1)\) is a given constant, that satisfies the Armijo condition.
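
A minimal Python sketch of this backtracking rule (the function name and default constants are illustrative, not from the notes):

```python
import numpy as np

def backtracking_armijo(f, grad_f, x, d, c1=1e-4, gamma=0.5, max_iter=50):
    """Return the largest alpha in {1, gamma, gamma^2, ...} satisfying Armijo."""
    fx, slope = f(x), grad_f(x) @ d        # slope = grad f(x)^T d, assumed < 0
    alpha = 1.0
    for _ in range(max_iter):
        if f(x + alpha * d) <= fx + c1 * alpha * slope:
            return alpha                   # sufficient decrease achieved
        alpha *= gamma                     # otherwise shrink the step
    return alpha

# usage: steepest descent direction on a simple quadratic
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([1.0, -2.0])
alpha = backtracking_armijo(f, grad_f, x, -grad_f(x))
```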

Wolfe conditions

the curvature condition:

\[\nabla f(x_k + α_k d_k)^T d_k \geq c_2 \nabla f(x_k)^T d_k \]

where \(c_2\in(c_1, 1)\) is a user-specified constant.
This condition keeps the accepted step away from \(\alpha = 0\): it rules out very small steps, for which the slope along \(d_k\) would still be almost as negative as at \(\alpha = 0\).

The Armijo and curvature conditions together compose the Wolfe conditions.
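
A small checker for both conditions, in the same illustrative style as the backtracking sketch above (NumPy arrays assumed):

```python
def satisfies_wolfe(f, grad_f, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the sufficient decrease (Armijo) and curvature conditions."""
    slope = grad_f(x) @ d                                  # grad f(x)^T d
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * slope
    curvature = grad_f(x + alpha * d) @ d >= c2 * slope
    return armijo and curvature
```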

Theorem 18 (Zoutendijk’s Theorem)

Suppose that \(f\) is bounded below and continuously differentiable in an open set \(\mathcal{N}\) containing the sublevel set \(\mathcal{L} := \{x \,|\, f(x) \leq f(x_0)\}\). Suppose also that \(\nabla f\) is Lipschitz continuous on \(\mathcal{N}\) with constant \(L\). Consider any iteration of the form \(x_{k+1} \gets x_k + \alpha_k d_k\) for all \(k \in\mathbb{N}\), where, for all \(k\), \(d_k\) is a descent direction and \(\alpha_k\) satisfies the Wolfe conditions. Let \(\theta_k\) denote the angle between \(d_k\) and \(-\nabla f(x_k)\). Then,

\[\sum_{k=0}^{\infty} \cos^2 \theta_k \lVert \nabla f(x_k) \rVert^2 < \infty \]

In particular, \(\cos^2 \theta_k \lVert \nabla f(x_k) \rVert^2 \to 0\); so if the search directions keep \(\cos \theta_k\) bounded away from zero (as for steepest descent, where \(\cos \theta_k = 1\)), then \(\lVert \nabla f(x_k) \rVert \to 0\).

Theorem 19

For \(L\)-smooth and \(\mu\)-strongly convex \(f\), gradient descent (\(d_k = -\nabla f(x_k)\)) with the fixed step size \(\alpha = \frac{2}{\mu + L}\) satisfies

\[f(x^k) - f(x^*) \leq \left( \dfrac{L/\mu -1}{L/\mu +1} \right)^{2k} \dfrac{L}{2} \lVert x^0 -x^* \rVert^2 \]

This is linear (geometric) convergence; the rate degrades as the condition number \(\kappa = L/\mu\) grows.
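
A tiny numerical illustration of this rate on a strongly convex quadratic (the specific \(Q\), \(\mu\), \(L\) below are made up for the example):

```python
import numpy as np

# f(x) = 0.5 x^T Q x is L-smooth and mu-strongly convex,
# with mu, L the smallest/largest eigenvalues of Q.
Q = np.diag([1.0, 10.0])
mu, L = 1.0, 10.0
f = lambda x: 0.5 * x @ Q @ x
grad_f = lambda x: Q @ x

alpha = 2.0 / (mu + L)                 # fixed step size from Theorem 19
x = np.array([1.0, 1.0])
for k in range(25):
    x = x - alpha * grad_f(x)          # gradient descent step

print(f(x))                            # f(x_k) shrinks geometrically in k
```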

Lec 10: Quasi-Newton Method

Newton’s method is fast, but expensive, since it requires Hessians and the solution of a linear system to compute the search direction.
So, in the quasi-Newton method, rather than computing \(\nabla^2 f(x_k)\), we compute an approximation \(H_k\) and update it at each iteration.

The model in each iteration:

\[m_k(d) := f(x_k) + \nabla f(x_k)^T d + \frac{1}{2} d^T H_k d \]

For next iteration:

\[m_{k+1}(d) := f(x_{k+1}) + \nabla f(x_{k+1})^T d + \frac{1}{2} d^T H_{k+1} d \]

The new model should reproduce the gradient of \(f\) at the previous iterate \(x_k\), which corresponds to \(d = -\alpha_k d_k\):

\[\nabla m_{k+1}( -\alpha_k d_k ) = \nabla f(x_{k}) \]
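
Since \(\nabla m_{k+1}(d) = \nabla f(x_{k+1}) + H_{k+1} d\), this requirement reads

\[\nabla f(x_{k+1}) - \alpha_k H_{k+1} d_k = \nabla f(x_k), \quad\text{i.e.,}\quad H_{k+1} (\alpha_k d_k) = \nabla f(x_{k+1}) - \nabla f(x_k) .\]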

Then we have the “secant equation”

\[H_{k+1} s_k = y_k;~ s_k = x_{k+1} - x_k = \alpha_k d_k,~ y_k = \nabla f(x_{k+1}) - \nabla f(x_{k}) \]

For \(n > 1\), the matrix \(H_{k+1}\) satisfying the secant equation is not unique.

Symmetric-rank-1 updates

The symmetric-rank-1 (SR1) method requires \(H_{k+1}\) to be symmetric and enforces:

\[H_{k+1} s_k = y_k \quad \text{(secant equation)}, \qquad H_{k+1} = H_k + \sigma v v^T \quad \text{(rank-1 update)}.\]
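
Solving these two requirements for \(\sigma\) and \(v\) gives the standard closed-form SR1 update (well defined only when the denominator is nonzero):

\[H_{k+1} = H_k + \frac{(y_k - H_k s_k)(y_k - H_k s_k)^T}{(y_k - H_k s_k)^T s_k} .\]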

Symmetric-rank-2 updates

Davidon-Fletcher-Powell update
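
In the notation above (with \(H_k\) approximating the Hessian), the standard DFP update reads

\[H_{k+1} = \left(I - \rho_k y_k s_k^T\right) H_k \left(I - \rho_k s_k y_k^T\right) + \rho_k y_k y_k^T, \qquad \rho_k = \frac{1}{y_k^T s_k} .\]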

Broyden-Fletcher-Goldfarb-Shanno update
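
The standard BFGS update of the Hessian approximation is

\[H_{k+1} = H_k - \frac{H_k s_k s_k^T H_k}{s_k^T H_k s_k} + \frac{y_k y_k^T}{y_k^T s_k} .\]

A minimal Python sketch of a quasi-Newton iteration with this update (a generic illustration with made-up function and parameter names, not the lecture's implementation):

```python
import numpy as np

def bfgs_minimize(grad_f, x0, alpha=1.0, tol=1e-8, max_iter=100):
    """Quasi-Newton iteration with the BFGS update of the Hessian approximation H."""
    x = x0.copy()
    H = np.eye(x0.size)                    # initial Hessian approximation
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(H, g)         # quasi-Newton direction: H d = -grad f(x)
        x_new = x + alpha * d              # in practice a line search chooses alpha
        s, y = x_new - x, grad_f(x_new) - g
        if s @ y > 1e-12:                  # skip the update if curvature s^T y is not positive
            Hs = H @ s
            H = H - np.outer(Hs, Hs) / (s @ Hs) + np.outer(y, y) / (s @ y)
        x = x_new
    return x

# usage: minimize a simple convex quadratic f(x) = 0.5 x^T Q x
Q = np.diag([1.0, 10.0])
x_star = bfgs_minimize(lambda x: Q @ x, np.array([1.0, 1.0]))
```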

Convergence

Superlinear convergence of BFGS method
