[SI152 Notes] Part 4: Unconstrained Nonlinear Programming
SI152: Numerical Optimization
Lec 9: Nonlinear Programming, Line Search Method
Fundamentals for nonlinear unconstrained optimization
Theorem 1 (Mean Value Theorem)
Given \(f \in C^1\), \(x \in\mathbb{R}^n\), and \(d\in\mathbb{R}^n\), there exists \(\alpha\in(0, 1)\) such that $$f(x + d) = f(x) + \nabla f(x + \alpha d)^T d$$
Theorem 2 (Taylor’s Theorem)
Given \(f \in C^2\), \(x \in\mathbb{R}^n\), and \(d\in\mathbb{R}^n\), there exists \(\alpha\in(0, 1)\) such that $$f(x + d) = f(x) + \nabla f(x)^T d + \frac{1}{2}d^T \nabla^2 f(x + \alpha d) d$$
Definition 3 (Convex function)
A function \(f : \mathbb{R}^n \to \mathbb{R}\) is convex if for all \(\{x_1, x_2\} \subset \mathbb{R}^n\) and \(\alpha \in [0, 1]\) we have $$f(\alpha x_1 + (1 - \alpha) x_2) \leq \alpha f(x_1) + (1 - \alpha) f(x_2).$$
- \(f\) is concave if \(−f\) is convex.
- strictly convex: if for \(x_1 \neq x_2\) and \(\alpha \in (0, 1)\), the above inequality holds strictly.
- If \(f : \mathbb{R}^n \to \mathbb{R}\) is convex, then it is continuous.
- Addition, pointwise maximization, and composition (under suitable conditions, e.g. with an affine map) preserve convexity, as illustrated below.
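For example (a quick check added here, not from the lecture): if \(f_1, f_2\) are convex and \(h(x) := \max\{f_1(x), f_2(x)\}\), then for any \(\{x_1, x_2\}\subset\mathbb{R}^n\) and \(\alpha\in[0,1]\), $$h(\alpha x_1 + (1-\alpha)x_2) \leq \max_{i}\big\{\alpha f_i(x_1) + (1-\alpha) f_i(x_2)\big\} \leq \alpha h(x_1) + (1-\alpha) h(x_2),$$ so \(h\) is convex.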
Definition 4
The epigraph of \(f\) is \(\mathsf{epi}(f) := \{(x, z) : x \in \mathcal{X}, z \in \mathbb{R}, \text{and } f(x) \leq z\}\).
\(\mathsf{dom}(f) := \{x \mid x\in \mathcal{X} \text{ and } f(x) < \infty \}\).
Theorem 5
Let \(\mathcal{X}\) be a nonempty convex subset of \(\mathbb{R}^n\) and let \(f : \mathbb{R}^n \to \mathbb{R}\) be differentiable over an open set containing \(\mathcal{X}\). Then, the following hold:
- (a) \(f\) is convex over \(\mathcal{X}\) if and only if, for all \(\{x_1, x_2\} \subset \mathcal{X}\), we have $$f(x_2) \geq f(x_1) + \nabla f(x_1)^T (x_2 - x_1).$$
- (b) \(f\) is strictly convex over \(\mathcal{X}\) if and only if the above inequality is strict whenever \(x_1 \neq x_2\).
Theorem 6
Let \(\mathcal{X}\) be a nonempty convex subset of \(\mathbb{R}^n\) and let \(f : \mathbb{R}^n \to \mathbb{R}\) be twice continuously differentiable over an open set containing \(\mathcal{X}\). Then, the following hold:
- (a) If \(\nabla^2 f(x)\) is positive semi-definite for all \(x\in\mathcal{X}\), then \(f\) is convex over \(\mathcal{X}\).
- (b) If \(\nabla^2 f(x)\) is positive definite for all \(x\in\mathcal{X}\), then \(f\) is strictly convex over \(\mathcal{X}\).
- (c) If \(\mathcal{X}\) is open and \(f\) is convex over \(\mathcal{X}\), then \(\nabla^2 f(x)\) is positive semi-definite for all \(x\in\mathcal{X}\).
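A small worked example (added for illustration): \(f(x) = x_1^2 + x_1 x_2 + x_2^2\) has $$\nabla^2 f(x) = \begin{pmatrix} 2 & 1 \\ 1 & 2 \end{pmatrix},$$ whose eigenvalues are \(1\) and \(3\), so the Hessian is positive definite everywhere and \(f\) is strictly convex by (b).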
Definition 7 (Subgradient and Subdifferential)
A vector \(g \in \mathbb{R}^n\) is a subgradient of a proper convex \(f\) at \(x \in \mathsf{dom}(f)\) if $$f(y) \geq f(x) + g^T (y - x), \quad \forall y \in \mathbb{R}^n.$$
The set of all subgradients of \(f\) at \(x\), denoted \(∂f(x)\), is the subdifferential of \(f\) at \(x\).
- Let \(f : \mathbb{R}^n\to \bar{\mathbb{R}}\) be proper and convex.
- If \(x \in \mathsf{dom}(f)\), then \(g \in \partial f(x)\) if and only if
\[f'(d; x) \geq g^T d, \quad \forall d \in\mathbb{R}^n.\]
- If \(x \in \mathsf{int}\,\mathsf{dom}(f)\), then \(\partial f(x)\) is a nonempty, convex, and compact set, and
\[f'(d; x) = \max_{g\in\partial f(x)} g^T d, \quad \forall d \in\mathbb{R}^n.\]
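A standard one-dimensional example (added here for illustration): for \(f(x) = |x|\), $$\partial f(0) = \{g \in \mathbb{R} : |y| \geq g\,y,\ \forall y\in\mathbb{R}\} = [-1, 1],$$ and indeed \(f'(d; 0) = |d| = \max_{g\in[-1,1]} g\,d\) for every direction \(d\).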
Descent directions
At a point \(x \in \mathbb{R}^n\), a descent direction \(d\) is one for which \(\nabla f(x)^T d = f'(d; x) < 0\).
We can decrease \(f\) by moving (a small distance) along such a direction \(d\).
The steepest descent direction is \(d = -\nabla f(x)\).
Optimality conditions
Definition 10 (Global minimum)
A vector \(x^∗\) is a global minimum of \(f\) if \(f(x^∗) ≤ f(x) ,\forall x \in\mathbb{R}^n\).
Definition 11 (Local minimum)
A vector \(x^*\) is a local minimum of \(f\) if there exists \(\epsilon > 0\) such that \(f(x^*) \leq f(x), \forall x \in B(x^*, \epsilon) := \{x \in\mathbb{R}^n \mid \lVert x - x^* \rVert_2 \leq \epsilon \}\).
Theorem 12
If \(f : \mathbb{R}^n \to \mathbb{R}\) is convex, then a local minimum of \(f\) is a global minimum of \(f\). If \(f\) is strictly convex, then there exists at most one global minimum of \(f\).
Theorem 13 (First-order necessary condition)
If \(f \in C^1\) and \(x^*\) is a local minimizer of \(f\), then \(\nabla f(x^*) = 0\).
Definition 14
A point \(x \in \mathbb{R}^n\) is a stationary point for \(f \in C^1\) if \(\nabla f(x) = 0\).
Theorem 16 (Second-order necessary condition)
If \(f \in C^2\) and \(x^*\) is a local minimizer of \(f\), then \(\nabla f(x^*) = 0\) and \(\nabla^2 f(x^*) \succeq 0\).
Theorem 17 (Second-order sufficient condition)
If \(f \in C^2\), \(\nabla f(x^∗) = 0\) and \(\nabla^2f(x^∗) \succ 0\), then \(x^*\) is a strict local minimizer.
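These conditions are easy to check numerically; below is a minimal sketch (the test function and helper names are my own illustrative choices, not from the lecture):

```python
import numpy as np

def f(x):
    # Illustrative smooth test function
    return x[0]**2 + x[0]*x[1] + x[1]**2 - x[1]

def grad_f(x):
    return np.array([2*x[0] + x[1], x[0] + 2*x[1] - 1])

def hess_f(x):
    return np.array([[2.0, 1.0], [1.0, 2.0]])

# Stationary point: solve grad_f(x) = 0, which here is a 2x2 linear system
x_star = np.linalg.solve(hess_f(np.zeros(2)), np.array([0.0, 1.0]))

print("gradient at x*:", grad_f(x_star))                           # ~ [0, 0]: first-order condition
print("Hessian eigenvalues:", np.linalg.eigvalsh(hess_f(x_star)))  # all > 0: strict local minimizer
```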
Line search method
Line search philosophy: compute \(d_k\) and then compute \(α_k > 0\) so that \(x_{k+1} \gets x_k + α_k d_k\) is “better” than \(x_k\) in some way.
Common choices for \(d_k\):
- the steepest descent direction (gradient descent): \(d_k = -\nabla f(x_k)\).
- the Newton direction: \(d_k = -\nabla^2 f(x_k)^{-1} \nabla f(x_k)\).
- Some approximation of Newton direction (Quasi-Newton): \(d_k = -H_k^{-1} \nabla f(x_k)\).
Choice of \(\alpha_k\):
- At least ensure that for \(x_{k+1} \gets x_k + α_k d_k\), we have \(f(x_{k+1}) < f(x_k)\).
- At most solve the one-dimensional (nonlinear) minimization problem: \(\min_{\alpha\geq 0} f(x_k + \alpha d_k)\).
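For a quadratic \(f(x) = \frac{1}{2} x^T Q x + c^T x\) with \(Q \succ 0\) (a worked example added here, not from the lecture), the exact one-dimensional minimization has a closed form: $$\min_{\alpha \geq 0} f(x_k + \alpha d_k) \quad\Longrightarrow\quad \alpha_k = -\frac{\nabla f(x_k)^T d_k}{d_k^T Q d_k},$$ which is positive whenever \(d_k\) is a descent direction.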
Sufficient decrease condition
The Armijo (sufficient decrease) condition: $$f(x_k + \alpha d_k) \leq f(x_k) + c_1 \alpha \nabla f(x_k)^T d_k,$$
where \(c_1 \in (0, 1)\) is a user-specified constant:
- \(c_1 = 0\) is too loose of a requirement
- \(c_1 = 1\) is too strict, and may not be satisfiable if curvature is strictly positive.
Backtracking line search
Algorithmically, choose the largest value in the set \(\{ \gamma^0,\gamma^1,\gamma^2,\dots \}\), where \(\gamma\in(0, 1)\) is a given constant, satisfying the Armijo condition.
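A minimal sketch of backtracking line search (my own illustration; the parameter names `c1` and `gamma` follow the notation above, and the default values are typical choices, not prescribed by the lecture):

```python
import numpy as np

def backtracking(f, grad_f, x, d, c1=1e-4, gamma=0.5, max_iter=50):
    """Return the largest alpha in {gamma^0, gamma^1, ...} satisfying the Armijo condition."""
    alpha = 1.0
    fx = f(x)
    slope = grad_f(x) @ d               # directional derivative grad f(x)^T d (should be < 0)
    for _ in range(max_iter):
        if f(x + alpha * d) <= fx + c1 * alpha * slope:
            return alpha
        alpha *= gamma                  # shrink the step and try again
    return alpha

# Example: one steepest-descent step on f(x) = x1^2 + 10*x2^2
f = lambda x: x[0]**2 + 10 * x[1]**2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
x = np.array([1.0, 1.0])
d = -grad_f(x)
alpha = backtracking(f, grad_f, x, d)
print(alpha, f(x + alpha * d) < f(x))   # the accepted step gives a decrease
```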
Wolfe conditions
The curvature condition: $$\nabla f(x_k + \alpha d_k)^T d_k \geq c_2 \nabla f(x_k)^T d_k,$$
where \(c_2\in(c_1, 1)\) is a user-specified constant.
This condition keeps the accepted step length away from \(\alpha = 0\), ruling out unacceptably short steps.
The Armijo and curvature conditions together compose the Wolfe conditions.
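A small helper (again my own sketch, with typical default values for `c1` and `c2`) that checks whether a given step length satisfies both Wolfe conditions:

```python
import numpy as np

def satisfies_wolfe(f, grad_f, x, d, alpha, c1=1e-4, c2=0.9):
    """Check the Armijo and curvature conditions for step length alpha along direction d."""
    slope = grad_f(x) @ d                                     # grad f(x_k)^T d_k, assumed < 0
    armijo = f(x + alpha * d) <= f(x) + c1 * alpha * slope    # sufficient decrease
    curvature = grad_f(x + alpha * d) @ d >= c2 * slope       # curvature condition
    return armijo and curvature

# Example on a simple quadratic: the exact minimizer along d satisfies both conditions
f = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x = np.array([2.0, -1.0])
d = -grad_f(x)
print(satisfies_wolfe(f, grad_f, x, d, alpha=1.0))            # True
```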
Theorem 18 (Zoutendijk’s Theorem)
Suppose that \(f\) is bounded below and continuously differentiable in an open set \(\mathcal{N}\) containing the sublevel set \(\mathcal{L} := \{x \mid f(x) \leq f(x_0)\}\). Suppose also that \(\nabla f\) is Lipschitz continuous on \(\mathcal{N}\) with constant \(L\). Consider any iteration of the form \(x_{k+1} \gets x_k + \alpha_k d_k\) for all \(k \in\mathbb{N}_+\), where, for all \(k\), \(d_k\) is a descent direction and \(\alpha_k\) satisfies the Wolfe conditions. Let \(\theta_k\) denote the angle between \(d_k\) and \(-\nabla f(x_k)\). Then, $$\sum_{k} \cos^2\theta_k \, \lVert \nabla f(x_k) \rVert_2^2 < \infty.$$
Theorem 19
For \(L\)-smooth and \(\mu\)-strongly convex \(f\), gradient descent (\(d_k = -\nabla f(x_k)\)) with fixed step size \(\alpha \leq \frac{2}{\mu + L}\) satisfies $$\lVert x_k - x^* \rVert_2^2 \leq \Big(1 - \frac{2\alpha\mu L}{\mu + L}\Big)^{k} \lVert x_0 - x^* \rVert_2^2,$$ where \(x^*\) is the unique minimizer.
So the convergence is linear: the distance to \(x^*\) contracts by a constant factor per iteration.
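A quick numerical illustration (added here, not from the notes): fixed-step gradient descent on a strongly convex quadratic, where the distance to the minimizer shrinks by a constant factor per iteration:

```python
import numpy as np

# Strongly convex quadratic f(x) = 0.5 * x^T Q x with mu = 1, L = 10 (eigenvalues of Q)
Q = np.diag([1.0, 10.0])
mu, L = 1.0, 10.0
grad = lambda x: Q @ x                  # gradient; the minimizer is x* = 0
alpha = 2.0 / (mu + L)                  # largest step size allowed by the theorem

x = np.array([1.0, 1.0])
for k in range(10):
    x = x - alpha * grad(x)
    print(k, np.linalg.norm(x))         # shrinks by a factor of (L - mu) / (L + mu) = 9/11 each step
```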
Lec 10: Quasi-Newton Method
Newton’s method is fast, but expensive, since it requires Hessians and the solution of a linear system to compute the search direction.
So, in quasi-Newton methods, rather than computing \(\nabla^2 f(x_k)\), we maintain an approximation \(H_k\) and update it in each iteration.
The model in each iteration:
$$m_k(d) := f(x_k) + \nabla f(x_k)^T d + \frac{1}{2} d^T H_k d$$
For the next iteration:
$$m_{k+1}(d) := f(x_{k+1}) + \nabla f(x_{k+1})^T d + \frac{1}{2} d^T H_{k+1} d$$
It should satisfy \(\nabla m_{k+1}(0) = \nabla f(x_{k+1})\) and \(\nabla m_{k+1}(-\alpha_k d_k) = \nabla f(x_k)\).
Then, with \(s_k := x_{k+1} - x_k = \alpha_k d_k\) and \(y_k := \nabla f(x_{k+1}) - \nabla f(x_k)\), we have the “secant equation”
$$H_{k+1} s_k = y_k$$
For \(n \geq 2\), the \(H_{k+1}\) satisfying the secant equation (and symmetry) is not unique.
Symmetric-rank-1 updates
The symmetric-rank-1 (SR1) method requires \(H_{k+1}\) to be symmetric and enforces:
- \(H_{k+1} s_k = y_k\) (secant equation);
- \(H_{k+1} = H_k + \sigma v v^T\) (rank-1 update).
Solving these two conditions for \(\sigma\) and \(v\) yields
$$H_{k+1} = H_k + \frac{(y_k - H_k s_k)(y_k - H_k s_k)^T}{(y_k - H_k s_k)^T s_k},$$
which is well defined only when \((y_k - H_k s_k)^T s_k \neq 0\).
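A minimal sketch of the SR1 update as a function (my own illustration; it includes the usual safeguard of skipping the update when the denominator is close to zero):

```python
import numpy as np

def sr1_update(H, s, y, eps=1e-8):
    """Symmetric rank-1 update of the Hessian approximation H so that the new H maps s to y."""
    r = y - H @ s                                       # residual of the secant equation
    denom = r @ s
    if abs(denom) <= eps * np.linalg.norm(r) * np.linalg.norm(s):
        return H                                        # skip when the update is numerically unsafe
    return H + np.outer(r, r) / denom

# Quick check that the secant equation holds after the update
H = np.eye(2)
s = np.array([1.0, 0.0])
y = np.array([2.0, 1.0])
print(np.allclose(sr1_update(H, s, y) @ s, y))          # True
```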
Symmetric-rank-2 updates
Davidon-Fletcher-Powell (DFP) update: with \(\rho_k := \frac{1}{y_k^T s_k}\),
$$H_{k+1} = (I - \rho_k y_k s_k^T)\, H_k\, (I - \rho_k s_k y_k^T) + \rho_k y_k y_k^T$$
Broyden-Fletcher-Goldfarb-Shanno (BFGS) update:
$$H_{k+1} = H_k - \frac{H_k s_k s_k^T H_k}{s_k^T H_k s_k} + \frac{y_k y_k^T}{y_k^T s_k}$$
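A compact sketch of a BFGS iteration (my own illustration combining the BFGS update above with a backtracking line search; not the lecture's code, and the test function is the standard Rosenbrock function):

```python
import numpy as np

def bfgs(f, grad_f, x0, tol=1e-8, max_iter=200):
    """Quasi-Newton method with the BFGS update of the Hessian approximation H."""
    x = np.asarray(x0, dtype=float)
    H = np.eye(len(x))                             # initial Hessian approximation
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) < tol:
            break
        d = -np.linalg.solve(H, g)                 # quasi-Newton direction d = -H^{-1} grad f(x)
        alpha, c1, gamma = 1.0, 1e-4, 0.5          # backtracking (Armijo) line search
        while f(x + alpha * d) > f(x) + c1 * alpha * (g @ d) and alpha > 1e-12:
            alpha *= gamma
        s = alpha * d
        y = grad_f(x + s) - g
        if y @ s > 1e-12:                          # curvature check keeps H positive definite
            H = H - np.outer(H @ s, H @ s) / (s @ H @ s) + np.outer(y, y) / (y @ s)
        x = x + s
    return x

# Test on the Rosenbrock function; iterates should approach the minimizer (1, 1)
f = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
grad_f = lambda x: np.array([-2 * (1 - x[0]) - 400 * x[0] * (x[1] - x[0]**2),
                             200 * (x[1] - x[0]**2)])
print(bfgs(f, grad_f, np.array([-1.2, 1.0])))
```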
Convergence
Superlinear convergence of the BFGS method: under standard assumptions (in particular, when the Dennis-Moré condition on the Hessian approximations holds), BFGS converges locally at a superlinear rate.