Most of the bounds described in the optimization literature are upper bounds of the form $f(x_t) - f(x_\star) \leq \varepsilon(t)$ or $\|x_t - x_\star\|^2 \leq \varepsilon(t)$. But what about finding the reverse-side inequality $f(x_t) - f(x_\star) \geq \varepsilon(t)$? Said otherwise, what is the best a “gradient-descent-like” algorithm can achieve?
To formalize this notion, we consider algorithms, here sequences $(x_t)_{t \geq 0}$, that build upon the previous iterates with access only to a first-order oracle:
Assumption 1 (First-order method).
We assume that a first-order method is given by a sequence $(x_t)_{t \geq 0}$ such that
$$x_{t+1} \in x_0 + \operatorname{span}\left(\nabla f(x_0), \ldots, \nabla f(x_t)\right) \quad \text{for all } t \geq 0.$$
We shall note that one can think of more general ways to define first-order methods, but this level of generality is enough for the results we aim to prove.
With this assumption in mind, how can one design a function adversarial to this type of scheme? The idea is to find a function such that the gradient at step $t$ gives minimal information, i.e., it has a minimal number of nonzero partial derivatives. A way to define such a function is to “stack” quadratic functions with increasing dependencies between the variables:
$$f^{k,\mu}(x) = \frac{L-\mu}{4}\left[\frac{1}{2}\left(x_1^2 + \sum_{i=1}^{k-1}(x_i - x_{i+1})^2 + x_k^2\right) - x_1\right] + \frac{\mu}{2}\|x\|^2, \qquad (1)$$
where $x \in \mathbb{R}^d$, $1 \leq k \leq d$ and $0 \leq \mu < L$.
Set $f = f^{k,\mu}$. Observe that if we start from $x_0 = 0$, then
$$\nabla f(x_0) = -\frac{L-\mu}{4}\, e_1, \qquad x_1 \in x_0 + \operatorname{span}(\nabla f(x_0)) = \operatorname{span}(e_1),$$
that is, only the first coordinate is updated after one iteration. What happens now that we have access to $x_0$, $x_1$ and $\nabla f(x_1)$? For an algorithm satisfying Assumption 1, we look at
$$x_2 \in x_0 + \operatorname{span}(\nabla f(x_0), \nabla f(x_1)).$$
One can check that for any $x \in \operatorname{span}(e_1, \ldots, e_i)$ with $i < k$, $\nabla f(x) \in \operatorname{span}(e_1, \ldots, e_{i+1})$, and by an easy induction, we have $x_t \in \operatorname{span}(e_1, \ldots, e_t)$: any first-order method will only be able to update at most one new coordinate at each iteration. We are going to prove the following result.
Theorem 1 (Lower-bound for smooth convex optimization).
For any $L > 0$, $d \in \mathbb{N}$, $1 \leq t \leq (d-1)/2$ and $x_0 \in \mathbb{R}^d$, there exists a convex function $f$ that is $C^\infty$ and $L$-smooth such that any sequence $(x_s)_{s \geq 0}$ satisfying Assumption 1 is such that
$$f(x_t) - f(x_\star) \geq \frac{3L\,\|x_0 - x_\star\|^2}{32\,(t+1)^2}, \qquad (2)$$
where $x_\star$ is a minimizer of $f$.
Remark that the rate in $1/t^2$ is not achieved by gradient descent1! We also have a lower bound for the class of strongly convex functions.

1 There exist algorithms that achieve it, in particular Nesterov’s acceleration method.
Theorem 2 (Lower-bound for smooth strongly convex optimization).
For any $0 < \mu < L$, $d \in \mathbb{N}$ and $x_0 \in \mathbb{R}^d$, there exists a $\mu$-strongly convex function $f$ that is $C^\infty$ and $L$-smooth such that any sequence $(x_s)_{s \geq 0}$ satisfying Assumption 1 is such that for all $1 \leq t \leq d$, we have
$$\|x_t - x_\star\|^2 \geq \frac{1}{2}\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2t} \|x_0 - x_\star\|^2, \qquad (4)$$
$$f(x_t) - f(x_\star) \geq \frac{\mu}{4}\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2t} \|x_0 - x_\star\|^2, \qquad (5)$$
where $\kappa = L/\mu$ and $x_\star$ is the unique minimizer of $f$.
Note that it is common in the literature to see Theorem 2 without the extra constant factor in (4). This is an artifact of our proof: we prove this result in the finite-dimensional case, whereas Nesterov (2018) works in the infinite-dimensional space $\ell^2(\mathbb{N})$. Before proving these important results, due to Nemirovski and Yudin (1983), we are going to prove several lemmas.
Lemma 1 (Minimizers of $f^{k,\mu}$).
Let $L > 0$, $0 \leq \mu < L$, $1 \leq k \leq d$; then $f^{k,\mu}$ defined in (1) is a $\mu$-strongly convex (simply convex if $\mu = 0$) $C^\infty$-function such that its gradient is $L$-Lipschitz.
If $\mu = 0$, it has a minimizer $x_\star$ satisfying
$$x_{\star,i} = 1 - \frac{i}{k+1}$$
If $\mu > 0$, it has a unique minimizer $x_\star$ satisfying
$$x_{\star,i} = \frac{q^i - q^{2(k+1)-i}}{1 - q^{2(k+1)}}$$
for $1 \leq i \leq k$, and $x_{\star,i} = 0$ for $i > k$, where $q = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ and $\kappa = L/\mu$.
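Before turning to the proof, here is a small numerical sketch of the lemma (assuming the quadratic (1) and the closed-form minimizers as reconstructed above; the helper names are ours): we solve the first-order optimality condition directly and compare with the stated formulas.

```python
import numpy as np

def laplacian(d, k):
    """(Thresholded) tridiagonal Dirichlet Laplacian acting on the first k coordinates."""
    A = np.zeros((d, d))
    for i in range(k):
        A[i, i] = 2.0
    for i in range(k - 1):
        A[i, i + 1] = A[i + 1, i] = -1.0
    return A

d = k = 8
L, mu = 10.0, 1.0
A = laplacian(d, k)
e1 = np.zeros(d); e1[0] = 1.0
i = np.arange(1, k + 1)

# mu = 0: the critical point solves A x = e_1; compare with x_i = 1 - i/(k+1)
x_conv = np.linalg.solve(A, e1)
assert np.allclose(x_conv, 1 - i / (k + 1))

# mu > 0: the critical point solves ((L-mu)/4 A + mu I) x = (L-mu)/4 e_1
H = (L - mu) / 4 * A + mu * np.eye(d)
x_sc = np.linalg.solve(H, (L - mu) / 4 * e1)
kappa = L / mu
q = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
formula = (q**i - q**(2 * (k + 1) - i)) / (1 - q**(2 * (k + 1)))
assert np.allclose(x_sc, formula)
print("Lemma 1 formulas check out")
```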
Proof.
We drop the exponents in the definition of $f^{k,\mu}$ and simply write $f$. The function being a quadratic form, it is $C^\infty$ and its partial derivatives read
$$\frac{\partial f}{\partial x_1}(x) = \frac{L-\mu}{4}(2x_1 - x_2 - 1) + \mu x_1, \qquad \frac{\partial f}{\partial x_i}(x) = \frac{L-\mu}{4}(2x_i - x_{i-1} - x_{i+1}) + \mu x_i \quad \text{for } 1 < i < k,$$
$$\frac{\partial f}{\partial x_k}(x) = \frac{L-\mu}{4}(2x_k - x_{k-1}) + \mu x_k, \qquad \frac{\partial f}{\partial x_i}(x) = \mu x_i \quad \text{for } i > k.$$
Thus, the Hessian matrix is given (for any $x \in \mathbb{R}^d$) by
$$\nabla^2 f(x) = \frac{L-\mu}{4}\, A_k + \mu\, I_d,$$
where $A_k$ is a (thresholded) discrete Laplacian operator with Dirichlet boundary conditions that is tridiagonal:
$$(A_k)_{ii} = 2 \ \text{for } i \leq k, \qquad (A_k)_{i,i+1} = (A_k)_{i+1,i} = -1 \ \text{for } i \leq k-1,$$
and all other entries are $0$.
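As a numerical sanity check (a sketch, assuming the tridiagonal $A_k$ described above; names are ours), one can verify that the eigenvalues of $A_k$ lie in $[0, 4]$, so that the Hessian $\frac{L-\mu}{4}A_k + \mu I_d$ has spectrum in $[\mu, L]$:

```python
import numpy as np

def laplacian(d, k):
    """(Thresholded) tridiagonal Dirichlet Laplacian acting on the first k coordinates."""
    A = np.zeros((d, d))
    for i in range(k):
        A[i, i] = 2.0
    for i in range(k - 1):
        A[i, i + 1] = A[i + 1, i] = -1.0
    return A

d, k, L, mu = 15, 10, 5.0, 0.5
A = laplacian(d, k)
eig_A = np.linalg.eigvalsh(A)
assert eig_A.min() >= -1e-12 and eig_A.max() <= 4.0        # 0 <= A_k <= 4 I

H = (L - mu) / 4.0 * A + mu * np.eye(d)                    # Hessian of f^{k,mu}
eig_H = np.linalg.eigvalsh(H)
assert mu - 1e-9 <= eig_H.min() and eig_H.max() <= L + 1e-9  # mu I <= H <= L I
print("spectrum of Hessian within", [round(eig_H.min(), 3), round(eig_H.max(), 3)])
```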
Observe that we have (since $f$ is a quadratic form)
$$f(x) = \frac{1}{2}\, x^\top \nabla^2 f\, x - \frac{L-\mu}{4}\, x_1.$$
Note that:

1. The Hessian is positive definite (resp. positive semi-definite) if $\mu > 0$ (resp. $\mu = 0$). Indeed,
$$x^\top A_k\, x = x_1^2 + \sum_{i=1}^{k-1}(x_i - x_{i+1})^2 + x_k^2 \geq 0.$$
Since $x^\top \nabla^2 f\, x = \frac{L-\mu}{4}\, x^\top A_k\, x + \mu \|x\|^2$ for any $x \in \mathbb{R}^d$, the result follows depending on the value of $\mu$.

2. Since $(a-b)^2 \leq 2a^2 + 2b^2$, we have
$$x^\top A_k\, x \leq x_1^2 + \sum_{i=1}^{k-1}\left(2x_i^2 + 2x_{i+1}^2\right) + x_k^2 \leq 4\|x\|^2.$$
Hence,
$$x^\top \nabla^2 f\, x \leq \left(\frac{L-\mu}{4}\cdot 4 + \mu\right)\|x\|^2 = L\,\|x\|^2.$$
Thus, we have $\nabla^2 f \preceq L\, I_d$: the gradient of $f$ is $L$-Lipschitz.
Let us characterize the (unique) solution of the minimization of $f$ over $\mathbb{R}^d$. We aim to solve $\nabla f(x) = 0$ to find a critical point (which will be a minimum since we just proved that the Hessian is at least positive semi-definite), that is,
$$\frac{L-\mu}{4}\left(A_k\, x - e_1\right) + \mu\, x = 0.$$
Projecting this relation on each coordinate $1 < i < k$, we get that
$$\frac{L-\mu}{4}\left(2x_i - x_{i-1} - x_{i+1}\right) + \mu\, x_i = 0,$$
which leads to
$$x_{i+1} - 2\,\frac{L+\mu}{L-\mu}\, x_i + x_{i-1} = 0.$$
Similarly, we have
$$2\,\frac{L+\mu}{L-\mu}\, x_1 - x_2 = 1, \qquad 2\,\frac{L+\mu}{L-\mu}\, x_k - x_{k-1} = 0, \qquad x_i = 0 \ \text{for } i > k.$$
Consider $(\tilde{x}_i)_{0 \leq i \leq k+1}$ defined by $\tilde{x}_i = x_i$ for $1 \leq i \leq k$, $\tilde{x}_0 = 1$ and $\tilde{x}_{k+1} = 0$. We have the relation
$$\frac{L-\mu}{4}\left(2\tilde{x}_i - \tilde{x}_{i-1} - \tilde{x}_{i+1}\right) + \mu\, \tilde{x}_i = 0 \quad \text{for all } 1 \leq i \leq k.$$
We can rewrite it as the second-order linear recursion $\tilde{x}_{i+1} - 2\,\frac{L+\mu}{L-\mu}\,\tilde{x}_i + \tilde{x}_{i-1} = 0$. The associated characteristic polynomial is $X^2 - 2\,\frac{L+\mu}{L-\mu}\,X + 1$, whose discriminant is given by
$$\Delta = 4\left(\frac{L+\mu}{L-\mu}\right)^2 - 4 = \frac{16\,L\mu}{(L-\mu)^2} \geq 0.$$
We distinguish two cases:

1. If $\mu = 0$, then $\Delta = 0$ and the unique root is given by $r = 1$.

2. If $\mu > 0$, then $\Delta > 0$ and the roots are given by
$$r = \frac{L + \mu \pm 2\sqrt{L\mu}}{L - \mu} = \frac{(\sqrt{L} \pm \sqrt{\mu})^2}{(\sqrt{L} - \sqrt{\mu})(\sqrt{L} + \sqrt{\mu})},$$
that is, $q = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ and $q^{-1} = \frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}$, where $\kappa = L/\mu$.
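A quick numerical check of this root computation (a sketch, assuming the characteristic polynomial $X^2 - 2\frac{L+\mu}{L-\mu}X + 1$ as reconstructed above):

```python
import numpy as np

L, mu = 10.0, 1.0
kappa = L / mu
q = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
c = 2 * (L + mu) / (L - mu)          # middle coefficient of X^2 - c X + 1

# q and 1/q should both be roots of the characteristic polynomial
for r in (q, 1 / q):
    assert abs(r**2 - c * r + 1) < 1e-12
# consistency: product of the roots is the constant term, sum is c
assert abs(q * (1 / q) - 1) < 1e-12
assert abs(q + 1 / q - c) < 1e-12
print("roots verified")
```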
Case $\mu = 0$. We have the affine relation $\tilde{x}_i = a + b\,i$ with constraints $\tilde{x}_0 = 1$ and $\tilde{x}_{k+1} = 0$. In turn, we have $a = 1$ and $b = -\frac{1}{k+1}$, and thus
$$x_{\star,i} = 1 - \frac{i}{k+1} \quad \text{for } 1 \leq i \leq k.$$
The associated optimal value is given by
$$f_\star = \frac{L}{4}\left(\frac{1}{2}\, x_\star^\top A_k\, x_\star - x_{\star,1}\right) = -\frac{L}{8}\, x_{\star,1} = \frac{L}{8}\left(\frac{1}{k+1} - 1\right),$$
where we used that $A_k\, x_\star = e_1$.
Case $\mu > 0$. The solution can be written as
$$\tilde{x}_i = a\, q^i + b\, q^{-i}.$$
Thus, we have $a + b = \tilde{x}_0 = 1$ and $a\, q^{k+1} + b\, q^{-(k+1)} = \tilde{x}_{k+1} = 0$, hence $b = -a\, q^{2(k+1)}$ and $a = \frac{1}{1 - q^{2(k+1)}}$, and in turn we have
$$\tilde{x}_i = \frac{q^i - q^{2(k+1)-i}}{1 - q^{2(k+1)}}.$$
Hence,
$$x_{\star,i} = \frac{q^i - q^{2(k+1)-i}}{1 - q^{2(k+1)}} \quad \text{for } 1 \leq i \leq k, \qquad x_{\star,i} = 0 \quad \text{for } i > k.$$
∎
Proof of Theorem 1.
We restrict our attention to the case where $x_0 = 0$, w.l.o.g. Indeed, if $x_0 \neq 0$, we can set $\tilde{f}(x) = f(x + x_0)$ and the following proof carries over. Let $t \leq (d-1)/2$ and set $k = 2t+1$ and $f = f^{2t+1,0}$.
Remark that, since $x_t \in \operatorname{span}(e_1, \ldots, e_t)$,
$$f(x_t) \geq \min\left\{ f^{2t+1,0}(x) \,:\, x \in \operatorname{span}(e_1, \ldots, e_t)\right\} = f^{t,0}_\star.$$
Using Lemma 1, we have on one hand that $f^{t,0}_\star = \frac{L}{8}\left(\frac{1}{t+1} - 1\right)$, and then
$$f(x_t) - f_\star \geq f^{t,0}_\star - f^{2t+1,0}_\star = \frac{L}{8}\left(\frac{1}{t+1} - \frac{1}{2t+2}\right) = \frac{L}{16(t+1)}.$$
On the other hand,
$$\|x_0 - x_\star\|^2 = \|x_\star\|^2 = \sum_{i=1}^{2t+1}\left(1 - \frac{i}{2t+2}\right)^2 \leq \frac{2(t+1)}{3}.$$
Thus,
$$f(x_t) - f_\star \geq \frac{L}{16(t+1)} = \frac{3L}{32(t+1)^2}\cdot\frac{2(t+1)}{3} \geq \frac{3L\,\|x_0 - x_\star\|^2}{32(t+1)^2},$$
which proves (2).
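The lower bound of Theorem 1 can be checked numerically for gradient descent, which satisfies Assumption 1 (a sketch, assuming the quadratic (1) with $\mu = 0$, $k = 2t+1$, and the closed-form optimum of Lemma 1; helper names are ours):

```python
import numpy as np

def laplacian(d, k):
    """(Thresholded) tridiagonal Dirichlet Laplacian acting on the first k coordinates."""
    A = np.zeros((d, d))
    for i in range(k):
        A[i, i] = 2.0
    for i in range(k - 1):
        A[i, i + 1] = A[i + 1, i] = -1.0
    return A

t = 5
k = d = 2 * t + 1
L = 2.0
A = laplacian(d, k)
e1 = np.zeros(d); e1[0] = 1.0
f = lambda x: (L / 4) * (0.5 * x @ A @ x - x[0])

x = np.zeros(d)                                  # x_0 = 0
for _ in range(t):                               # t gradient-descent steps, step size 1/L
    x = x - (1 / L) * (L / 4) * (A @ x - e1)

x_star = np.linalg.solve(A, e1)                  # minimizer from Lemma 1
gap = f(x) - f(x_star)
bound = 3 * L * np.linalg.norm(x_star) ** 2 / (32 * (t + 1) ** 2)
assert gap >= bound                              # the lower bound (2) holds
print(f"gap = {gap:.4f} >= bound = {bound:.4f}")
```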
∎
Proof of Theorem 2.
The proof follows the same strategy as before, but we start with a bound on the iterates instead of the objective function. Assume that $x_0 = 0$; otherwise let $\tilde{f}(x) = f(x + x_0)$. Consider $k = d$ and $f = f^{d,\mu}$. We rewrite the $i$-th coordinate of $x_\star$ given by Lemma 1 as
$$x_{\star,i} = \frac{q^i - q^{2(d+1)-i}}{1 - q^{2(d+1)}}.$$
On one hand, we have:
$$\|x_t - x_\star\|^2 \geq \sum_{i=t+1}^{d} x_{\star,i}^2,$$
where we used that for all $i > t$, we have
$$x_{t,i} = 0, \quad \text{since } x_t \in \operatorname{span}(e_1, \ldots, e_t).$$
Bounding the tail of the geometric sums, we obtain
$$\|x_t - x_\star\|^2 \geq \frac{1}{2}\, q^{2t}\, \|x_0 - x_\star\|^2. \qquad (6)$$
The bound (5) then follows from $\mu$-strong convexity of $f$: $f(x_t) - f(x_\star) \geq \frac{\mu}{2}\,\|x_t - x_\star\|^2$. ∎
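As a numerical sketch of this argument (assuming the minimizer formula of Lemma 1; parameter values are illustrative), we can check the support inequality and the resulting geometric decay for gradient descent:

```python
import numpy as np

def laplacian(d, k):
    """(Thresholded) tridiagonal Dirichlet Laplacian acting on the first k coordinates."""
    A = np.zeros((d, d))
    for i in range(k):
        A[i, i] = 2.0
    for i in range(k - 1):
        A[i, i + 1] = A[i + 1, i] = -1.0
    return A

d = k = 12
L, mu, t = 10.0, 1.0, 3
A = laplacian(d, k)
e1 = np.zeros(d); e1[0] = 1.0
H = (L - mu) / 4 * A + mu * np.eye(d)            # Hessian of f^{d,mu}
x_star = np.linalg.solve(H, (L - mu) / 4 * e1)   # unique minimizer

x = np.zeros(d)
for _ in range(t):                               # gradient descent, step size 1/L
    x = x - (1 / L) * (H @ x - (L - mu) / 4 * e1)

# support argument: coordinates beyond t of x_t vanish, so the tail of x_star
# lower-bounds the distance to the optimum
assert np.all(x[t:] == 0.0)
dist2 = np.linalg.norm(x - x_star) ** 2
tail2 = np.sum(x_star[t:] ** 2)
assert dist2 >= tail2 > 0

kappa = L / mu
q = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
# the tail decays essentially like q^{2t} ||x_star||^2, as in (6)
assert dist2 >= 0.5 * q ** (2 * t) * np.linalg.norm(x_star) ** 2
print(f"dist^2 = {dist2:.2e}, tail^2 = {tail2:.2e}")
```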
More refined versions of these lower bounds, which provide tighter constants, can be found in Drori and Taylor (2022). However, these improvements come at the price of significantly more involved and technical proofs.
References
- Y. Drori and A. Taylor (2022). On the oracle complexity of smooth strongly convex minimization. Journal of Complexity 68, 101590.
- A. Nemirovski and D. Yudin (1983). Problem Complexity and Method Efficiency in Optimization. Wiley.
- Y. Nesterov (2018). Lectures on Convex Optimization. Vol. 137, Springer.