Most of the bounds described in the optimization literature are upper bounds of the form $f(x_t) - f(x^\star) = O(1/t)$ or $f(x_t) - f(x^\star) = O(\rho^t)$ for some $\rho \in (0,1)$.
But what about finding the reverse-side inequality, i.e., a lower bound?
In other words, what is the best we can achieve with a “gradient-descent-like”
algorithm?
To formalize this notion, we consider algorithms, here sequences $(x_t)_{t \geq 0}$, that build upon the
previous iterates with access only to a first-order oracle:
Assumption 1 (First-order method). We assume that a
first-order method is given by a sequence $(x_t)_{t \geq 0}$ such that, for all $t \geq 0$,
$$x_{t+1} \;\in\; x_0 + \mathrm{span}\big\{\nabla f(x_0), \nabla f(x_1), \dots, \nabla f(x_t)\big\}.$$
Note that one can think of more general ways to define
first-order methods, but for the results we aim to prove,
this level of generality is enough.
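For concreteness, here is a minimal Python sketch of a method satisfying Assumption 1 (plain gradient descent; the oracle `grad`, the step size and the iteration count are illustrative placeholders): every iterate is obtained from $x_0$ by adding a linear combination of the gradients observed so far.

```python
import numpy as np

def first_order_method(grad, x0, step=0.1, n_iter=50):
    """Plain gradient descent. It satisfies Assumption 1:
    x_{t+1} = x_t - step * grad(x_t) lies in x_0 + span{grad(x_0), ..., grad(x_t)}."""
    x = np.asarray(x0, dtype=float).copy()
    gradients = []            # first-order oracle answers observed so far
    for _ in range(n_iter):
        g = grad(x)
        gradients.append(g)
        x = x - step * g      # adds a linear combination of past gradients to x_0
    return x, gradients
```

Momentum or conjugate-gradient-type schemes also fit this template, since their updates only recombine past gradients.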
With this assumption in mind, how can we design a function that is adversarial to
this type of scheme? The idea is to find a function such that the
gradient at each step gives minimal
information, i.e., it has as few nonzero partial
derivatives as possible. A way to define such a function is to “stack” quadratic
terms with increasing dependencies between the variables:
$$f^{(k)}_{\mu,L}(x) \;=\; \frac{L-\mu}{4}\left(\frac{1}{2}\Big[x_1^2 + \sum_{i=1}^{k-1}(x_i - x_{i+1})^2 + x_k^2\Big] - x_1\right) + \frac{\mu}{2}\|x\|^2, \qquad (0.2)$$
where $x = (x_1, \dots, x_d) \in \mathbb{R}^d$ and $1 \le k \le d$, $0 \le \mu \le L$. Set $K_s = \mathrm{span}\{e_1, \dots, e_s\}$ for $1 \le s \le d$, with the convention $K_0 = \{0\}$. Observe that if we start from $x_0 = 0$, then $x_1 \in x_0 + \mathrm{span}\{\nabla f^{(k)}_{\mu,L}(x_0)\} \subset K_1$, that is, only the first
coordinate is updated after one iteration. What happens now that we have
access to $x_0$, $x_1$ and $\nabla f^{(k)}_{\mu,L}(x_0)$? For an algorithm
satisfying Assumption 1, we look at $\nabla f^{(k)}_{\mu,L}(x_1)$. One can check that for any $s \geq 0$ and any $x \in K_s$, $\nabla f^{(k)}_{\mu,L}(x) \in K_{s+1}$, and by an easy induction, we have $x_t \in K_t$ for all $t$: any first-order method will only be able to update at
most one new coordinate at each iteration; this is illustrated numerically just below. We are going to prove the
following result.
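As a sanity check of this propagation argument (a sketch assuming the construction (0.2); the dimension, constants and the step size $1/L$ are illustrative), the following snippet runs gradient descent from $x_0 = 0$ and prints the index of the largest nonzero coordinate after each iteration:

```python
import numpy as np

def grad_f(x, k, L=1.0, mu=0.0):
    """Gradient of the stacked quadratic (0.2):
    grad f(x) = (L - mu)/4 * (A_k x - e_1) + mu * x,
    with A_k the tridiagonal Laplacian acting on the first k coordinates only."""
    d = x.shape[0]
    A = np.zeros((d, d))
    A[:k, :k] = 2 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)
    e1 = np.zeros(d)
    e1[0] = 1.0
    return (L - mu) / 4 * (A @ x - e1) + mu * x

d = k = 10
x = np.zeros(d)                     # x_0 = 0
for t in range(1, 6):
    x = x - 1.0 * grad_f(x, k)      # gradient step with step size 1/L (here L = 1)
    print(t, np.flatnonzero(np.abs(x) > 1e-12).max() + 1)
    # prints: 1 1, 2 2, 3 3, 4 4, 5 5 -> after t steps only the first t coordinates are nonzero
```

Any method satisfying Assumption 1 exhibits the same behavior, since its iterates stay in $K_t$.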
Theorem 1 (Lower bound for smooth convex
optimization). For any $L > 0$, $t \geq 1$, $d \geq 2t+1$,
$x_0 \in \mathbb{R}^d$, there exists a
convex function $f: \mathbb{R}^d \to \mathbb{R}$ that is $C^\infty$ and $L$-smooth such that any sequence $(x_s)_{s \geq 0}$
satisfying Assumption 1 is such that
$$f(x_t) - f(x^\star) \;\geq\; \frac{3 L \,\|x_0 - x^\star\|^2}{32\,(t+1)^2}, \qquad (0.6)$$
where $x^\star$ is a minimizer of $f$.
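For comparison, recall the standard upper bounds (stated here for reference; exact constants vary across references): gradient descent with step size $1/L$ on an $L$-smooth convex function satisfies
$$f(x_t) - f(x^\star) \;\leq\; \frac{L\,\|x_0 - x^\star\|^2}{2\,t},$$
while accelerated first-order methods achieve $f(x_t) - f(x^\star) = O\big(L\,\|x_0 - x^\star\|^2 / t^2\big)$, matching the lower bound of Theorem 1 up to a constant.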
Remark that the rate $1/t^2$ is
not achieved by gradient descent! We also have a lower
bound for the class of strongly convex functions.
Theorem 2 (Lower bound for smooth strongly convex
optimization). For any $0 < \mu < L$, $d \geq 1$, $x_0 \in \mathbb{R}^d$,
there exists a $\mu$-strongly convex
function $f: \mathbb{R}^d \to \mathbb{R}$ that is $C^\infty$ and $L$-smooth such that any sequence $(x_t)_{t \geq 0}$
satisfying Assumption 1 is such that for all $t \geq 1$, we have
$$\|x_t - x^\star\|^2 \;\geq\; \frac{1}{2}\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2t}\|x_0 - x^\star\|^2, \qquad (0.8)$$
$$f(x_t) - f(x^\star) \;\geq\; \frac{\mu}{4}\left(\frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\right)^{2t}\|x_0 - x^\star\|^2, \qquad (0.9)$$
where $\kappa = L/\mu$ and $x^\star$ is the unique minimizer of
$f$.
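For comparison (again, standard upper bounds stated for reference; constants vary across references): gradient descent with step size $1/L$ on a $\mu$-strongly convex, $L$-smooth function satisfies
$$\|x_t - x^\star\|^2 \;\leq\; \Big(1 - \frac{1}{\kappa}\Big)^{t}\,\|x_0 - x^\star\|^2,$$
while accelerated methods contract at the rate $\big(1 - \tfrac{1}{\sqrt{\kappa}}\big)^{t}$, which matches the lower bound above up to constants since $\big(\tfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}\big)^{2} = 1 - \Theta\big(\tfrac{1}{\sqrt{\kappa}}\big)$ for large $\kappa$.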
Note that it is common in the literature to see Theorem 2 stated without the
extra factor appearing in (0.8). This is
an artifact of the proof, since we prove this result in the finite-dimensional
case whereas Nesterov (2018) works in the infinite-dimensional
space $\ell^2(\mathbb{N})$. Before
proving these important results, due to Nemirovski and Yudin (1983), we first
prove a preliminary lemma.
Lemma 1 (Minimizers of $f^{(k)}_{\mu,L}$). Let $0 \leq \mu < L$ and $1 \leq k \leq d$. Then $f^{(k)}_{\mu,L}$
defined in (0.2) is a $\mu$-strongly convex (merely convex if
$\mu = 0$) $C^\infty$ function such that its
gradient is $L$-Lipschitz.
If $\mu = 0$, it has a unique
minimizer $x^\star$ over $K_k$ satisfying
$$x^\star_i = 1 - \frac{i}{k+1} \ \ \text{for } 1 \leq i \leq k, \qquad f^{(k)}_{0,L}(x^\star) = \frac{L}{8}\Big(\frac{1}{k+1} - 1\Big), \qquad \|x^\star\|^2 \leq \frac{k+1}{3}.$$
If $\mu > 0$, it has a unique
minimizer $x^\star$ satisfying
$$x^\star_i = \frac{q^{\,i} - q^{\,2(k+1)-i}}{1 - q^{\,2(k+1)}}$$
for $1 \leq i \leq k$, and $x^\star_i = 0$ for $i > k$, where
$q = \dfrac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ and $\kappa = L/\mu$.
Proof. We drop the exponents in the definition of $f = f^{(k)}_{\mu,L}$. The function being a quadratic form, it is $C^\infty$ and its partial derivatives
read
$$\partial_i f(x) = \frac{L-\mu}{4}\big(2x_i - x_{i-1} - x_{i+1} - \mathbf{1}_{\{i=1\}}\big) + \mu\,x_i \ \ (1 \leq i \leq k), \qquad \partial_i f(x) = \mu\,x_i \ \ (i > k),$$
with the convention $x_0 = x_{k+1} = 0$. Thus, the Hessian matrix is given
(for any $x$) by
$$\nabla^2 f(x) \;=\; \frac{L-\mu}{4}\,A_k + \mu\,\mathrm{Id},$$
where $A_k$ is a (thresholded)
discrete Laplacian operator with Dirichlet boundary conditions, that is, the
tridiagonal matrix
$$A_k \;=\;
\begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & \ddots & & \\
 & \ddots & \ddots & -1 & \\
 & & -1 & 2 & \\
 & & & & 0_{d-k}
\end{pmatrix},$$
whose tridiagonal block is $k \times k$ and whose remaining block is zero. Observe that we have (since $f$ is a quadratic form)
$$f(x) \;=\; \frac{1}{2}\big\langle \nabla^2 f\,x,\,x\big\rangle - \frac{L-\mu}{4}\,x_1.$$
Note that:
- The Hessian is positive definite (resp. positive semi-definite) if $\mu > 0$ (resp. $\mu = 0$). Indeed,
$$\big\langle \nabla^2 f\,x,\,x\big\rangle \;=\; \frac{L-\mu}{4}\,\langle A_k x, x\rangle + \mu\,\|x\|^2, \qquad \langle A_k x, x\rangle = x_1^2 + \sum_{i=1}^{k-1}(x_i - x_{i+1})^2 + x_k^2.$$
Since $\langle A_k x, x\rangle \geq 0$ for any
$x$, the result follows depending on
the value of $\mu$.
- Since $(x_i - x_{i+1})^2 \leq 2x_i^2 + 2x_{i+1}^2$, we have
$$\langle A_k x, x\rangle \;\leq\; x_1^2 + 2\sum_{i=1}^{k-1}\big(x_i^2 + x_{i+1}^2\big) + x_k^2 \;\leq\; 4\,\|x\|^2.$$
Hence, $\mu\,\mathrm{Id} \preceq \nabla^2 f \preceq L\,\mathrm{Id}$.
Thus, $f$ is $\mu$-strongly convex and its gradient is $L$-Lipschitz.
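As a quick numerical sanity check of the two-sided bound $\mu\,\mathrm{Id} \preceq \nabla^2 f \preceq L\,\mathrm{Id}$ (a sketch assuming the Hessian expression above; the dimension and constants are arbitrary):

```python
import numpy as np

d, k, L, mu = 12, 8, 1.0, 0.1
A = np.zeros((d, d))
A[:k, :k] = 2 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)   # thresholded Dirichlet Laplacian
H = (L - mu) / 4 * A + mu * np.eye(d)                           # Hessian of f^{(k)}_{mu,L}
eigs = np.linalg.eigvalsh(H)
print(eigs.min() >= mu - 1e-12, eigs.max() <= L + 1e-12)        # True True
```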
Let us characterize the (unique) solution of the minimization of $f$ over $\mathbb{R}^d$. We aim to solve $\nabla f(x) = 0$ to find a
critical point (which will be a minimum since we just proved that the
Hessian is at least positive semi-definite), that is,
$$\frac{L-\mu}{4}\,\big(A_k x - e_1\big) + \mu\,x \;=\; 0.$$
Projecting this relation on each coordinate $1 < i < k$, we get that
$$\frac{L-\mu}{4}\,\big(2x_i - x_{i-1} - x_{i+1}\big) + \mu\,x_i = 0,$$
which leads to
$$x_{i+1} - 2\,\frac{L+\mu}{L-\mu}\,x_i + x_{i-1} = 0.$$
Similarly, we have, for the coordinates $i = 1$, $i = k$ and $i > k$,
$$\frac{L-\mu}{4}\big(2x_1 - x_2 - 1\big) + \mu\,x_1 = 0, \qquad \frac{L-\mu}{4}\big(2x_k - x_{k-1}\big) + \mu\,x_k = 0, \qquad \mu\,x_i = 0.$$
Consider $(u_i)_{0 \leq i \leq k+1}$ defined by $u_i = x_i$ for $1 \leq i \leq k$, and $u_0 = 1$ and $u_{k+1} = 0$. We have the relation
$$\frac{L-\mu}{4}\big(2u_i - u_{i-1} - u_{i+1}\big) + \mu\,u_i = 0, \qquad 1 \leq i \leq k.$$
We can rewrite it as the second-order linear
recursion $u_{i+1} - 2\frac{L+\mu}{L-\mu}\,u_i + u_{i-1} = 0$. The associated characteristic polynomial is
$$X^2 - 2\,\frac{L+\mu}{L-\mu}\,X + 1,$$
whose discriminant is given by
$$\Delta \;=\; 4\left(\frac{L+\mu}{L-\mu}\right)^2 - 4 \;=\; \frac{16\,L\mu}{(L-\mu)^2}.$$
We distinguish two
cases:
- If $\mu = 0$, then $\Delta = 0$ and the unique
(double) root is given by $\rho = 1$.
- If $\mu > 0$, then $\Delta > 0$ and the two
roots are given by $q = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ and $q^{-1} = \frac{\sqrt{\kappa}+1}{\sqrt{\kappa}-1}$, where $\kappa = L/\mu$ (see the computation just below).
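For completeness, here is the computation of the roots in the case $\mu > 0$ (a direct application of the quadratic formula to the characteristic polynomial above, with $\kappa = L/\mu$):
$$\rho_{\pm} \;=\; \frac{(L+\mu) \pm 2\sqrt{L\mu}}{L-\mu} \;=\; \frac{\big(\sqrt{L} \pm \sqrt{\mu}\big)^2}{\big(\sqrt{L}-\sqrt{\mu}\big)\big(\sqrt{L}+\sqrt{\mu}\big)} \;=\; \frac{\sqrt{\kappa} \pm 1}{\sqrt{\kappa} \mp 1},$$
so that the two roots are indeed $q = \frac{\sqrt{\kappa}-1}{\sqrt{\kappa}+1}$ and $1/q$.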
Case $\mu = 0$. We have the
affine relation $u_i = u_0 + i\,(u_1 - u_0)$ (double root $1$),
with constraints $u_0 = 1$ and
$u_{k+1} = 0$ (the coordinates $i > k$ play no role when $\mu = 0$, so we restrict to $K_k$). In turn,
we have
$$x^\star_i = u_i = 1 - \frac{i}{k+1}, \qquad 1 \leq i \leq k,$$
and thus
$$\|x^\star\|^2 = \sum_{i=1}^{k}\Big(1 - \frac{i}{k+1}\Big)^2 = \frac{1}{(k+1)^2}\sum_{j=1}^{k} j^2 \;\leq\; \frac{k+1}{3}.$$
The associated optimal value is
given by
$$f(x^\star) \;=\; \frac{1}{2}\big\langle \nabla^2 f\,x^\star,\,x^\star\big\rangle - \frac{L}{4}\,x^\star_1 \;=\; -\frac{L}{8}\,x^\star_1 \;=\; \frac{L}{8}\Big(\frac{1}{k+1} - 1\Big),$$
where we used that $\nabla^2 f\,x^\star = \frac{L}{4}\,A_k x^\star = \frac{L}{4}\,e_1$ at the critical point.
Case $\mu > 0$. The
solution of the recursion can be written as $u_i = A\,q^{\,i} + B\,q^{-i}$. Thus, we have $A + B = u_0 = 1$ and $A\,q^{\,k+1} + B\,q^{-(k+1)} = u_{k+1} = 0$, hence $B = -A\,q^{\,2(k+1)}$ and $A = \frac{1}{1 - q^{\,2(k+1)}}$, and in
turn we have
$$x^\star_i = u_i = \frac{q^{\,i} - q^{\,2(k+1)-i}}{1 - q^{\,2(k+1)}}, \qquad 1 \leq i \leq k.$$
Hence, together with $x^\star_i = 0$ for $i > k$ (from $\mu\,x^\star_i = 0$), the claim follows. ◻
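To double-check the closed form (a numerical sketch, assuming the critical-point equation $\big[\frac{L-\mu}{4}A_k + \mu\,\mathrm{Id}\big]x^\star = \frac{L-\mu}{4}e_1$ derived above; sizes and constants are arbitrary):

```python
import numpy as np

k = 15
L, mu = 1.0, 0.05
kappa = L / mu
q = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)

# Solve the critical-point equation directly (here d = k, so A_k is the full Laplacian).
A = 2 * np.eye(k) - np.eye(k, k=1) - np.eye(k, k=-1)
H = (L - mu) / 4 * A + mu * np.eye(k)
e1 = np.zeros(k)
e1[0] = 1.0
x_solve = np.linalg.solve(H, (L - mu) / 4 * e1)

# Compare with the closed form x*_i = (q^i - q^(2(k+1)-i)) / (1 - q^(2(k+1))).
i = np.arange(1, k + 1)
x_closed = (q**i - q**(2 * (k + 1) - i)) / (1 - q**(2 * (k + 1)))
print(np.allclose(x_solve, x_closed))   # True
```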
We now turn to the proofs of Theorem 1 and Theorem 2.
Proof of Theorem 1. We restrict our attention to the case
where $x_0 = 0$ w.l.o.g. Indeed, if
$x_0 \neq 0$, we can set $\tilde f(x) = f(x - x_0)$ and
the following proof carries over. Let $k = 2t+1$ (recall $d \geq 2t+1$) and set $f = f^{(2t+1)}_{0,L}$.
Remark that $x_t \in K_t$. Using Lemma 1, we have on one hand that $f(x^\star) = \frac{L}{8}\big(\frac{1}{2t+2} - 1\big)$, and
then
$$\|x_0 - x^\star\|^2 = \|x^\star\|^2 \;\leq\; \frac{2t+2}{3}.$$
On the other hand, since $x_t \in K_t$ and $f^{(2t+1)}_{0,L}$ coincides with $f^{(t)}_{0,L}$ on $K_t$,
$$f(x_t) \;\geq\; \min_{x \in K_t} f^{(2t+1)}_{0,L}(x) \;=\; \frac{L}{8}\Big(\frac{1}{t+1} - 1\Big).$$
Thus,
$$f(x_t) - f(x^\star) \;\geq\; \frac{L}{8}\Big(\frac{1}{t+1} - \frac{1}{2t+2}\Big) \;=\; \frac{L}{16(t+1)} \;\geq\; \frac{3 L\,\|x_0 - x^\star\|^2}{32\,(t+1)^2},$$
that proves (0.6). ◻
Proof of Theorem 2. The proof follows the same strategy as
before, but we start with a bound on the iterates instead of the
objective values. Assume that $x_0 = 0$, otherwise let $\tilde f(x) = f(x - x_0)$. Consider $f = f^{(d)}_{\mu,L}$ and $\kappa = L/\mu$. We rewrite the $i$-th coordinate of $x^\star$, given by Lemma 1, as
$$x^\star_i \;=\; \frac{q^{\,i} - q^{\,2(d+1)-i}}{1 - q^{\,2(d+1)}}, \qquad 1 \leq i \leq d.$$
On one hand, we have
$$\|x_0 - x^\star\|^2 \;=\; \sum_{i=1}^{d} (x^\star_i)^2 \;\leq\; \sum_{i=1}^{d} q^{\,2i},$$
where we used
that for all $i$, we
have $x^\star_i \leq q^{\,i}$. On the other hand, observe that for $i \geq t+1$, one has $(x_t)_i = 0$ since $x_t \in K_t$, thus
$$\|x_t - x^\star\|^2 \;\geq\; \sum_{i=t+1}^{d} (x^\star_i)^2.$$
Bounding the tails of the two geometric sums above and combining the resulting estimates proves (0.8). The value
bound (0.9) is
obtained by applying the inequality $f(x) - f(x^\star) \geq \frac{\mu}{2}\,\|x - x^\star\|^2$, valid for any $\mu$-strongly convex function, to $x = x_t$. ◻
More refined versions of these lower bounds, which provide tighter
constants, can be found in Drori and Taylor (2022). However, these improvements come
at the price of significantly more involved and technical proofs.
Drori, Yoel, and Adrien Taylor. 2022.
“On the Oracle Complexity of
Smooth Strongly Convex Minimization.” Journal of
Complexity 68 (February): 101590.
https://doi.org/10.1016/j.jco.2021.101590.
Nemirovski, Arkadi, and David Yudin. 1983. Problem
Complexity and Method Efficiency in
Optimization. Wiley.
Nesterov, Yurii. 2018. Lectures on Convex Optimization. Vol.
137. Springer.