Introduction to lower bounds in optimization

by Samuel Vaiter

Most of the bounds described in the optimization literature are upper bounds of the form $\|x^{(t)} - x^\star\| \le \alpha(t)$ or $f(x^{(t)}) - f(x^\star) \le \alpha(t)$. But what about finding the reverse inequality $\|x^{(t)} - x^\star\| \ge \beta(t)$? Said otherwise, what can we hope to achieve with a “gradient-descent-like” algorithm?

To formalize this notion, we consider algorithms, here sequences $(x^{(t)})_{t \ge 0}$, that build upon the previous iterates with access only to a first-order oracle:

Assumption 1 (First-order method).

We assume that a first-order method is given by a sequence $(x^{(t)})_{t \ge 0}$ such that

$$x^{(t)} \in x^{(0)} + \operatorname{Span}\left\{\nabla f(x^{(0)}), \dots, \nabla f(x^{(t-1)})\right\}.$$

Note that one can think of more general ways to define first-order methods, but for the results we aim to prove, this level of generality is enough.

With this assumption in mind, how can we design a function that is adversarial to this type of scheme? The idea is to find a function such that the gradient at step $t-1$ gives minimal information, i.e., it has as few nonzero partial derivatives as possible. A way to define such a function is to “stack” quadratic terms with increasing dependencies between variables:

$$f_k^{L,\mu}(x) = \frac{L-\mu}{8}\left((x_1 - 1)^2 + \sum_{i=1}^{k-1}(x_{i+1} - x_i)^2 + x_k^2\right) + \frac{\mu}{2}\|x\|^2, \qquad (1)$$

where $0 \le \mu < L$ and $0 \le k \le d$.

The partial derivatives of $f_k^{L,\mu}$ read

$$\frac{\partial f_k^{L,\mu}}{\partial x_i}(x) = \mu x_i + \frac{L-\mu}{4}\begin{cases} -x_2 + 2x_1 - 1 & \text{if } i = 1,\\ -x_{i+1} + 2x_i - x_{i-1} & \text{if } 2 \le i < k,\\ 2x_k - x_{k-1} & \text{if } i = k,\\ 0 & \text{otherwise.}\end{cases}$$
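To make the construction concrete, here is a small numerical sketch of $f_k^{L,\mu}$ and of the gradient above (NumPy-based; the helper names `worst_case_f` and `worst_case_grad` are mine and purely illustrative):

```python
import numpy as np

def worst_case_f(x, k, L=1.0, mu=0.0):
    """Value of f_k^{L,mu}(x) as in (1); x is a 1-D array with len(x) >= k."""
    quad = (x[0] - 1.0) ** 2 + np.sum((x[1:k] - x[:k - 1]) ** 2) + x[k - 1] ** 2
    return (L - mu) / 8.0 * quad + mu / 2.0 * np.dot(x, x)

def worst_case_grad(x, k, L=1.0, mu=0.0):
    """Gradient of f_k^{L,mu} at x, following the partial derivatives above."""
    g = mu * x.copy()
    c = (L - mu) / 4.0
    for i in range(k):               # 0-based index i corresponds to x_{i+1}
        g[i] += c * 2.0 * x[i]
        if i > 0:
            g[i] -= c * x[i - 1]
        if i < k - 1:
            g[i] -= c * x[i + 1]
    g[0] -= c                        # the constant coming from the (x_1 - 1)^2 term
    return g
```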

Set $f = f_k^{L,\mu}$. Observe that if we start from $x^{(0)} = 0$, then

$$x^{(1)} = x^{(0)} - \eta \nabla f(0) = \eta\, \frac{L-\mu}{4}\, e_1 \in \operatorname{Span}\{e_1\},$$

that is, only the first coordinate is updated after one iteration. What happens now that we have access to $x^{(0)}$, $\nabla f(x^{(0)})$ and $\nabla f(x^{(1)})$? For an algorithm satisfying Assumption 1, we look at

$$x^{(2)} = x^{(0)} + \alpha \nabla f(x^{(0)}) + \beta \nabla f(x^{(1)}).$$

One can check that for any $(\alpha, \beta)$, $x^{(2)} \in \operatorname{Span}\{e_1, e_2\}$, and by an easy induction, we have $x^{(t)} \in \operatorname{Span}\{e_1, \dots, e_t\}$: any first-order method can update at most one new coordinate at each iteration.
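As a quick sanity check, one can run plain gradient descent (one particular instance of Assumption 1) on this function, reusing the sketch above, and verify that after $t$ iterations at most the first $t$ coordinates are nonzero:

```python
# Plain gradient descent with step size 1/L, started at 0, on f_k^{L,0}.
d = k = 21
L = 1.0
x = np.zeros(d)
for t in range(1, 11):
    x = x - (1.0 / L) * worst_case_grad(x, k, L=L, mu=0.0)
    nonzero = np.flatnonzero(np.abs(x) > 1e-12)
    assert nonzero.size <= t and (nonzero.size == 0 or nonzero.max() <= t - 1)
    print(t, nonzero.size)   # the support grows by at most one coordinate per step
```

We are going to prove the following result.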

Theorem 1 (Lower-bound for smooth convex optimization).

For any $d \ge 2$, $x^{(0)} \in \mathbb{R}^d$, $L > 0$ and $t \le (d-1)/2$, there exists a convex function $f$ that is $C^\infty$ and $L$-smooth such that any sequence satisfying Assumption 1 satisfies

$$f(x^{(t)}) - f(x^\star) \ge \frac{3L\, \|x^{(0)} - x^\star\|^2}{32\, (t+1)^2}, \qquad (2)$$

where $x^\star$ is a minimizer of $f$.

Remark that the rate $1/t^2$ is not achieved by gradient descent (there exist algorithms that achieve it, in particular Nesterov’s acceleration method)! We also have a lower bound for the class of strongly convex functions.

Theorem 2 (Lower-bound for smooth strongly convex optimization).

For any $d \ge 2$, $x^{(0)} \in \mathbb{R}^d$ and $L > \mu > 0$, there exists a $\mu$-strongly convex function $f$ that is $C^\infty$ and $L$-smooth such that any sequence satisfying Assumption 1 satisfies, for all $t < (d-1)/2$,

$$\|x^{(t)} - x^\star\|^2 \ge \frac{1}{8}\left(\frac{\sqrt{K_f}-1}{\sqrt{K_f}+1}\right)^{2t} \|x^{(0)} - x^\star\|^2, \qquad (4)$$

$$f(x^{(t)}) - f(x^\star) \ge \frac{\mu}{16}\left(\frac{\sqrt{K_f}-1}{\sqrt{K_f}+1}\right)^{2t} \|x^{(0)} - x^\star\|^2, \qquad (5)$$

where $x^\star$ is the unique minimizer of $f$ and $K_f = L/\mu$ denotes the condition number of $f$.

Note that it is common in the literature to see Theorem 2 without the factor $\frac{1}{8}$. This is an artefact of the proof: we prove this result in the finite-dimensional case, whereas Nesterov (2018) works in the infinite-dimensional space $\ell^2(\mathbb{N})$. Before proving these important results, due to Nemirovski and Yudin (1983), we are going to prove several lemmas.

Lemma 1 (Minimizers of $f_k$).

Let $d \ge 2$, $L > 0$ and $\mu \ge 0$. Then $f_k^{L,\mu}$ defined in (1) is a $\mu$-strongly convex (merely convex if $\mu = 0$) $C^\infty$ function whose gradient is $L$-Lipschitz.

If $\mu = 0$, it admits a minimizer $x^{k,\star}$ satisfying

$$x^{k,\star}_i = \begin{cases} 1 - \frac{i}{k+1} & \text{if } 1 \le i \le k,\\ 0 & \text{otherwise,}\end{cases} \qquad\text{and}\qquad f_k^{L,0}(x^{k,\star}) = \frac{L}{8(k+1)}.$$

If $\mu > 0$, it has a unique minimizer $x^{k,\star}$ satisfying

$$x^{k,\star}_i = \frac{s^{2(k+1)}}{s^{2(k+1)}-1}\, s^{-i} + \frac{1}{1 - s^{2(k+1)}}\, s^{i},$$

for $1 \le i \le k$, and $x^{k,\star}_i = 0$ for $i > k$, where $s = \frac{\sqrt{K_f}+1}{\sqrt{K_f}-1}$.

Proof.

We drop the exponents $L,\mu$ and write $f_k = f_k^{L,\mu}$. The function $f_k$ being a quadratic function, it is $C^\infty$ and its second-order partial derivatives read

$$\frac{\partial^2 f_k}{\partial x_i \partial x_j}(x) = \mu\, 1_{\{i=j\}} + \frac{L-\mu}{4}\begin{cases} 2 & \text{if } i = j \le k,\\ -1 & \text{if } j = i-1 \text{ and } 1 < i \le k,\\ -1 & \text{if } j = i+1 \text{ and } 1 \le i < k,\\ 0 & \text{otherwise.}\end{cases}$$

Thus, the Hessian matrix is given (for any $x \in \mathbb{R}^d$) by

$$\nabla^2 f_k(x) = \mu\, \mathrm{Id}_d + \frac{L-\mu}{4}\, L_k,$$

where $L_k$ is a (truncated) discrete Laplacian operator with Dirichlet boundary conditions, whose only nonzero entries form a tridiagonal $k \times k$ block:

$$L_k = \begin{pmatrix} \begin{pmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{pmatrix} & 0_{k,\,d-k} \\ 0_{d-k,\,k} & 0_{d-k,\,d-k} \end{pmatrix}.$$

Observe that we have (since $f_k$ is a quadratic function)

$$f_k(x) = \frac{1}{2}\left\langle \nabla^2 f_k(x)\, x, x\right\rangle - \frac{L-\mu}{4}\, x_1 + \frac{L-\mu}{8}.$$

Note that:

  1. The Hessian is positive definite (resp. positive semidefinite) if $\mu > 0$ (resp. $\mu = 0$). Indeed,

     $$\left\langle \nabla^2 f_k(x) h, h\right\rangle = \mu\|h\|^2 + \frac{L-\mu}{4}\left\langle L_k h, h\right\rangle.$$

     Since $\langle L_k h, h\rangle = h_1^2 + \sum_{i=1}^{k-1}(h_{i+1}-h_i)^2 + h_k^2 \ge 0$ for any $h$, the result follows depending on the value of $\mu$.

  2. Since $(a-b)^2 \le 2a^2 + 2b^2$, we have

     $$h_1^2 + \sum_{i=1}^{k-1}(h_{i+1}-h_i)^2 + h_k^2 \le h_1^2 + \sum_{i=1}^{k-1}\left(2h_{i+1}^2 + 2h_i^2\right) + h_k^2 \le 4\sum_{i=1}^{k} h_i^2 \le 4\sum_{i=1}^{d} h_i^2 = 4\|h\|^2.$$

     Hence,

     $$\left\langle \nabla^2 f_k(x) h, h\right\rangle \le \mu\|h\|^2 + (L-\mu)\|h\|^2 = L\|h\|^2.$$

Thus, we have $\mu\,\mathrm{Id} \preceq \nabla^2 f_k(x) \preceq L\,\mathrm{Id}$.
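These two bounds are easy to check numerically by assembling the (constant) Hessian and inspecting its spectrum; a minimal sketch with arbitrary values of $d$, $k$, $L$ and $\mu$:

```python
import numpy as np

# Spectrum of the Hessian mu*Id + (L - mu)/4 * L_k for some arbitrary sizes.
d, k, L, mu = 10, 7, 2.0, 0.1
Lap = np.zeros((d, d))
Lap[np.arange(k), np.arange(k)] = 2.0          # diagonal of the Laplacian block
Lap[np.arange(k - 1), np.arange(1, k)] = -1.0  # upper off-diagonal
Lap[np.arange(1, k), np.arange(k - 1)] = -1.0  # lower off-diagonal
H = mu * np.eye(d) + (L - mu) / 4.0 * Lap
eig = np.linalg.eigvalsh(H)
print(eig.min(), eig.max())    # all eigenvalues lie in [mu, L]
```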

Let us now characterize the minimizer $x^{k,\star}$ of $f_k$ over $\mathbb{R}^d$. We aim to solve $\nabla f_k(x^{k,\star}) = 0$ to find a critical point (which will be a minimum since we just proved that the Hessian is positive semidefinite), that is,

$$\mu x^{k,\star} + \frac{L-\mu}{4}\, L_k x^{k,\star} - \frac{L-\mu}{4}\, e_1 = 0.$$

Projecting this relation onto each coordinate $2 \le i \le k-1$, we get

$$-x^{k,\star}_{i+1} + 2x^{k,\star}_i - x^{k,\star}_{i-1} = -\frac{4\mu}{L-\mu}\, x^{k,\star}_i,$$

which leads to

$$x^{k,\star}_i = \frac{1}{2}\, \frac{L-\mu}{L+\mu}\left(x^{k,\star}_{i+1} + x^{k,\star}_{i-1}\right).$$

Similarly, we have

$$x^{k,\star}_1 = \frac{1}{2}\, \frac{L-\mu}{L+\mu}\left(x^{k,\star}_2 + 1\right) \qquad\text{and}\qquad x^{k,\star}_k = \frac{1}{2}\, \frac{L-\mu}{L+\mu}\, x^{k,\star}_{k-1}.$$

Consider $y_0, \dots, y_{k+1}$ defined by $y_i = x^{k,\star}_i$ for $1 \le i \le k$, $y_0 = 1$ and $y_{k+1} = 0$. We have the relation

$$y_i = \alpha\left(y_{i+1} + y_{i-1}\right) \qquad\text{where}\qquad \alpha = \frac{1}{2}\, \frac{L-\mu}{L+\mu} > 0.$$

We can rewrite it as the second-order linear recursion $y_{i+2} - \alpha^{-1} y_{i+1} + y_i = 0$. The associated characteristic polynomial is $P = X^2 - \alpha^{-1} X + 1 \in \mathbb{R}[X]$, whose discriminant is given by

$$\Delta = (\alpha^{-1})^2 - 4 = \frac{16 L\mu}{(L-\mu)^2}.$$

We distinguish two cases:

  1. If $\mu = 0$, then $\Delta = 0$ and the unique (double) root is $r = 1$.

  2. If $\mu > 0$, then the roots are given by

     $$r = \frac{1}{2}\left(\alpha^{-1} - \sqrt{\Delta}\right) = \frac{\sqrt{L/\mu}-1}{\sqrt{L/\mu}+1} = \frac{\sqrt{K_f}-1}{\sqrt{K_f}+1},$$
     $$s = \frac{1}{2}\left(\alpha^{-1} + \sqrt{\Delta}\right) = \frac{\sqrt{K_f}+1}{\sqrt{K_f}-1} = \frac{1}{r}.$$

Case $\mu = 0$. Since $r = 1$ is a double root, we have $y_i = (a + bi)\, r^i = a + bi$ with constraints $y_0 = a = 1$ and $y_{k+1} = a + b(k+1) = 0$. In turn, we have $y_i = 1 - \frac{i}{k+1}$ and thus

$$x^{k,\star}_i = \begin{cases} 1 - \frac{i}{k+1} & \text{if } 1 \le i \le k,\\ 0 & \text{otherwise.}\end{cases}$$

The associated optimal value is given by

$$f_k(x^{k,\star}) = \frac{L}{8}\left(\left(\frac{1}{k+1}\right)^2 + \sum_{i=1}^{k-1}\frac{1}{(k+1)^2} + \left(1 - \frac{k}{k+1}\right)^2\right) = \frac{L}{8}\, \frac{k+1}{(k+1)^2} = \frac{L}{8}\, \frac{1}{k+1}.$$

Case $\mu > 0$. The solution can be written as

$$y_i = a r^i + b s^i \qquad\text{with}\qquad \begin{cases} a + b = 1,\\ a r^{k+1} + b s^{k+1} = 0.\end{cases}$$

Thus, we have $b = 1 - a$, hence (using $r = 1/s$) $\frac{a}{a-1} = s^{2(k+1)} > 0$, and in turn we have

$$a = \frac{s^{2(k+1)}}{s^{2(k+1)}-1} \qquad\text{and}\qquad b = \frac{1}{1 - s^{2(k+1)}}.$$

Hence,

$$y_i = \frac{s^{2(k+1)}}{s^{2(k+1)}-1}\, s^{-i} + \frac{1}{1 - s^{2(k+1)}}\, s^{i},$$

which is the claimed expression. ∎
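To double-check the closed form in the strongly convex case, one can plug it into the gradient sketched earlier (with arbitrary illustrative constants) and verify that it vanishes numerically:

```python
# Check that the closed-form x^{k,*} is a critical point of f_k^{L,mu} (mu > 0).
d, k, L, mu = 15, 9, 2.0, 0.1
s = (np.sqrt(L / mu) + 1.0) / (np.sqrt(L / mu) - 1.0)
i = np.arange(1, k + 1)
a = s ** (2 * (k + 1)) / (s ** (2 * (k + 1)) - 1.0)
b = 1.0 / (1.0 - s ** (2 * (k + 1)))
x_star = np.zeros(d)
x_star[:k] = a * s ** (-i) + b * s ** i
print(np.linalg.norm(worst_case_grad(x_star, k, L=L, mu=mu)))   # numerically zero
```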

We now turn to the proofs of Theorem 1 and Theorem 2.

Proof of Theorem 1.

We restrict our attention to the case where $x^{(0)} = 0$ w.l.o.g. Indeed, if $x^{(0)} \neq 0$, we can set $\tilde f : x \mapsto f(x + x^{(0)})$ and the following proof carries over. Let $d = 2k+1$ and set $f = f_{2k+1}^{L,0}$.

Remark that, since $x^{(t)} \in \operatorname{Span}\{e_1, \dots, e_t\}$,

$$f(x^{(t)}) = f_{2k+1}^{L,0}(x^{(t)}) = f_t^{L,0}(x^{(t)}) \ge f_t^\star,$$

where $f_t^\star$ denotes the minimal value of $f_t^{L,0}$.

Using Lemma 1, we have on one hand that $f_t^\star = \frac{L}{8(t+1)} \ge \frac{L}{8(k+1)}$ (since $t \le k$) and $f(x^\star) = \frac{L}{8 \cdot 2(k+1)}$, hence

$$f(x^{(t)}) - f(x^\star) \ge \frac{L}{8(k+1)} - \frac{L}{8 \cdot 2(k+1)} = \frac{L}{16(k+1)}.$$

On the other hand,

$$\begin{aligned}
\|x^{2k+1,\star} - x^{(0)}\|^2 = \|x^{2k+1,\star}\|^2 &= \sum_{i=1}^{2k+1} \left(x^{2k+1,\star}_i\right)^2 \\
&= \sum_{i=1}^{2k+1} \left(1 - \frac{i}{2(k+1)}\right)^2 \\
&= \sum_{i=1}^{2k+1} 1 - \frac{1}{k+1} \sum_{i=1}^{2k+1} i + \frac{1}{4(k+1)^2} \sum_{i=1}^{2k+1} i^2 \\
&= (2k+1) - \frac{1}{k+1} \cdot \frac{2(k+1)(2k+1)}{2} + \frac{1}{4(k+1)^2} \cdot \frac{2(k+1)(2k+1)(4k+3)}{6} \\
&= \frac{1}{3} \cdot \frac{(2k+1)(4k+3)}{4(k+1)} \\
&\le \frac{2k+1}{3} \le \frac{2}{3}(k+1).
\end{aligned}$$

Thus,

$$\frac{f(x^{(t)}) - f(x^\star)}{\|x^{2k+1,\star} - x^{(0)}\|^2} \ge \frac{\frac{L}{16(k+1)}}{\frac{2}{3}(k+1)} = \frac{3L}{32(k+1)^2},$$

which proves (2) upon taking $k = t$. ∎
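As an illustration (not part of the proof), one can compare the optimality gap of plain gradient descent after $t$ steps on $f_{2t+1}^{L,0}$ with the right-hand side of (2), reusing the helper functions sketched at the beginning of the post:

```python
# Gradient descent on f_{2t+1}^{L,0}: the gap after t steps stays above the bound (2).
L, t = 1.0, 10
d = 2 * t + 1
x_star = 1.0 - np.arange(1, d + 1) / (d + 1.0)        # minimizer of f_d^{L,0}
f_star = L / (8.0 * (d + 1))                          # optimal value from Lemma 1
x = np.zeros(d)
for _ in range(t):
    x = x - (1.0 / L) * worst_case_grad(x, d, L=L, mu=0.0)
gap = worst_case_f(x, d, L=L, mu=0.0) - f_star
bound = 3.0 * L * np.dot(x_star, x_star) / (32.0 * (t + 1) ** 2)
print(gap, bound)    # gap >= bound, as predicted by Theorem 1
```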

Proof of Theorem 2.

The proof follows the same strategy as before, but we start with a bound on the iterates instead of the objective function. Assume that $x^{(0)} = 0$; otherwise consider $\tilde f = f(\cdot + x^{(0)})$. Consider $d = 2k+1$ and $f = f_{2k+1}^{L,\mu}$. We rewrite the coordinates of $x^{2k+1,\star}$ (given by Lemma 1) as

$$x^{2k+1,\star}_i = \frac{s^{4(k+1)}}{s^{4(k+1)}-1}\, s^{-i} + \frac{1}{1 - s^{4(k+1)}}\, s^{i} = s^{-i}\left(1 - \frac{s^{2i}-1}{s^{4(k+1)}-1}\right).$$

On one hand, we have:

$$\|x^{(0)} - x^{2k+1,\star}\|^2 = \sum_{i=1}^{2k+1}\left(x^{2k+1,\star}_i\right)^2 = \sum_{i=1}^{2k+1} s^{-2i}\left(1 - \frac{s^{2i}-1}{s^{4(k+1)}-1}\right)^2 \le \sum_{i=1}^{2k+1} s^{-2i},$$

where we used that for all $1 \le i \le 2k+1$, we have

$$0 \le 1 - \frac{s^{2i}-1}{s^{4(k+1)}-1} \le 1.$$

Bounding the tail of the geometric sum (the last $k$ terms are no larger than the first $k$ ones), we obtain

$$\|x^{(0)} - x^{2k+1,\star}\|^2 \le 2\sum_{i=1}^{k+1} s^{-2i}. \qquad (6)$$

On the other hand, observe that for $t < k+1$, one has $x^{(t)} \in \operatorname{Span}\{e_1, \dots, e_k\}$, thus

$$\|x^{(t)} - x^{2k+1,\star}\|^2 \ge \sum_{i=k+1}^{2k+1}\left(x^{2k+1,\star}_i\right)^2 = \sum_{i=k+1}^{2k+1} s^{-2i}\left(1 - \frac{s^{2i}-1}{s^{4(k+1)}-1}\right)^2.$$

Since $s > 0$, we have

$$1 - \frac{s^{2i}-1}{s^{4(k+1)}-1} \ge 1 - \frac{s^{2(k+1)}-1}{s^{4(k+1)}-1},$$

and in turn,

$$1 \ge \left(1 - \frac{s^{2i}-1}{s^{4(k+1)}-1}\right)^2 \ge \left(1 - \frac{s^{2(k+1)}-1}{s^{4(k+1)}-1}\right)^2 \ge 0.$$

Thus,

$$\begin{aligned}
\|x^{(t)} - x^{2k+1,\star}\|^2 &\ge \left(1 - \frac{s^{2(k+1)}-1}{s^{4(k+1)}-1}\right)^2 \sum_{i=k+1}^{2k+1} s^{-2i} \\
&= \left(1 - \frac{s^{2(k+1)}-1}{s^{4(k+1)}-1}\right)^2 s^{-2k} \sum_{i=1}^{k+1} s^{-2i}. \qquad (7)
\end{aligned}$$

Observe that

$$\left(1 - \frac{s^{2(k+1)}-1}{s^{4(k+1)}-1}\right)^2 = \left(\frac{s^{2(k+1)}}{s^{2(k+1)}+1}\right)^2 \ge \frac{1}{4},$$

since $s \ge 1$.

Combining it with (6) and (7), we have

$$\|x^{(t)} - x^{2k+1,\star}\|^2 \ge \frac{1}{8}\, s^{-2k}\, \|x^{(0)} - x^{2k+1,\star}\|^2 = \frac{1}{8}\left(\frac{\sqrt{K_f}-1}{\sqrt{K_f}+1}\right)^{2k} \|x^{(0)} - x^{2k+1,\star}\|^2,$$

which, upon taking $k = t$, proves (4). The bound (5) on the objective values is obtained by applying the strong convexity inequality

$$f(x) \ge f(x^\star) + \frac{\mu}{2}\|x - x^\star\|^2$$

to $f$ at $x = x^{(t)}$. ∎
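The same kind of numerical illustration works for the strongly convex case: run gradient descent on $f_{2t+1}^{L,\mu}$ and compare the squared distance to the minimizer after $t$ steps with the right-hand side of (4) (again reusing the helpers defined earlier; the constants below are arbitrary):

```python
# Gradient descent on f_{2t+1}^{L,mu}: distance to the minimizer vs the bound (4).
L, mu, t = 1.0, 0.01, 8
d = 2 * t + 1
s = (np.sqrt(L / mu) + 1.0) / (np.sqrt(L / mu) - 1.0)
i = np.arange(1, d + 1)
a = s ** (2 * (d + 1)) / (s ** (2 * (d + 1)) - 1.0)
b = 1.0 / (1.0 - s ** (2 * (d + 1)))
x_star = a * s ** (-i) + b * s ** i                    # minimizer of f_d^{L,mu}
x = np.zeros(d)
for _ in range(t):
    x = x - (1.0 / L) * worst_case_grad(x, d, L=L, mu=mu)
lhs = np.dot(x - x_star, x - x_star)
rhs = (1.0 / 8.0) * s ** (-2 * t) * np.dot(x_star, x_star)
print(lhs, rhs)     # lhs >= rhs, as predicted by Theorem 2
```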

More refined versions of these lower bounds, which provide tighter constants, can be found in Drori and Taylor (2022). However, these improvements come at the price of significantly more involved and technical proofs.

References