Introduction to lower bounds in optimization

June 12, 2025

This content is available as a PDF here.

Most of the bounds described in the optimization literature are upper bounds of the form $\|x^{(t)} - x^\star\| \le \alpha(t)$ or $f(x^{(t)}) - f(x^\star) \le \alpha(t)$. But what about finding the reverse-side inequality $\|x^{(t)} - x^\star\| \ge \beta(t)$? In other words, what is the best we can hope to achieve with a “gradient-descent-like” algorithm?

To formalize this notion, we consider algorithms, here sequences $(x^{(t)})_{t \ge 0}$, that build upon the previous iterates with access only to a first-order oracle:

  • Assumption 1 (First-order method). We assume that a first-order method is given by a sequence $(x^{(t)})_{t \ge 0}$ such that

    $$x^{(t)} \in x^{(0)} + \operatorname{Span}\left\{\nabla f(x^{(0)}), \dots, \nabla f(x^{(t-1)})\right\}.$$

Note that one can think of more general ways to define first-order methods, but for the results we aim to prove, this level of generality is enough.
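To make the assumption concrete, here is a minimal sketch of a scheme satisfying Assumption 1 (the helper name `first_order_method` and the coefficient parametrization are mine, not from the note), with plain gradient descent as a special case.

```python
import numpy as np

def first_order_method(grad_f, x0, coeffs):
    """Iterates x^(t) = x^(0) + sum_{s < t} coeffs[t-1][s] * grad_f(x^(s)).

    By construction every iterate lies in
    x^(0) + Span{grad_f(x^(0)), ..., grad_f(x^(t-1))}, i.e. Assumption 1 holds.
    """
    x0 = np.asarray(x0, dtype=float)
    xs, grads = [x0], []
    for c in coeffs:                      # c = (c_0, ..., c_{t-1}) for iteration t
        grads.append(grad_f(xs[-1]))
        xs.append(x0 + sum(ci * gi for ci, gi in zip(c, grads)))
    return xs

# Gradient descent x^(t+1) = x^(t) - eta * grad_f(x^(t)) fits this template:
# unrolling the recursion gives x^(t) = x^(0) - eta * sum_{s < t} grad_f(x^(s)),
# i.e. coeffs[t-1] = (-eta, ..., -eta).
```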

With this assumption in mind, how can we design a function that is adversarial to this type of scheme? The idea is to find a function such that the gradient at step $t-1$ gives minimal information, i.e., it has as few nonzero partial derivatives as possible. A way to define such a function is to “stack” quadratic functions with increasing dependencies between the variables:

$$f_k^{L,\mu}(x) = \frac{L-\mu}{8}\left((x_1 - 1)^2 + \sum_{i=1}^{k-1}(x_{i+1} - x_i)^2 + x_k^2\right) + \frac{\mu}{2}\|x\|^2, \tag{1}$$

where $0 \le \mu < L$ and $1 \le k \le d$. Its partial derivatives read

$$\frac{\partial f_k^{L,\mu}}{\partial x_i}(x) = \mu x_i + \frac{L-\mu}{4}\begin{cases} -x_2 + 2x_1 - 1 & \text{if } i = 1,\\ -x_{i+1} + 2x_i - x_{i-1} & \text{if } 2 \le i < k,\\ 2x_k - x_{k-1} & \text{if } i = k,\\ 0 & \text{otherwise.}\end{cases}$$
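As a sanity check on these formulas, here is a small NumPy sketch of $f_k^{L,\mu}$ and its gradient (the names `f_k` and `grad_f_k` are mine, and the snippet assumes $2 \le k \le d$); the finite-difference test at the end is only an illustration.

```python
import numpy as np

def f_k(x, k, L, mu):
    """Worst-case function (1): (L-mu)/8 * ((x_1 - 1)^2
    + sum_{i=1}^{k-1} (x_{i+1} - x_i)^2 + x_k^2) + mu/2 * ||x||^2."""
    x = np.asarray(x, dtype=float)
    quad = (x[0] - 1.0) ** 2 + np.sum(np.diff(x[:k]) ** 2) + x[k - 1] ** 2
    return (L - mu) / 8.0 * quad + mu / 2.0 * np.dot(x, x)

def grad_f_k(x, k, L, mu):
    """Gradient of f_k, following the coordinate-wise formula above (2 <= k <= len(x))."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    g[0] = 2 * x[0] - 1.0 - x[1]
    for i in range(1, k - 1):             # 0-based middle coordinates 2 <= i < k
        g[i] = -x[i + 1] + 2 * x[i] - x[i - 1]
    g[k - 1] = 2 * x[k - 1] - x[k - 2]
    return mu * x + (L - mu) / 4.0 * g

# Finite-difference sanity check at a random point.
rng = np.random.default_rng(0)
d, k, L, mu = 10, 6, 1.0, 0.1
x = rng.standard_normal(d)
num = np.array([(f_k(x + 1e-6 * e, k, L, mu) - f_k(x - 1e-6 * e, k, L, mu)) / 2e-6
                for e in np.eye(d)])
print(np.allclose(num, grad_f_k(x, k, L, mu), atol=1e-5))   # expected: True
```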

Set $f = f_k^{L,\mu}$. Observe that if we start from $x^{(0)} = 0$, then

$$x^{(1)} = x^{(0)} - \eta\, \nabla f(0) = \eta\, \frac{L-\mu}{4}\, e_1 \in \mathbb{R} e_1,$$

that is, only the first coordinate is updated after one iteration. What happens now that we have access to $x^{(0)}$, $\nabla f_k(x^{(0)})$ and $\nabla f_k(x^{(1)})$? For an algorithm satisfying Assumption 1, we look at

$$x^{(2)} = x^{(0)} + \alpha\, \nabla f(x^{(0)}) + \beta\, \nabla f(x^{(1)}).$$

One can check that for any $(\alpha, \beta)$, $x^{(2)} \in \mathbb{R}e_1 + \mathbb{R}e_2$, and by an easy induction, we have $x^{(t)} \in \bigoplus_{k=1}^{t} \mathbb{R}e_k$: any first-order method will only be able to update at most one new coordinate at each iteration. We are going to prove the following result.
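Before stating it, here is a short numerical illustration of this coordinate-by-coordinate progression (plain gradient descent stands in for a generic first-order method; indices are 0-based, and the helper name `grad_f_k` is mine).

```python
import numpy as np

def grad_f_k(x, k, L=1.0, mu=0.0):
    """Gradient of f_k^{L,mu}, same coordinate formula as above (2 <= k <= len(x))."""
    g = np.zeros_like(x)
    g[0] = 2 * x[0] - 1.0 - x[1]
    for i in range(1, k - 1):
        g[i] = -x[i + 1] + 2 * x[i] - x[i - 1]
    g[k - 1] = 2 * x[k - 1] - x[k - 2]
    return mu * x + (L - mu) / 4.0 * g

d, k, L, eta = 20, 20, 1.0, 1.0
x = np.zeros(d)                              # start at x^(0) = 0
for t in range(1, 6):
    x = x - eta / L * grad_f_k(x, k, L)      # plain gradient descent
    support = np.flatnonzero(np.abs(x) > 1e-12)
    print(t, support)                        # support stays within the first t coordinates
```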

  • Theorem 1 (Lower bound for smooth convex optimization). For any $d \ge 2$, $x^{(0)} \in \mathbb{R}^d$, $L > 0$ and $t \le (d-1)/2$, there exists a convex function $f$ that is $C^\infty$ and $L$-smooth such that any sequence satisfying Assumption 1 satisfies

    $$f(x^{(t)}) - f(x^\star) \ge \frac{3L\, \|x^{(0)} - x^\star\|^2}{32(t+1)^2}, \tag{2}$$

    where $x^\star$ is a minimizer of $f$.

Remark that the rate $1/t^2$ is not achieved by gradient descent¹! We also have a lower bound for the class of strongly convex functions.

¹ There exist algorithms that achieve it, in particular Nesterov’s acceleration method.
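As a complement to the footnote, here is a minimal sketch of one standard variant of Nesterov's accelerated gradient for an $L$-smooth convex function. The specific momentum schedule used here (FISTA-style) is my choice, not taken from this note; its objective gap decreases as $O(1/t^2)$, matching the order of the lower bound (2) up to constants.

```python
import numpy as np

def nesterov(grad_f, x0, L, n_iter):
    """One standard variant of Nesterov's accelerated gradient for an
    L-smooth convex function f; the objective gap decays as O(1/t^2)."""
    x = y = np.asarray(x0, dtype=float)
    t_prev = 1.0
    for _ in range(n_iter):
        x_new = y - grad_f(y) / L                          # gradient step at the extrapolated point
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t_prev ** 2)) / 2.0
        y = x_new + (t_prev - 1.0) / t_new * (x_new - x)   # momentum / extrapolation
        x, t_prev = x_new, t_new
    return x
```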

  • Theorem 2 (Lower bound for smooth strongly convex optimization). For any $d \ge 2$, $x^{(0)} \in \mathbb{R}^d$ and $L > \mu > 0$, there exists a $\mu$-strongly convex function $f$ that is $C^\infty$ and $L$-smooth such that any sequence satisfying Assumption 1 satisfies, for all $t < (d-1)/2$,

    $$\|x^{(t)} - x^\star\|^2 \ge \frac{1}{8}\left(\frac{\sqrt{K_f} - 1}{\sqrt{K_f} + 1}\right)^{2t}\|x^{(0)} - x^\star\|^2, \tag{4}$$

    $$f(x^{(t)}) - f(x^\star) \ge \frac{\mu}{16}\left(\frac{\sqrt{K_f} - 1}{\sqrt{K_f} + 1}\right)^{2t}\|x^{(0)} - x^\star\|^2, \tag{5}$$

    where $x^\star$ is the unique minimizer of $f$ and $K_f = L/\mu$ is its condition number.
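To get a feel for the geometric factor in (4), the small computation below compares it with the per-iteration contraction of squared distances for gradient descent with step size $2/(L+\mu)$, namely $\left(\frac{K_f-1}{K_f+1}\right)^2$; this classical gradient-descent rate is quoted only for comparison and is not derived in this note.

```python
import numpy as np

for K in [10.0, 100.0, 1000.0]:
    lower = ((np.sqrt(K) - 1) / (np.sqrt(K) + 1)) ** 2   # per-iteration factor in (4)
    gd    = ((K - 1) / (K + 1)) ** 2                     # classical GD factor, step 2/(L+mu)
    print(f"K={K:7.0f}  lower-bound factor={lower:.4f}  gradient-descent factor={gd:.4f}")
```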

Note that it is common in the literature to see Theorem 2 stated without the factor $\frac{1}{8}$. This factor is an artefact of our proof, since we prove the result in the finite-dimensional case whereas Nesterov (2018) works in the infinite-dimensional space $\ell^2(\mathbb{N})$. Before proving these important results, due to Nemirovski and Yudin (1983), we are going to prove several lemmas.

  • Lemma 1 (Minimizers of $f_k$). Let $d \ge 2$, $L > 0$ and $\mu \ge 0$. Then $f_k^{L,\mu}$ defined in (1) is a $\mu$-strongly convex (merely convex if $\mu = 0$) $C^\infty$ function whose gradient is $L$-Lipschitz.

    If $\mu = 0$, it admits a minimizer $x^{k,\star}$ given by

    $$x_i^{k,\star} = \begin{cases} 1 - \dfrac{i}{k+1} & \text{if } 1 \le i \le k,\\ 0 & \text{otherwise,}\end{cases} \qquad\text{and}\qquad f_k^{L,0}(x^{k,\star}) = \frac{L}{8(k+1)}.$$

    If $\mu > 0$, it has a unique minimizer $x^{k,\star}$ satisfying

    $$x_i^{k,\star} = \frac{s^{2(k+1)}}{s^{2(k+1)} - 1}\, s^{-i} + \frac{1}{1 - s^{2(k+1)}}\, s^{i}$$

    for $1 \le i \le k$, and $x_i^{k,\star} = 0$ for $i > k$, where $s = \frac{\sqrt{K_f} + 1}{\sqrt{K_f} - 1}$.
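Before the proof, the closed forms of Lemma 1 can be checked numerically; the sketch below (helper name `hessian_f_k` is mine) solves the stationarity equation for $\mu > 0$ and plugs the affine profile into $f_k$ for $\mu = 0$.

```python
import numpy as np

def hessian_f_k(d, k, L, mu):
    """Constant Hessian of f_k^{L,mu}: mu * Id + (L - mu)/4 * L_k."""
    T = 2 * np.eye(k) - np.diag(np.ones(k - 1), 1) - np.diag(np.ones(k - 1), -1)
    H = mu * np.eye(d)
    H[:k, :k] += (L - mu) / 4.0 * T
    return H

d, k, L, mu = 12, 7, 2.0, 0.2

# mu > 0: solve grad f_k(x) = 0, i.e. H x = (L - mu)/4 * e_1, and compare with the formula.
rhs = np.zeros(d)
rhs[0] = (L - mu) / 4.0
x_solve = np.linalg.solve(hessian_f_k(d, k, L, mu), rhs)
s = (np.sqrt(L / mu) + 1) / (np.sqrt(L / mu) - 1)
i = np.arange(1, k + 1)
x_formula = np.zeros(d)
x_formula[:k] = s ** (2 * (k + 1)) / (s ** (2 * (k + 1)) - 1) * s ** (-i) \
                + s ** i / (1 - s ** (2 * (k + 1)))
print(np.allclose(x_solve, x_formula))                     # expected: True

# mu = 0: plug x_i = 1 - i/(k+1) into f_k^{L,0} and compare with L / (8(k+1)).
x0 = np.concatenate([1 - i / (k + 1), np.zeros(d - k)])
quad = (x0[0] - 1) ** 2 + np.sum(np.diff(x0[:k]) ** 2) + x0[k - 1] ** 2
print(np.isclose(L / 8 * quad, L / (8 * (k + 1))))         # expected: True
```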

  • Proof. We drop the exponents $L, \mu$ in the definition of $f_k = f_k^{L,\mu}$. The function $f_k$ being quadratic, it is $C^\infty$ and its second-order partial derivatives read

    $$\frac{\partial^2 f_k}{\partial x_i \partial x_j}(x) = \mu\, \mathbf{1}_{\{i=j\}} + \frac{L-\mu}{4}\begin{cases} 2 & \text{if } i = j \le k,\\ -1 & \text{if } j = i-1 \text{ and } 1 < i \le k,\\ -1 & \text{if } j = i+1 \text{ and } 1 \le i < k,\\ 0 & \text{otherwise.}\end{cases}$$

    Thus, the Hessian matrix is given (for any $x \in \mathbb{R}^d$) by

    $$\nabla^2 f_k(x) = \mu\, \mathrm{Id}_d + \frac{L-\mu}{4}\, L_k,$$

    where $L_k$ is a (thresholded) discrete Laplacian operator with Dirichlet boundary conditions, namely the block matrix whose upper-left $k \times k$ block is tridiagonal:

    $$L_k = \begin{pmatrix} \begin{pmatrix} 2 & -1 & & 0\\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1\\ 0 & & -1 & 2 \end{pmatrix} & 0_{k,\, d-k}\\ 0_{d-k,\, k} & 0_{d-k,\, d-k} \end{pmatrix}.$$

    Observe that we have (since $f_k$ is a quadratic function)

    $$f_k(x) = \frac{1}{2}\left\langle \nabla^2 f_k(x)\, x, x\right\rangle - \frac{L-\mu}{4}\, x_1 + \frac{L-\mu}{8}.$$

    Note that:

    • 1. The Hessian is positive definite (resp. positive semi-definite) if $\mu > 0$ (resp. $\mu = 0$). Indeed,

      $$\left\langle \nabla^2 f_k(x)\, h, h\right\rangle = \mu \|h\|^2 + \frac{L-\mu}{4}\left\langle L_k h, h\right\rangle.$$

      Since $\langle L_k h, h\rangle = h_1^2 + \sum_{i=1}^{k-1}(h_{i+1} - h_i)^2 + h_k^2 \ge 0$ for any $h$, the result follows depending on the value of $\mu$.

    • 2. Since $(a-b)^2 \le 2a^2 + 2b^2$, we have

      $$h_1^2 + \sum_{i=1}^{k-1}(h_{i+1} - h_i)^2 + h_k^2 \le h_1^2 + \sum_{i=1}^{k-1}\left(2h_{i+1}^2 + 2h_i^2\right) + h_k^2 \le 4\sum_{i=1}^{k} h_i^2 \le 4\sum_{i=1}^{d} h_i^2 = 4\|h\|^2.$$

    Hence,

    $$\left\langle \nabla^2 f_k(x)\, h, h\right\rangle \le \mu\|h\|^2 + (L-\mu)\|h\|^2 = L\|h\|^2.$$

    Thus, we have $\mu\,\mathrm{Id} \preceq \nabla^2 f_k(x) \preceq L\,\mathrm{Id}$.

    Let us now characterize the solution $x^{k,\star}$ of the minimization of $f_k$ over $\mathbb{R}^d$. We aim to solve $\nabla f_k(x^{k,\star}) = 0$ to find a critical point (which is a minimum since we just proved that the Hessian is positive semi-definite), that is,

    $$\mu x^{k,\star} + \frac{L-\mu}{4}\, L_k x^{k,\star} - \frac{L-\mu}{4}\, e_1 = 0.$$

    Projecting this relation onto each coordinate $2 \le i \le k-1$, we get that

    $$-x_{i-1}^{k,\star} + 2x_i^{k,\star} - x_{i+1}^{k,\star} = -\frac{4\mu}{L-\mu}\, x_i^{k,\star},$$

    which leads to

    $$x_i^{k,\star} = \frac{1}{2}\,\frac{L-\mu}{L+\mu}\left(x_{i+1}^{k,\star} + x_{i-1}^{k,\star}\right).$$

    Similarly, we have

    $$x_1^{k,\star} = \frac{1}{2}\,\frac{L-\mu}{L+\mu}\left(x_2^{k,\star} + 1\right) \qquad\text{and}\qquad x_k^{k,\star} = \frac{1}{2}\,\frac{L-\mu}{L+\mu}\, x_{k-1}^{k,\star}.$$

    Consider $y_0, \dots, y_{k+1}$ defined by $y_i = x_i^{k,\star}$ for $1 \le i \le k$, $y_0 = 1$ and $y_{k+1} = 0$. We have the relation

    $$y_i = \alpha\left(y_{i+1} + y_{i-1}\right) \qquad\text{where}\qquad \alpha = \frac{1}{2}\,\frac{L-\mu}{L+\mu} > 0.$$

    We can rewrite it as the second-order linear recursion $y_{i+2} - \alpha^{-1} y_{i+1} + y_i = 0$. The associated characteristic polynomial is $P = X^2 - \alpha^{-1} X + 1 \in \mathbb{R}[X]$, whose discriminant is given by

    $$\Delta = (\alpha^{-1})^2 - 4 = \frac{16\, L\mu}{(L-\mu)^2}.$$

    We distinguish two cases:

    • 1. If $\mu = 0$, then $\Delta = 0$ and the unique (double) root is $r = 1$.

    • 2. If $\mu > 0$, then the roots are given by

      $$r = \frac{1}{2}\left(\alpha^{-1} - \sqrt{\Delta}\right) = \frac{\sqrt{L/\mu} - 1}{\sqrt{L/\mu} + 1} = \frac{\sqrt{K_f} - 1}{\sqrt{K_f} + 1} \qquad\text{and}\qquad s = \frac{1}{2}\left(\alpha^{-1} + \sqrt{\Delta}\right) = \frac{\sqrt{K_f} + 1}{\sqrt{K_f} - 1} = \frac{1}{r}.$$

    Case $\mu = 0$. Since $r = 1$ is a double root, we have the affine relation $y_i = (a + bi)\, r^i = a + bi$ with constraints $y_0 = a = 1$ and $y_{k+1} = a + b(k+1) = 0$. In turn, we have $y_i = 1 - \frac{i}{k+1}$ and thus

    $$x_i^{k,\star} = \begin{cases} 1 - \dfrac{i}{k+1} & \text{if } 1 \le i \le k,\\ 0 & \text{otherwise.}\end{cases}$$

    The associated optimal value is given by

    $$f_k(x^{k,\star}) = \frac{L}{8}\left(\left(\frac{1}{k+1}\right)^2 + \sum_{i=1}^{k-1}\frac{1}{(k+1)^2} + \left(1 - \frac{k}{k+1}\right)^2\right) = \frac{L}{8}\,\frac{k+1}{(k+1)^2} = \frac{L}{8}\,\frac{1}{k+1}.$$

    Case $\mu > 0$. The solution can be written as

    $$y_i = a\, r^i + b\, s^i \qquad\text{with}\qquad \begin{cases} a + b = 1,\\ a\, r^{k+1} + b\, s^{k+1} = 0.\end{cases}$$

    Thus, we have $b = 1 - a$, hence (using $r = 1/s$) $\frac{a}{a-1} = s^{2(k+1)} > 0$, and in turn we have

    $$a = \frac{s^{2(k+1)}}{s^{2(k+1)} - 1} \qquad\text{and}\qquad b = \frac{1}{1 - s^{2(k+1)}}.$$

    Hence,

    $$y_i = \frac{s^{2(k+1)}}{s^{2(k+1)} - 1}\, s^{-i} + \frac{1}{1 - s^{2(k+1)}}\, s^{i}.$$
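The two ingredients of this proof, the bound $\mu\,\mathrm{Id} \preceq \nabla^2 f_k \preceq L\,\mathrm{Id}$ and the closed form for $y_i$, can be checked numerically with the short sketch below (the variable names and parameter values are mine).

```python
import numpy as np

L, mu, d, k = 3.0, 0.3, 15, 9

# Hessian spectrum: all eigenvalues of mu*Id + (L-mu)/4 * L_k lie in [mu, L].
T = 2 * np.eye(k) - np.diag(np.ones(k - 1), 1) - np.diag(np.ones(k - 1), -1)
H = mu * np.eye(d)
H[:k, :k] += (L - mu) / 4.0 * T
eig = np.linalg.eigvalsh(H)
print(eig.min() >= mu - 1e-12, eig.max() <= L + 1e-12)      # expected: True True

# Closed form of the recursion: y_i = a*s**(-i) + b*s**i with y_0 = 1, y_{k+1} = 0
# satisfies y_i = alpha * (y_{i+1} + y_{i-1}) for 1 <= i <= k.
alpha = 0.5 * (L - mu) / (L + mu)
s = (np.sqrt(L / mu) + 1) / (np.sqrt(L / mu) - 1)
a = s ** (2 * (k + 1)) / (s ** (2 * (k + 1)) - 1)
b = 1.0 / (1 - s ** (2 * (k + 1)))
idx = np.arange(k + 2, dtype=float)
y = a * s ** (-idx) + b * s ** idx
print(np.isclose(y[0], 1.0), np.isclose(y[k + 1], 0.0),
      np.allclose(y[1:k + 1], alpha * (y[2:k + 2] + y[0:k])))   # expected: True True True
```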

We now turn to the proofs of Theorem 1 and Theorem 2.

  • Proof of Theorem 1. We restrict our attention to the case where $x^{(0)} = 0$ w.l.o.g. Indeed, if $x^{(0)} \neq 0$, we can consider $x \mapsto \tilde{f}(x) = f(x + x^{(0)})$ and the following proof carries over. Set $k = t$, let $d = 2k+1$ and set $f = f_{2k+1}^{L,0}$.

    Remark that, since $x^{(t)} \in \operatorname{Span}\{e_1, \dots, e_t\}$,

    $$f(x^{(t)}) = f_{2k+1}^{L,0}(x^{(t)}) = f_t^{L,0}(x^{(t)}) \ge f_t^{\star},$$

    where $f_t^\star$ denotes the minimal value of $f_t^{L,0}$.

    Using Lemma 1, we have on the one hand that $f_t^\star = \frac{L}{8(t+1)}$ and $f(x^\star) = f_{2k+1}^\star = \frac{L}{8 \cdot 2(k+1)}$, and then

    $$f(x^{(t)}) - f(x^\star) \ge \frac{L}{8(k+1)} - \frac{L}{8 \cdot 2(k+1)} = \frac{L}{16(k+1)}.$$

    On the other hand,

    $$\begin{aligned}
    \|x^{2k+1,\star} - x^{(0)}\|^2 &= \|x^{2k+1,\star}\|^2 = \sum_{i=1}^{2k+1}\left(x^{2k+1,\star}_i\right)^2 = \sum_{i=1}^{2k+1}\left(1 - \frac{i}{2(k+1)}\right)^2\\
    &= (2k+1) - \frac{1}{k+1}\sum_{i=1}^{2k+1} i + \frac{1}{4(k+1)^2}\sum_{i=1}^{2k+1} i^2\\
    &= (2k+1) - \frac{1}{k+1}\,\frac{2(k+1)(2k+1)}{2} + \frac{1}{4(k+1)^2}\,\frac{2(k+1)(4k+3)(2k+1)}{6}\\
    &= \frac{1}{3}\,\frac{(2k+1)(4k+3)}{4(k+1)} \le \frac{2k+1}{3} \le \frac{2}{3}(k+1).
    \end{aligned}$$

    Thus,

    $$\frac{f(x^{(t)}) - f(x^\star)}{\|x^{2k+1,\star} - x^{(0)}\|^2} \ge \frac{L}{16(k+1)}\,\frac{3}{2(k+1)} = \frac{3L}{32(k+1)^2},$$

    which proves (2), since $k = t$ and $x^\star = x^{2k+1,\star}$.
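As an illustration of Theorem 1 (not a proof), the sketch below runs plain gradient descent, one admissible method under Assumption 1, for $t$ steps on $f_{2t+1}^{L,0}$ and compares the resulting gap with the right-hand side of (2); all helper names and parameter values are mine.

```python
import numpy as np

L, t = 1.0, 10
m = 2 * t + 1                 # worst-case function f_m^{L,0} with m = 2t+1, in dimension d = m
d = m

def grad(x):                  # gradient of f_m^{L,0}, following the coordinate formula above
    g = np.zeros_like(x)
    g[0] = 2 * x[0] - 1.0 - x[1]
    for j in range(1, m - 1):
        g[j] = -x[j + 1] + 2 * x[j] - x[j - 1]
    g[m - 1] = 2 * x[m - 1] - x[m - 2]
    return L / 4.0 * g

def f(x):                     # f_m^{L,0} itself
    return L / 8.0 * ((x[0] - 1.0) ** 2 + np.sum(np.diff(x) ** 2) + x[-1] ** 2)

idx = np.arange(1, m + 1)
x_star = 1.0 - idx / (m + 1)  # minimizer from Lemma 1 (mu = 0 case)
f_star = L / (8 * (m + 1))

x = np.zeros(d)               # x^(0) = 0
for _ in range(t):
    x = x - grad(x) / L       # plain gradient descent

gap = f(x) - f_star
bound = 3 * L * np.sum(x_star ** 2) / (32 * (t + 1) ** 2)
print(gap >= bound, gap, bound)   # the observed gap sits above the lower bound (2)
```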

  • Proof of Theorem 2. The proof follows the same strategy as before, but we start with a bound on the iterates instead of the objective values. Assume that $x^{(0)} = 0$, otherwise consider $\tilde{f} = f(\cdot + x^{(0)})$. Consider $d = 2k+1$ and $f = f_{2k+1}^{L,\mu}$. Using Lemma 1, we rewrite the coordinates of $x^{2k+1,\star}$ as

    $$x_i^{2k+1,\star} = \frac{s^{4(k+1)}}{s^{4(k+1)} - 1}\, s^{-i} + \frac{1}{1 - s^{4(k+1)}}\, s^{i} = s^{-i}\left(1 - \frac{s^{2i} - 1}{s^{4(k+1)} - 1}\right).$$

    On the one hand, we have

    $$\|x^{(0)} - x^{2k+1,\star}\|^2 = \sum_{i=1}^{2k+1}\left(x_i^{2k+1,\star}\right)^2 = \sum_{i=1}^{2k+1} s^{-2i}\left(1 - \frac{s^{2i} - 1}{s^{4(k+1)} - 1}\right)^2 \le \sum_{i=1}^{2k+1} s^{-2i},$$

    where we used that for all $1 \le i \le 2k+1$, we have

    $$0 \le 1 - \frac{s^{2i} - 1}{s^{4(k+1)} - 1} \le 1.$$

    Bounding the tail of the geometric sum, we obtain

    $$\|x^{(0)} - x^{2k+1,\star}\|^2 \le 2\sum_{i=1}^{k+1} s^{-2i}. \tag{6}$$

    On the other hand, observe that for $t < k+1$, one has $x^{(t)} \in \operatorname{Span}\{e_1, \dots, e_k\}$, thus

    $$\|x^{(t)} - x^{2k+1,\star}\|^2 \ge \sum_{i=k+1}^{2k+1}\left(x_i^{2k+1,\star}\right)^2 = \sum_{i=k+1}^{2k+1} s^{-2i}\left(1 - \frac{s^{2i} - 1}{s^{4(k+1)} - 1}\right)^2.$$

    Since $s > 1$, we have

    $$1 - \frac{s^{2i} - 1}{s^{4(k+1)} - 1} \ge 1 - \frac{s^{2(k+1)} - 1}{s^{4(k+1)} - 1},$$

    and in turn,

    $$1 \ge \left(1 - \frac{s^{2i} - 1}{s^{4(k+1)} - 1}\right)^2 \ge \left(1 - \frac{s^{2(k+1)} - 1}{s^{4(k+1)} - 1}\right)^2 \ge 0.$$

    Thus,

    $$\|x^{(t)} - x^{2k+1,\star}\|^2 \ge \left(1 - \frac{s^{2(k+1)} - 1}{s^{4(k+1)} - 1}\right)^2 \sum_{i=k+1}^{2k+1} s^{-2i} = \left(1 - \frac{s^{2(k+1)} - 1}{s^{4(k+1)} - 1}\right)^2 s^{-2k}\sum_{i=1}^{k+1} s^{-2i}. \tag{7}$$

    Observe that

    $$\left(1 - \frac{s^{2(k+1)} - 1}{s^{4(k+1)} - 1}\right)^2 = \left(\frac{s^{2(k+1)}}{s^{2(k+1)} + 1}\right)^2 \ge \frac{1}{4}.$$

    Combining this with (6) and (7), we have

    $$\|x^{(t)} - x^{2k+1,\star}\|^2 \ge \frac{1}{8}\, s^{-2k}\, \|x^{(0)} - x^{2k+1,\star}\|^2 \ge \frac{1}{8}\left(\frac{\sqrt{K_f} - 1}{\sqrt{K_f} + 1}\right)^{2t}\|x^{(0)} - x^{2k+1,\star}\|^2,$$

    proving (4). The bound (5) on objective values is obtained by applying the inequality

    $$f(x) \ge f(x^\star) + \frac{\mu}{2}\|x - x^\star\|^2,$$

    valid for any $\mu$-strongly convex function, to $f$ at $x = x^{(t)}$.
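Finally, a numerical illustration of (4) (the check for (5) is analogous): run gradient descent with the classical step size $2/(L+\mu)$ for $t = k$ iterations on $f_{2k+1}^{L,\mu}$ and compare with the right-hand side. The helper names and parameter values are mine, and this step-size choice is just one admissible first-order method.

```python
import numpy as np

L, mu, k = 1.0, 0.01, 15
d, t = 2 * k + 1, k                      # f = f_{2k+1}^{L,mu}, inspected at iteration t = k

def grad(x):                             # gradient of f_d^{L,mu} (the index of f equals d here)
    g = np.zeros_like(x)
    g[0] = 2 * x[0] - 1.0 - x[1]
    for j in range(1, d - 1):
        g[j] = -x[j + 1] + 2 * x[j] - x[j - 1]
    g[d - 1] = 2 * x[d - 1] - x[d - 2]
    return mu * x + (L - mu) / 4.0 * g

# Minimizer of f_{2k+1}^{L,mu} from Lemma 1 (with the lemma's index equal to d = 2k+1).
s = (np.sqrt(L / mu) + 1) / (np.sqrt(L / mu) - 1)
i = np.arange(1, d + 1)
x_star = s ** (2 * (d + 1)) / (s ** (2 * (d + 1)) - 1) * s ** (-i) \
         + s ** i / (1 - s ** (2 * (d + 1)))

x = np.zeros(d)                          # x^(0) = 0
for _ in range(t):
    x = x - 2.0 / (L + mu) * grad(x)     # gradient descent with step size 2/(L+mu)

r = (np.sqrt(L / mu) - 1) / (np.sqrt(L / mu) + 1)
lhs = np.sum((x - x_star) ** 2)
rhs = r ** (2 * t) * np.sum(x_star ** 2) / 8.0    # right-hand side of (4) with x^(0) = 0
print(lhs >= rhs, lhs, rhs)
```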

More refined versions of these lower bounds, which provide tighter constants, can be found in Drori and Taylor (2022). However, these improvements come at the price of significantly more involved and technical proofs.

References

  • Drori, Yoel and Adrien Taylor (Feb. 1, 2022). “On the Oracle Complexity of Smooth Strongly Convex Minimization”. In: Journal of Complexity 68, p. 101590.

  • Nemirovski, Arkadi and David Yudin (1983). Problem Complexity and Method Efficiency in Optimization. Wiley. 388 pp.

  • Nesterov, Yurii (2018). Lectures on Convex Optimization. Vol. 137. Springer.