Introduction to lower bounds in optimization

by Samuel Vaiter

Most of the bounds described in the optimization literature are upper bounds of the form $\|x^{(t)} - x^\star\| \le \alpha(t)$ or $f(x^{(t)}) - f(x^\star) \le \alpha(t)$. But what about finding the reverse inequality $\|x^{(t)} - x^\star\| \ge \beta(t)$? Said otherwise, what can we hope to achieve with a “gradient-descent-like” algorithm?

To formalize this notion, we consider algorithms, here sequences $(x^{(t)})_{t \ge 0}$, that build upon the previous iterates with access only to a first-order oracle:

Assumption 1 (First-order method).

We assume that a first-order method is given by a sequence $(x^{(t)})_{t \ge 0}$ such that

$$x^{(t)} \in x^{(0)} + \operatorname{Span}\left\{\nabla f(x^{(0)}), \dots, \nabla f(x^{(t-1)})\right\}.$$

Note that one can think of more general ways to define first-order methods, but for the results we aim to prove, this level of generality is enough.

With this assumption in mind, how can we design a function that is adversarial to this type of scheme? The idea is to find a function such that the gradient at step $t-1$ gives minimal information, i.e., it has as few nonzero partial derivatives as possible. A way to define such a function is to “stack” quadratic terms with increasing dependencies between variables:

$$f_k^{L,\mu}(x) = \frac{L-\mu}{8}\left((x_1 - 1)^2 + \sum_{i=1}^{k-1}(x_{i+1} - x_i)^2 + x_k^2\right) + \frac{\mu}{2}\|x\|^2, \qquad (1)$$

where $0 \le \mu < L$ and $0 \le k \le d$.

The partial derivatives of $f_k^{L,\mu}$ read

$$\frac{\partial f_k^{L,\mu}}{\partial x_i}(x) = \mu x_i + \frac{L-\mu}{4}\begin{cases} -x_2 + 2x_1 - 1 & \text{if } i = 1,\\ -x_{i+1} + 2x_i - x_{i-1} & \text{if } 2 \le i < k,\\ 2x_k - x_{k-1} & \text{if } i = k,\\ 0 & \text{otherwise.}\end{cases}$$
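To make the construction concrete, here is a small numerical sketch of $f_k^{L,\mu}$ and of the gradient above (NumPy-based; the helper names `worst_case_f` and `worst_case_grad` are mine and purely illustrative):

```python
import numpy as np

def worst_case_f(x, k, L=1.0, mu=0.0):
    """Value of f_k^{L,mu}(x) as in (1); x is a 1-D array with len(x) >= k."""
    quad = (x[0] - 1.0) ** 2 + np.sum((x[1:k] - x[:k - 1]) ** 2) + x[k - 1] ** 2
    return (L - mu) / 8.0 * quad + mu / 2.0 * np.dot(x, x)

def worst_case_grad(x, k, L=1.0, mu=0.0):
    """Gradient of f_k^{L,mu} at x, following the partial derivatives above."""
    g = mu * x.copy()
    c = (L - mu) / 4.0
    for i in range(k):               # 0-based index i corresponds to x_{i+1}
        g[i] += c * 2.0 * x[i]
        if i > 0:
            g[i] -= c * x[i - 1]
        if i < k - 1:
            g[i] -= c * x[i + 1]
    g[0] -= c                        # the constant coming from the (x_1 - 1)^2 term
    return g
```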

Set $f = f_k^{L,\mu}$. Observe that if we start from $x^{(0)} = 0$, then

$$x^{(1)} = x^{(0)} - \eta \nabla f(0) = \eta\, \frac{L-\mu}{4}\, e_1 \in \operatorname{Span}\{e_1\},$$

that is, only the first coordinate is updated after one iteration. What happens now that we have access to $x^{(0)}$, $\nabla f(x^{(0)})$ and $\nabla f(x^{(1)})$? For an algorithm satisfying Assumption 1, we look at

$$x^{(2)} = x^{(0)} + \alpha \nabla f(x^{(0)}) + \beta \nabla f(x^{(1)}).$$

One can check that for any $(\alpha, \beta)$, $x^{(2)} \in \operatorname{Span}\{e_1, e_2\}$, and by an easy induction, we have $x^{(t)} \in \operatorname{Span}\{e_1, \dots, e_t\}$: any first-order method can update at most one new coordinate at each iteration.
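As a quick sanity check, one can run plain gradient descent (one particular instance of Assumption 1) on this function, reusing the sketch above, and verify that after $t$ iterations at most the first $t$ coordinates are nonzero:

```python
# Plain gradient descent with step size 1/L, started at 0, on f_k^{L,0}.
d = k = 21
L = 1.0
x = np.zeros(d)
for t in range(1, 11):
    x = x - (1.0 / L) * worst_case_grad(x, k, L=L, mu=0.0)
    nonzero = np.flatnonzero(np.abs(x) > 1e-12)
    assert nonzero.size <= t and (nonzero.size == 0 or nonzero.max() <= t - 1)
    print(t, nonzero.size)   # the support grows by at most one coordinate per step
```

We are going to prove the following result.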

Theorem 1 (Lower-bound for smooth convex optimization).

For any $d \ge 2$, $x^{(0)} \in \mathbb{R}^d$, $L > 0$ and $t \le (d-1)/2$, there exists a convex function $f$ that is $C^\infty$ and $L$-smooth such that any sequence satisfying Assumption 1 satisfies

$$f(x^{(t)}) - f(x^\star) \ge \frac{3L\, \|x^{(0)} - x^\star\|^2}{32\, (t+1)^2}, \qquad (2)$$

where $x^\star$ is a minimizer of $f$.

Remark that the rate $1/t^2$ is not achieved by gradient descent (there exist algorithms that achieve it, in particular Nesterov’s acceleration method)! We also have a lower bound for the class of strongly convex functions.

Theorem 2 (Lower-bound for smooth strongly convex optimization).

For any $d \ge 2$, $x^{(0)} \in \mathbb{R}^d$ and $L > \mu > 0$, there exists a $\mu$-strongly convex function $f$ that is $C^\infty$ and $L$-smooth such that any sequence satisfying Assumption 1 satisfies, for all $t < (d-1)/2$,

$$\|x^{(t)} - x^\star\|^2 \ge \frac{1}{8}\left(\frac{\sqrt{K_f}-1}{\sqrt{K_f}+1}\right)^{2t} \|x^{(0)} - x^\star\|^2, \qquad (4)$$

$$f(x^{(t)}) - f(x^\star) \ge \frac{\mu}{16}\left(\frac{\sqrt{K_f}-1}{\sqrt{K_f}+1}\right)^{2t} \|x^{(0)} - x^\star\|^2, \qquad (5)$$

where $x^\star$ is the unique minimizer of $f$ and $K_f = L/\mu$ denotes the condition number of $f$.

Note that it is common in the literature to see Theorem 2 without the factor $\frac{1}{8}$. This is an artefact of the proof: we prove this result in the finite-dimensional case, whereas Nesterov (2018) works in the infinite-dimensional space $\ell^2(\mathbb{N})$. Before proving these important results, due to Nemirovski and Yudin (1983), we are going to prove several lemmas.

Lemma 1 (Minimizers of $f_k$).

Let $d \ge 2$, $L > 0$ and $\mu \ge 0$. Then $f_k^{L,\mu}$ defined in (1) is a $\mu$-strongly convex (merely convex if $\mu = 0$) $C^\infty$ function whose gradient is $L$-Lipschitz.

If $\mu = 0$, it admits a minimizer $x^{k,\star}$ satisfying

$$x^{k,\star}_i = \begin{cases} 1 - \frac{i}{k+1} & \text{if } 1 \le i \le k,\\ 0 & \text{otherwise,}\end{cases} \qquad\text{and}\qquad f_k^{L,0}(x^{k,\star}) = \frac{L}{8(k+1)}.$$

If $\mu > 0$, it has a unique minimizer $x^{k,\star}$ satisfying

$$x^{k,\star}_i = \frac{s^{2(k+1)}}{s^{2(k+1)}-1}\, s^{-i} + \frac{1}{1 - s^{2(k+1)}}\, s^{i},$$

for $1 \le i \le k$, and $x^{k,\star}_i = 0$ for $i > k$, where $s = \frac{\sqrt{K_f}+1}{\sqrt{K_f}-1}$.

Proof.

We drop the exponents $L,\mu$ and write $f_k = f_k^{L,\mu}$. The function $f_k$ being a quadratic function, it is $C^\infty$ and its second-order partial derivatives read

$$\frac{\partial^2 f_k}{\partial x_i \partial x_j}(x) = \mu\, 1_{\{i=j\}} + \frac{L-\mu}{4}\begin{cases} 2 & \text{if } i = j \le k,\\ -1 & \text{if } j = i-1 \text{ and } 1 < i \le k,\\ -1 & \text{if } j = i+1 \text{ and } 1 \le i < k,\\ 0 & \text{otherwise.}\end{cases}$$

Thus, the Hessian matrix is given (for any $x \in \mathbb{R}^d$) by

$$\nabla^2 f_k(x) = \mu\, \mathrm{Id}_d + \frac{L-\mu}{4}\, L_k,$$

where $L_k$ is a (truncated) discrete Laplacian operator with Dirichlet boundary conditions, whose only nonzero entries form a tridiagonal $k \times k$ block:

$$L_k = \begin{pmatrix} \begin{pmatrix} 2 & -1 & & \\ -1 & 2 & \ddots & \\ & \ddots & \ddots & -1 \\ & & -1 & 2 \end{pmatrix} & 0_{k,\,d-k} \\ 0_{d-k,\,k} & 0_{d-k,\,d-k} \end{pmatrix}.$$

Observe that we have (since $f_k$ is a quadratic function)

$$f_k(x) = \frac{1}{2}\left\langle \nabla^2 f_k(x)\, x, x\right\rangle - \frac{L-\mu}{4}\, x_1 + \frac{L-\mu}{8}.$$

Note that:

  1. The Hessian is positive definite (resp. positive semidefinite) if $\mu > 0$ (resp. $\mu = 0$). Indeed,

     $$\left\langle \nabla^2 f_k(x) h, h\right\rangle = \mu\|h\|^2 + \frac{L-\mu}{4}\left\langle L_k h, h\right\rangle.$$

     Since $\langle L_k h, h\rangle = h_1^2 + \sum_{i=1}^{k-1}(h_{i+1}-h_i)^2 + h_k^2 \ge 0$ for any $h$, the result follows depending on the value of $\mu$.

  2. Since $(a-b)^2 \le 2a^2 + 2b^2$, we have

     $$h_1^2 + \sum_{i=1}^{k-1}(h_{i+1}-h_i)^2 + h_k^2 \le h_1^2 + \sum_{i=1}^{k-1}\left(2h_{i+1}^2 + 2h_i^2\right) + h_k^2 \le 4\sum_{i=1}^{k} h_i^2 \le 4\sum_{i=1}^{d} h_i^2 = 4\|h\|^2.$$

     Hence,

     $$\left\langle \nabla^2 f_k(x) h, h\right\rangle \le \mu\|h\|^2 + (L-\mu)\|h\|^2 = L\|h\|^2.$$

Thus, we have $\mu\,\mathrm{Id} \preceq \nabla^2 f_k(x) \preceq L\,\mathrm{Id}$.
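These two bounds are easy to check numerically by assembling the (constant) Hessian and inspecting its spectrum; a minimal sketch with arbitrary values of $d$, $k$, $L$ and $\mu$:

```python
import numpy as np

# Spectrum of the Hessian mu*Id + (L - mu)/4 * L_k for some arbitrary sizes.
d, k, L, mu = 10, 7, 2.0, 0.1
Lap = np.zeros((d, d))
Lap[np.arange(k), np.arange(k)] = 2.0          # diagonal of the Laplacian block
Lap[np.arange(k - 1), np.arange(1, k)] = -1.0  # upper off-diagonal
Lap[np.arange(1, k), np.arange(k - 1)] = -1.0  # lower off-diagonal
H = mu * np.eye(d) + (L - mu) / 4.0 * Lap
eig = np.linalg.eigvalsh(H)
print(eig.min(), eig.max())    # all eigenvalues lie in [mu, L]
```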

Let us now characterize the minimizer $x^{k,\star}$ of $f_k$ over $\mathbb{R}^d$. We aim to solve $\nabla f_k(x^{k,\star}) = 0$ to find a critical point (which will be a minimum since we just proved that the Hessian is positive semidefinite), that is,

$$\mu x^{k,\star} + \frac{L-\mu}{4}\, L_k x^{k,\star} - \frac{L-\mu}{4}\, e_1 = 0.$$

Projecting this relation onto each coordinate $2 \le i \le k-1$, we get

$$-x^{k,\star}_{i+1} + 2x^{k,\star}_i - x^{k,\star}_{i-1} = -\frac{4\mu}{L-\mu}\, x^{k,\star}_i,$$

which leads to

$$x^{k,\star}_i = \frac{1}{2}\, \frac{L-\mu}{L+\mu}\left(x^{k,\star}_{i+1} + x^{k,\star}_{i-1}\right).$$

Similarly, we have

$$x^{k,\star}_1 = \frac{1}{2}\, \frac{L-\mu}{L+\mu}\left(x^{k,\star}_2 + 1\right) \qquad\text{and}\qquad x^{k,\star}_k = \frac{1}{2}\, \frac{L-\mu}{L+\mu}\, x^{k,\star}_{k-1}.$$

Consider $y_0, \dots, y_{k+1}$ defined by $y_i = x^{k,\star}_i$ for $1 \le i \le k$, $y_0 = 1$ and $y_{k+1} = 0$. We have the relation

$$y_i = \alpha\left(y_{i+1} + y_{i-1}\right) \qquad\text{where}\qquad \alpha = \frac{1}{2}\, \frac{L-\mu}{L+\mu} > 0.$$

We can rewrite it as the second-order linear recursion $y_{i+2} - \alpha^{-1} y_{i+1} + y_i = 0$. The associated characteristic polynomial is $P = X^2 - \alpha^{-1} X + 1 \in \mathbb{R}[X]$, whose discriminant is given by

$$\Delta = (\alpha^{-1})^2 - 4 = \frac{16 L\mu}{(L-\mu)^2}.$$

We distinguish two cases:

  1. If $\mu = 0$, then $\Delta = 0$ and the unique (double) root is $r = 1$.

  2. If $\mu > 0$, then the roots are given by

     $$r = \frac{1}{2}\left(\alpha^{-1} - \sqrt{\Delta}\right) = \frac{\sqrt{L/\mu}-1}{\sqrt{L/\mu}+1} = \frac{\sqrt{K_f}-1}{\sqrt{K_f}+1},$$
     $$s = \frac{1}{2}\left(\alpha^{-1} + \sqrt{\Delta}\right) = \frac{\sqrt{K_f}+1}{\sqrt{K_f}-1} = \frac{1}{r}.$$

Case $\mu = 0$. Since $r = 1$ is a double root, we have $y_i = (a + bi)\, r^i = a + bi$ with constraints $y_0 = a = 1$ and $y_{k+1} = a + b(k+1) = 0$. In turn, we have $y_i = 1 - \frac{i}{k+1}$ and thus

$$x^{k,\star}_i = \begin{cases} 1 - \frac{i}{k+1} & \text{if } 1 \le i \le k,\\ 0 & \text{otherwise.}\end{cases}$$

The associated optimal value is given by

$$f_k(x^{k,\star}) = \frac{L}{8}\left(\left(\frac{1}{k+1}\right)^2 + \sum_{i=1}^{k-1}\frac{1}{(k+1)^2} + \left(1 - \frac{k}{k+1}\right)^2\right) = \frac{L}{8}\, \frac{k+1}{(k+1)^2} = \frac{L}{8}\, \frac{1}{k+1}.$$

Case $\mu > 0$. The solution can be written as

$$y_i = a r^i + b s^i \qquad\text{with}\qquad \begin{cases} a + b = 1,\\ a r^{k+1} + b s^{k+1} = 0.\end{cases}$$

Thus, we have $b = 1 - a$, hence (using $r = 1/s$) $\frac{a}{a-1} = s^{2(k+1)} > 0$, and in turn we have

$$a = \frac{s^{2(k+1)}}{s^{2(k+1)}-1} \qquad\text{and}\qquad b = \frac{1}{1 - s^{2(k+1)}}.$$

Hence,

$$y_i = \frac{s^{2(k+1)}}{s^{2(k+1)}-1}\, s^{-i} + \frac{1}{1 - s^{2(k+1)}}\, s^{i},$$

which is the claimed expression. ∎
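To double-check the closed form in the strongly convex case, one can plug it into the gradient sketched earlier (with arbitrary illustrative constants) and verify that it vanishes numerically:

```python
# Check that the closed-form x^{k,*} is a critical point of f_k^{L,mu} (mu > 0).
d, k, L, mu = 15, 9, 2.0, 0.1
s = (np.sqrt(L / mu) + 1.0) / (np.sqrt(L / mu) - 1.0)
i = np.arange(1, k + 1)
a = s ** (2 * (k + 1)) / (s ** (2 * (k + 1)) - 1.0)
b = 1.0 / (1.0 - s ** (2 * (k + 1)))
x_star = np.zeros(d)
x_star[:k] = a * s ** (-i) + b * s ** i
print(np.linalg.norm(worst_case_grad(x_star, k, L=L, mu=mu)))   # numerically zero
```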

We now turn to the proofs of Theorem 1 and Theorem 2.

Proof of Theorem 1.

We restrict our attention to the case where $x^{(0)} = 0$ w.l.o.g. Indeed, if $x^{(0)} \neq 0$, we can set $\tilde f : x \mapsto f(x + x^{(0)})$ and the following proof carries over. Let $d = 2k+1$ and set $f = f_{2k+1}^{L,0}$.

Remark that, since $x^{(t)} \in \operatorname{Span}\{e_1, \dots, e_t\}$,

$$f(x^{(t)}) = f_{2k+1}^{L,0}(x^{(t)}) = f_t^{L,0}(x^{(t)}) \ge f_t^\star,$$

where $f_t^\star$ denotes the minimal value of $f_t^{L,0}$.

Using Lemma 1, we have on one hand that $f_t^\star = \frac{L}{8(t+1)} \ge \frac{L}{8(k+1)}$ (since $t \le k$) and $f(x^\star) = \frac{L}{8 \cdot 2(k+1)}$, hence

$$f(x^{(t)}) - f(x^\star) \ge \frac{L}{8(k+1)} - \frac{L}{8 \cdot 2(k+1)} = \frac{L}{16(k+1)}.$$

On the other hand,

$$\begin{aligned}
\|x^{2k+1,\star} - x^{(0)}\|^2 = \|x^{2k+1,\star}\|^2 &= \sum_{i=1}^{2k+1} \left(x^{2k+1,\star}_i\right)^2 \\
&= \sum_{i=1}^{2k+1} \left(1 - \frac{i}{2(k+1)}\right)^2 \\
&= \sum_{i=1}^{2k+1} 1 - \frac{1}{k+1} \sum_{i=1}^{2k+1} i + \frac{1}{4(k+1)^2} \sum_{i=1}^{2k+1} i^2 \\
&= (2k+1) - \frac{1}{k+1} \cdot \frac{2(k+1)(2k+1)}{2} + \frac{1}{4(k+1)^2} \cdot \frac{2(k+1)(2k+1)(4k+3)}{6} \\
&= \frac{1}{3} \cdot \frac{(2k+1)(4k+3)}{4(k+1)} \\
&\le \frac{2k+1}{3} \le \frac{2}{3}(k+1).
\end{aligned}$$

Thus,

$$\frac{f(x^{(t)}) - f(x^\star)}{\|x^{2k+1,\star} - x^{(0)}\|^2} \ge \frac{\frac{L}{16(k+1)}}{\frac{2}{3}(k+1)} = \frac{3L}{32(k+1)^2},$$

which proves (2) upon taking $k = t$. ∎
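As an illustration (not part of the proof), one can compare the optimality gap of plain gradient descent after $t$ steps on $f_{2t+1}^{L,0}$ with the right-hand side of (2), reusing the helper functions sketched at the beginning of the post:

```python
# Gradient descent on f_{2t+1}^{L,0}: the gap after t steps stays above the bound (2).
L, t = 1.0, 10
d = 2 * t + 1
x_star = 1.0 - np.arange(1, d + 1) / (d + 1.0)        # minimizer of f_d^{L,0}
f_star = L / (8.0 * (d + 1))                          # optimal value from Lemma 1
x = np.zeros(d)
for _ in range(t):
    x = x - (1.0 / L) * worst_case_grad(x, d, L=L, mu=0.0)
gap = worst_case_f(x, d, L=L, mu=0.0) - f_star
bound = 3.0 * L * np.dot(x_star, x_star) / (32.0 * (t + 1) ** 2)
print(gap, bound)    # gap >= bound, as predicted by Theorem 1
```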

Proof of Theorem 2.

The proof follows the same strategy as before, but we start with a bound on the iterates instead of the objective function. Assume that $x^{(0)} = 0$; otherwise consider $\tilde f = f(\cdot + x^{(0)})$. Consider $d = 2k+1$ and $f = f_{2k+1}^{L,\mu}$. We rewrite the coordinates of $x^{2k+1,\star}$ (given by Lemma 1) as

$$x^{2k+1,\star}_i = \frac{s^{4(k+1)}}{s^{4(k+1)}-1}\, s^{-i} + \frac{1}{1 - s^{4(k+1)}}\, s^{i} = s^{-i}\left(1 - \frac{s^{2i}-1}{s^{4(k+1)}-1}\right).$$

On one hand, we have:

$$\|x^{(0)} - x^{2k+1,\star}\|^2 = \sum_{i=1}^{2k+1}\left(x^{2k+1,\star}_i\right)^2 = \sum_{i=1}^{2k+1} s^{-2i}\left(1 - \frac{s^{2i}-1}{s^{4(k+1)}-1}\right)^2 \le \sum_{i=1}^{2k+1} s^{-2i},$$

where we used that for all $1 \le i \le 2k+1$, we have

$$0 \le 1 - \frac{s^{2i}-1}{s^{4(k+1)}-1} \le 1.$$

Bounding the tail of the geometric sum (the last $k$ terms are no larger than the first $k$ ones), we obtain

$$\|x^{(0)} - x^{2k+1,\star}\|^2 \le 2\sum_{i=1}^{k+1} s^{-2i}. \qquad (6)$$

On the other hand, observe that for $t < k+1$, one has $x^{(t)} \in \operatorname{Span}\{e_1, \dots, e_k\}$, thus

$$\|x^{(t)} - x^{2k+1,\star}\|^2 \ge \sum_{i=k+1}^{2k+1}\left(x^{2k+1,\star}_i\right)^2 = \sum_{i=k+1}^{2k+1} s^{-2i}\left(1 - \frac{s^{2i}-1}{s^{4(k+1)}-1}\right)^2.$$

Since $s > 0$, we have

$$1 - \frac{s^{2i}-1}{s^{4(k+1)}-1} \ge 1 - \frac{s^{2(k+1)}-1}{s^{4(k+1)}-1},$$

and in turn,

$$1 \ge \left(1 - \frac{s^{2i}-1}{s^{4(k+1)}-1}\right)^2 \ge \left(1 - \frac{s^{2(k+1)}-1}{s^{4(k+1)}-1}\right)^2 \ge 0.$$

Thus,

$$\begin{aligned}
\|x^{(t)} - x^{2k+1,\star}\|^2 &\ge \left(1 - \frac{s^{2(k+1)}-1}{s^{4(k+1)}-1}\right)^2 \sum_{i=k+1}^{2k+1} s^{-2i} \\
&= \left(1 - \frac{s^{2(k+1)}-1}{s^{4(k+1)}-1}\right)^2 s^{-2k} \sum_{i=1}^{k+1} s^{-2i}. \qquad (7)
\end{aligned}$$

Observe that

$$\left(1 - \frac{s^{2(k+1)}-1}{s^{4(k+1)}-1}\right)^2 = \left(\frac{s^{2(k+1)}}{s^{2(k+1)}+1}\right)^2 \ge \frac{1}{4},$$

since $s \ge 1$.

Combining it with (6) and (7), we have

$$\|x^{(t)} - x^{2k+1,\star}\|^2 \ge \frac{1}{8}\, s^{-2k}\, \|x^{(0)} - x^{2k+1,\star}\|^2 = \frac{1}{8}\left(\frac{\sqrt{K_f}-1}{\sqrt{K_f}+1}\right)^{2k} \|x^{(0)} - x^{2k+1,\star}\|^2,$$

which, upon taking $k = t$, proves (4). The bound (5) on the objective values is obtained by applying the strong convexity inequality

$$f(x) \ge f(x^\star) + \frac{\mu}{2}\|x - x^\star\|^2$$

to $f$ at $x = x^{(t)}$. ∎
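The same kind of numerical illustration works for the strongly convex case: run gradient descent on $f_{2t+1}^{L,\mu}$ and compare the squared distance to the minimizer after $t$ steps with the right-hand side of (4) (again reusing the helpers defined earlier; the constants below are arbitrary):

```python
# Gradient descent on f_{2t+1}^{L,mu}: distance to the minimizer vs the bound (4).
L, mu, t = 1.0, 0.01, 8
d = 2 * t + 1
s = (np.sqrt(L / mu) + 1.0) / (np.sqrt(L / mu) - 1.0)
i = np.arange(1, d + 1)
a = s ** (2 * (d + 1)) / (s ** (2 * (d + 1)) - 1.0)
b = 1.0 / (1.0 - s ** (2 * (d + 1)))
x_star = a * s ** (-i) + b * s ** i                    # minimizer of f_d^{L,mu}
x = np.zeros(d)
for _ in range(t):
    x = x - (1.0 / L) * worst_case_grad(x, d, L=L, mu=mu)
lhs = np.dot(x - x_star, x - x_star)
rhs = (1.0 / 8.0) * s ** (-2 * t) * np.dot(x_star, x_star)
print(lhs, rhs)     # lhs >= rhs, as predicted by Theorem 2
```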

More refined versions of these lower bounds, which provide tighter constants, can be found in Drori and Taylor (2022). However, these improvements come at the price of significantly more involved and technical proofs.

References