Following the steepest direction

February 8th, 2024

This content is available as a PDF here.


Suppose that you want to solve the problem

$$\min_{x \in [-1,1]^d} f(x),$$

and you want to achieve an $\epsilon$-precision on the objective function $f$, i.e., you want to obtain an estimate $\hat{x} \in \mathbb{R}^d$ of a minimizer $x^\star$ such that $f(\hat{x}) - f(x^\star) \leq \epsilon$. Since $f$ is continuous, you cannot have access to all the values $f(x)$ when $x$ ranges over the constraint set $[-1,1]^d$, but you can allow yourself multiple calls to the "oracle" $f(x)$ in order to achieve this precision.

A naive way to do so is to consider a discretization of $[-1,1]$ of precision $\epsilon$, that is

$$G_\epsilon = \left\{ k\epsilon \mid k \in \{-\epsilon^{-1}, \dots, \epsilon^{-1}\} \right\},$$

and $G_\epsilon^d = G_\epsilon \times \dots \times G_\epsilon$ its counterpart in dimension $d$. It is easy to see that taking $\hat{x}$ a minimizer of $f$ over $G_\epsilon^d$ satisfies the property $f(\hat{x}) - f(x^\star) \leq L\epsilon$ as soon as $f$ is $L$-Lipschitz. So what's the deal?

The issue is that the size of $G_\epsilon^d$ grows exponentially fast with the dimension $d$. In dimension 1, one needs to perform $2\epsilon^{-1}$ evaluations of $f$, but in dimension $d$, one needs $(2\epsilon^{-1})^d$! This quantity becomes prohibitively large very quickly: say you aim at a very modest precision of $\epsilon = 10^{-2}$; then in dimension $d = 1$, only 200 evaluations of $f$ are needed, whereas in dimension $d = 10$, you already need $200^{10} \approx 10^{23}$ evaluations.
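A quick back-of-the-envelope computation (a minimal sketch; `grid_size` is just an illustrative helper, not something from the text) makes the blow-up concrete:

```python
# Number of grid points of a discretization of [-1, 1]^d with precision eps.
def grid_size(eps: float, d: int) -> int:
    per_axis = round(2 / eps)    # roughly 2 * eps^{-1} points per axis
    return per_axis ** d         # grows exponentially with the dimension d

print(grid_size(1e-2, 1))    # 200
print(grid_size(1e-2, 10))   # 200**10, about 1e23
```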

This is known as the curse of dimensionality, a term introduced by (Bellman, Richard E., 1961) [Chapter V] for optimal control and (over)used since then in the optimization, statistics and machine learning communities (Donoho, David L, 2000). Summary: we need to be (slightly) more clever.

1 Descent direction

Consider the class of algorithms of the form

$$x^{(t+1)} = x^{(t)} + \eta^{(t)} u^{(t+1)},$$

how should we "optimally" choose the direction $u^{(t+1)}$ such that $x^{(t+1)}$ is closer to a minimum of $f$?

  • Definition 1. Let $f : \mathbb{R}^d \to \mathbb{R}$. A descent direction $u \in \mathbb{R}^d$ of $f$ at $x \in \mathbb{R}^d$ is a vector such that

    $$\exists \epsilon > 0, \ \forall \eta \in (0, \epsilon), \quad f(x + \eta u) < f(x).$$

Figure 1: Descent cone.

Observe that if $f$ is differentiable, this property is equivalent to $\langle \nabla f(x), u \rangle < 0$. The negative gradient $-\nabla f(\bar{x})$ is the direction of steepest descent at the point $\bar{x}$.

  • Proposition 1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be differentiable at $\bar{x} \in \mathbb{R}^d$ with $\nabla f(\bar{x}) \neq 0$. Then, the problem

    $$\min_{\|u\| = 1} \langle \nabla f(\bar{x}), u \rangle$$

    has a unique solution $u^\star = -\dfrac{\nabla f(\bar{x})}{\|\nabla f(\bar{x})\|}$.

  • Proof. Let $u \in \mathbb{R}^d$ be such that $\|u\| = 1$. Consider the function $\phi_u : \mathbb{R} \to \mathbb{R}$ defined by

    $$\phi_u(\eta) = f(\bar{x} + \eta u).$$

    The function $\phi_u$ is differentiable at $0$ and its derivative reads $\phi_u'(0) = \langle \nabla f(\bar{x}), u \rangle$. Denoting by $\theta$ the angle between $\nabla f(\bar{x})$ and $u$, we have $\phi_u'(0) = \|\nabla f(\bar{x})\| \cos\theta$. Hence, minimizing $\langle \nabla f(\bar{x}), u \rangle$ amounts to minimizing $\cos\theta$, that is, taking $\theta = (2k+1)\pi$ for some $k \in \mathbb{Z}$. Thus, we have

    $$u^\star = -\frac{\nabla f(\bar{x})}{\|\nabla f(\bar{x})\|} \quad \text{and} \quad \langle \nabla f(\bar{x}), u^\star \rangle = -\|\nabla f(\bar{x})\|.$$
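As a quick numerical illustration (a sketch; the random gradient vector and all names are arbitrary), one can compare $\langle \nabla f(\bar{x}), u \rangle$ over random unit directions with the value attained by the normalized negative gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=5)                  # stands for the gradient at x_bar
u_star = -grad / np.linalg.norm(grad)      # steepest-descent direction

dirs = rng.normal(size=(1000, 5))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # random unit directions
print(np.all(dirs @ grad >= grad @ u_star))           # True: none does better
```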

2 The gradient descent (GD) algorithm

Proposition 1 gives rise to what is known as the gradient descent algorithm or gradient descent method.

  • Require: Initialization $x^{(0)} \in \mathbb{R}^d$, step-size policy $\eta^{(t)} > 0$.

  • for $t = 0, 1, 2, \dots$ do

  • $x^{(t+1)} \leftarrow x^{(t)} - \eta^{(t)} \nabla f(x^{(t)})$

  • end for
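As an illustration, here is a minimal NumPy sketch of this scheme with a constant step size (the names `grad_f`, `x0`, and `eta` are placeholders, not notation from the text):

```python
import numpy as np

def gradient_descent(grad_f, x0, eta, n_iter=1000):
    """Iterate x_{t+1} = x_t - eta * grad_f(x_t) for n_iter steps."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        x = x - eta * grad_f(x)     # move along the steepest-descent direction
    return x

# Example: minimize f(x) = ||x||^2 / 2, whose gradient is x.
print(gradient_descent(lambda x: x, x0=np.ones(3), eta=0.1))   # close to 0
```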

The choice of the step-sizes $\eta^{(t)}$ is crucial, and there are several ways to make it.

  • Predetermined. In this case, the sequence $(\eta^{(t)})$ is chosen beforehand, either as a constant step size $\eta^{(t)} = \eta \in \mathbb{R}_{>0}$ or as a given function of $t$, $\eta^{(t)} = g(t)$, e.g., $g(t) = \eta\,(t+1)^{-1/2}$ for some $\eta > 0$. For some classes of optimization problems, it is possible to provide guarantees depending on the choice of $g$.

  • Oracle. This is the “optimal” descent that a gradient descent method can produce, defined by

$$\eta_{\mathrm{oracle}}^{(t)} = \operatorname*{argmin}_{\eta \geq 0} f\big(x^{(t)} - \eta \nabla f(x^{(t)})\big).$$

Remark that this choice is only theoretical since it involves a new optimization problem that may not be solvable in closed form (one notable exception, quadratic objectives, is sketched right after this list).

  • Backtracking rule. Another time…
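For a quadratic objective, such as the least-squares problem studied below, the oracle step does have a closed form: writing $g = \nabla f(x)$ and $H$ for the (constant) Hessian, minimizing $\eta \mapsto f(x - \eta g)$ gives $\eta_{\mathrm{oracle}} = \|g\|^2 / \langle g, Hg \rangle$. A minimal sketch, assuming $f(x) = \frac{1}{2}\langle x, Hx \rangle - \langle b, x \rangle$ with $H$ symmetric positive definite:

```python
import numpy as np

def oracle_step(H, b, x):
    """Exact line search for f(x) = 0.5 * x.T @ H @ x - b.T @ x."""
    g = H @ x - b                              # gradient at x
    return float(g @ g) / float(g @ (H @ g))   # argmin over eta of f(x - eta * g)
```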

3 Rates of convergence

Means of convergence

In optimization, we consider several types of (worst-case) convergence:

  • Convergence in norms of the iterates.

Assuming that there exists a minimizer $x^\star$, the convergence in norm (for a given norm, typically the Euclidean one) looks at an $\epsilon$-solution in the sense of

$$r^{(t)} = \|x^{(t)} - x^\star\| \leq \epsilon.$$

Note that there may be several minimizers, but we only consider one given minimizer.

  • Convergence to the set of minimizers.

Here, we look at the distance to any minimizer:

$$r^{(t)} = d(x^{(t)}, S) = \inf_{x^\star \in S} \|x^{(t)} - x^\star\| \leq \epsilon,$$

where $S$ is the set of minimizers.

  • Convergence in value (0-order).

Assuming that the infimum of the problem is given by $f^\star = \inf_{x \in \operatorname{dom} f} f(x)$, an $\epsilon$-solution is given by

$$r^{(t)} = f(x^{(t)}) - f^\star \leq \epsilon,$$

and assuming $x^\star$ is a solution, this is the quantity $r^{(t)} = f(x^{(t)}) - f(x^\star)$.

  • First-order convergence.

If $f$ is differentiable, one can look at the gradient of the iterates as

$$r^{(t)} = \|\nabla f(x^{(t)})\| \leq \epsilon.$$

The idea is to use the first-order condition: the smaller $\|\nabla f(x^{(t)})\|$, the closer we are to a critical point of $f$ if $f$ is nonconvex, and to a global minimum if $f$ is convex.

Rates versus complexity

To be more precise, we consider R-rates below; there is an alternative definition (Q-convergence).

  • Sublinear rates.

We say that $r^{(t)}$ enjoys a sublinear rate if it has an upper bound that is a power function of $t$, i.e., $r^{(t)} \leq c\,t^{-\alpha}$ where $c > 0$ and $\alpha > 0$. Typical examples are stochastic gradient descent, which enjoys a $O(1/\sqrt{t})$ rate for convex functions, or gradient descent, which enjoys a $O(1/t)$ rate for convex $L$-smooth functions. In terms of complexity, if $r^{(t)}$ enjoys a $c\,t^{-\alpha}$ rate, its complexity bound is $(c/\epsilon)^{1/\alpha}$. For instance, SGD for convex functions has complexity $O((c/\epsilon)^2)$.

  • Linear rates.

We say that $r^{(t)}$ enjoys a linear rate if it has an upper bound that is an exponential function of $t$, i.e., $r^{(t)} \leq c\,e^{-qt}$ for $c > 0$ and $q \in (0,1]$. Note that typically, proofs do not yield the exponential bound directly, but an even more precise one, $r^{(t)} \leq c(1-q)^t$, which is in turn bounded by $c\,e^{-qt}$. In terms of complexity, we have the bound $\frac{1}{q}\big(\log c + \log\frac{1}{\epsilon}\big)$.

  • Quadratic rates.

We say that $r^{(t)}$ enjoys a quadratic rate if it satisfies a bound of the form $r^{(t+1)} \leq c\,(r^{(t)})^2$ for $c > 0$. In terms of complexity, we have the bound $\log\log\frac{1}{\epsilon}$: each iteration doubles the precision of the estimate! Note that quadratic rates are a special case of a wider class known as superlinear convergence.
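To get a feel for these three regimes, here is a small back-of-the-envelope computation (a sketch with $c = 1$ and arbitrary constants) of the number of iterations needed to reach $\epsilon = 10^{-6}$:

```python
import math

eps = 1e-6
# Sublinear O(1/sqrt(t)) rate: complexity (c / eps)^(1 / alpha) with alpha = 1/2.
print((1.0 / eps) ** 2)                   # 1e12 iterations
# Linear rate with q = 0.1: complexity (log c + log(1 / eps)) / q.
print(math.log(1.0 / eps) / 0.1)          # about 138 iterations
# Quadratic rate: roughly log log(1 / eps) iterations.
print(math.log2(math.log2(1.0 / eps)))    # about 4-5 iterations
```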

4 Explicit analysis of the gradient descent: ordinary least-squares

This part is heavily inspired by (Bach, Francis, 2021, Chapter V). Let $n, p \geq 1$, $A \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^n$. The least-squares estimator is defined by the problem

$$\operatorname*{argmin}_{x \in \mathbb{R}^p} \; f(x) = \frac{1}{2n}\|Ax - y\|_2^2. \tag{1}$$

The objective $f$ is twice differentiable; its gradient reads

$$\nabla f(x) = \frac{1}{n} A^\top (Ax - y) \in \mathbb{R}^p,$$

and its Hessian is constant on the whole space:

$$\nabla^2 f(x) = H = \frac{1}{n} A^\top A \in \mathbb{R}^{p \times p}.$$

Note that there is at least one minimizer of problem (1). The first-order condition for a potential minimizer $x^\star$ is given by

$$H x^\star = \frac{1}{n} A^\top y.$$

A necessary and sufficient condition to have exactly one minimizer is that $H$ is invertible.

The following lemma shows that, provided $H$ is known, the $t$-th iterate of gradient descent with fixed step size can be computed in closed form.

  • Lemma 1. For all $t \in \mathbb{N}$, we have

    $$x^{(t)} - x^\star = (I - \eta H)^t (x^{(0)} - x^\star). \tag{2}$$

  • Proof. Using the expression of the gradient, we have for any minimizer $x^\star$

    $$x^{(t+1)} = x^{(t)} - \frac{\eta}{n} A^\top (A x^{(t)} - y) = x^{(t)} - \eta H (x^{(t)} - x^\star).$$ Subtracting $x^\star$ from both sides gives

    $$x^{(t+1)} - x^\star = (I - \eta H)(x^{(t)} - x^\star).$$

    This identity is a linear recursion, and unrolling it leads to the result.
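As a sanity check, the closed form (2) can be verified numerically on a random instance (a sketch; the sizes, the seed, and the step size $1/L$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
A, y = rng.normal(size=(n, p)), rng.normal(size=n)
H = A.T @ A / n
x_star = np.linalg.solve(H, A.T @ y / n)          # unique minimizer (H invertible here)

eta, T = 1.0 / np.linalg.eigvalsh(H).max(), 100   # constant step size
x = np.zeros(p)
for _ in range(T):
    x = x - eta * A.T @ (A @ x - y) / n           # gradient descent step

closed_form = x_star + np.linalg.matrix_power(np.eye(p) - eta * H, T) @ (-x_star)
print(np.allclose(x, closed_form))                # True
```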

Convergence in norm

We denote by $\mu$ the smallest eigenvalue of $H$, by $L$ its largest, and we let $\kappa = L/\mu$ be its condition number. We have $0 \leq \mu \leq L$ and $\kappa \geq 1$.

Using (2), we have

$$\|x^{(t)} - x^\star\|^2 = \left\langle x^{(0)} - x^\star, (I - \eta H)^{2t} (x^{(0)} - x^\star) \right\rangle.$$

Remark that the eigenvalues of $(I - \eta H)^{2t}$ are exactly given by $(1 - \eta\lambda)^{2t}$, where $\lambda$ is an eigenvalue of $H$. Since all eigenvalues of $H$ are contained in $[\mu, L]$, one can also bound any $\rho \in \operatorname{eigen}\big((I - \eta H)^{2t}\big)$ by

$$|\rho| \leq \left( \max_{\lambda \in [\mu, L]} |1 - \eta\lambda| \right)^{2t}.$$

Suppose that there are multiple minimizers. In this case, $H$ is not invertible and its smallest eigenvalue $\mu$ is equal to $0$. Hence, there exists $v \neq 0$ such that $(I - \eta H)v = v$. Choose a minimizer $x^\star$. Using $x^{(0)} = v + x^\star$, we have

$$\|x^{(t)} - x^\star\|^2 = \left\langle v + x^\star - x^\star, (I - \eta H)^{2t}(v + x^\star - x^\star) \right\rangle = \left\langle v, (I - \eta H)^{2t} v \right\rangle = \|v\|_2^2.$$

Hence, the method is not necessarily convergent, depending on the initialization.

The situation is different when we assume that there is a unique minimizer.

  • Proposition 2. Assume that (1) has a unique minimizer $x^\star$. Then, the gradient descent iterates $x^{(t)}$ with constant step size $\eta^{(t)} = \eta = \frac{2}{\mu + L}$ converge linearly in norm:

    $$\|x^{(t)} - x^\star\|^2 \leq \left( \frac{\kappa - 1}{\kappa + 1} \right)^{2t} \|x^{(0)} - x^\star\|^2.$$

  • Proof. Since there is a unique minimizer $x^\star$, we have $\mu > 0$. Observe that (using a spectral norm bound)

    $$\left\langle x^{(0)} - x^\star, (I - \eta H)^{2t}(x^{(0)} - x^\star) \right\rangle \leq \left( \max_{\lambda \in [\mu, L]} |1 - \eta\lambda| \right)^{2t} \|x^{(0)} - x^\star\|_2^2.$$

    Observe that $\max_{\lambda \in [\mu, L]} |1 - \eta\lambda|$ is minimized for $\eta = \frac{2}{\mu + L}$, and its value is then $\frac{\kappa - 1}{\kappa + 1} \in (0, 1)$. Indeed,

    $$\max_{\lambda \in [\mu, L]} |1 - \eta\lambda| = \max\big(|1 - \eta\mu|, |1 - \eta L|\big).$$

    Hence,

    $$\min_{\eta > 0} \max_{\lambda \in [\mu, L]} |1 - \eta\lambda| = \min_{\eta > 0} \max\big(|1 - \eta\mu|, |1 - \eta L|\big).$$

    Moreover, we have the following chain of equivalences:

    $$|1 - \eta L| \geq |1 - \eta\mu| \iff (1 - \eta L)^2 \geq (1 - \eta\mu)^2 \iff \eta L^2 - 2L \geq \eta\mu^2 - 2\mu \iff \eta \geq \frac{2}{L + \mu}.$$ The minimum is thus achieved where the two curves $|1 - \eta L|$ and $|1 - \eta\mu|$ intersect, that is, at $\eta = \frac{2}{\mu + L}$, where both equal $1 - \frac{2\mu}{\mu + L} = \frac{L - \mu}{L + \mu} = \frac{\kappa - 1}{\kappa + 1}$.

Choice independent of the strong convexity constant $\mu$: it is possible to choose a smaller step size (hence a slower method) by taking the suboptimal value $\eta = 1/L$. In this case, we get

$$\max_{\lambda \in [\mu, L]} |1 - \eta\lambda| = 1 - \frac{\mu}{L} = 1 - \frac{1}{\kappa}.$$

In both cases, we obtained a linear (or geometric, or exponential, depending on the community) rate

$$\|x^{(t)} - x^\star\|^2 \leq c^{2t}\,\|x^{(0)} - x^\star\|^2,$$

where $c = \frac{\kappa - 1}{\kappa + 1}$ or $c = 1 - \frac{1}{\kappa}$ depending on the choice of the step size.

It is also possible to speak in terms of iteration complexity. Assuming the choice $\eta = 1/L$, observe that

$$\left(1 - \frac{1}{\kappa}\right)^{2t} \leq \exp\left(-\frac{1}{\kappa}\right)^{2t} = \exp\left(-\frac{2t}{\kappa}\right).$$

Thus, to reduce the initial error to a fraction $\epsilon\,\|x^{(0)} - x^\star\|^2$, it suffices to perform $t$ iterations given by

$$\epsilon = \exp\left(-\frac{2t}{\kappa}\right) \iff t = \frac{\kappa}{2}\log\frac{1}{\epsilon}.$$
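A small numerical experiment (a sketch; I use a generic quadratic $\frac{1}{2}\langle x, Hx\rangle - \langle b, x\rangle$ whose Hessian has a prescribed spectrum, which is equivalent to the least-squares setting) makes the contraction visible: with $\eta = \frac{2}{\mu + L}$, the error $\|x^{(t)} - x^\star\|$ shrinks by at most a factor $\frac{\kappa - 1}{\kappa + 1}$ per iteration.

```python
import numpy as np

rng = np.random.default_rng(1)
p, mu, L = 20, 0.5, 10.0
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))
H = Q @ np.diag(np.linspace(mu, L, p)) @ Q.T    # Hessian with spectrum in [mu, L]
b = rng.normal(size=p)
x_star = np.linalg.solve(H, b)                  # minimizer of 0.5 x'Hx - b'x

eta, x, errs = 2.0 / (mu + L), np.zeros(p), []
for _ in range(50):
    errs.append(np.linalg.norm(x - x_star))
    x = x - eta * (H @ x - b)                   # gradient step

ratios = np.array(errs[1:]) / np.array(errs[:-1])
kappa = L / mu
print(np.all(ratios <= (kappa - 1) / (kappa + 1) + 1e-12))   # True
```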

Convergence in value

As we said, if $\mu = 0$, one cannot establish the convergence of the iterates. But the story is different for the convergence in value.

  • Proposition 3. Let $x^\star$ be any solution of (1). The gradient descent iterates $x^{(t)}$ with constant step size $\eta^{(t)} = \eta = \frac{1}{L}$ converge with a $O(1/t)$ rate:

    $$f(x^{(t)}) - f(x^\star) \leq \frac{1}{4t\eta}\,\|x^{(0)} - x^\star\|_2^2.$$

    If, moreover, (1) has a unique solution, the convergence is linear:

    $$f(x^{(t)}) - f(x^\star) \leq \left(1 - \frac{1}{\kappa}\right)^{2t}\big(f(x^{(0)}) - f(x^\star)\big).$$

  • Proof. Observe that $f$ has an exact Taylor expansion of order 2. Indeed, for any $x \in \mathbb{R}^p$ and any minimizer $x^\star$:

    $$f(x) - f(x^\star) = \langle \nabla f(x^\star), x - x^\star \rangle + \frac{1}{2}\langle x - x^\star, H(x - x^\star) \rangle = \frac{1}{2}\langle x - x^\star, H(x - x^\star) \rangle. \tag{3}$$

    Using the (exact) Taylor expansion (3) of f, we have

    $$f(x^{(t)}) - f(x^\star) = \frac{1}{2}\left\langle x^{(t)} - x^\star, H(x^{(t)} - x^\star) \right\rangle.$$

    Using (2), we have

    $$f(x^{(t)}) - f(x^\star) = \frac{1}{2}\left\langle (I - \eta H)^t (x^{(0)} - x^\star), H (I - \eta H)^t (x^{(0)} - x^\star) \right\rangle.$$

    Since $(I - \eta H)$ is symmetric and commutes with $H$, we have

    $$f(x^{(t)}) - f(x^\star) = \frac{1}{2}\left\langle x^{(0)} - x^\star, (I - \eta H)^{2t} H (x^{(0)} - x^\star) \right\rangle.$$

    Using the fact that, for a symmetric positive semidefinite matrix $M$ commuting with $H$, we have $\langle u, MHu \rangle \leq \|M\|_{\mathrm{sp}} \langle u, Hu \rangle$, we get

    $$f(x^{(t)}) - f(x^\star) \leq \left\|(I - \eta H)^{2t}\right\|_{\mathrm{sp}} \cdot \frac{1}{2}\left\langle x^{(0)} - x^\star, H(x^{(0)} - x^\star) \right\rangle.$$

    Using again (3), we have

    $$f(x^{(t)}) - f(x^\star) \leq \left\|(I - \eta H)^{2t}\right\|_{\mathrm{sp}}\,\big(f(x^{(0)}) - f(x^\star)\big).$$

    Case $\mu > 0$. According to the previous section, with $\eta = \frac{1}{L}$ we then have the following linear rate:

    $$f(x^{(t)}) - f(x^\star) \leq \left(1 - \frac{1}{\kappa}\right)^{2t}\big(f(x^{(0)}) - f(x^\star)\big).$$

    Case $\mu = 0$. Let us study the eigenvalues of $(I - \eta H)^{2t} H$ (note the presence of $H$ after $(I - \eta H)^{2t}$). We keep our analysis restricted to $\eta \leq \frac{1}{L}$. For any eigenvalue $\lambda \in [0, L]$ of $H$,

    $$\left|\lambda(1 - \eta\lambda)^{2t}\right| \leq \lambda \exp(-\eta\lambda)^{2t} = \lambda\exp(-2t\eta\lambda) = \frac{1}{2t\eta}\,\underbrace{2t\eta\lambda\exp(-2t\eta\lambda)}_{\text{insert the term } 2t\eta} \leq \frac{1}{2t\eta}\,\underbrace{\sup_{\rho \geq 0}\rho\exp(-\rho)}_{\text{crude bound}} = \underbrace{\frac{1}{2et\eta}}_{\text{maximum at } \rho = 1} \leq \underbrace{\frac{1}{4t\eta}}_{e > 2}.$$ Hence, we obtain the claimed $O(1/t)$ rate.

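Numerically, the $O(1/t)$ behaviour in the degenerate case can be checked on a rank-deficient design matrix (a sketch; the instance is random and $x^\star$ is taken as the minimum-norm solution):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, r = 40, 10, 4
A = rng.normal(size=(n, r)) @ rng.normal(size=(r, p))   # rank-deficient, so mu = 0
y = rng.normal(size=n)
f = lambda x: 0.5 * np.sum((A @ x - y) ** 2) / n
x_star = np.linalg.pinv(A) @ y                           # one minimizer of (1)
f_star = f(x_star)

eta = 1.0 / np.linalg.eigvalsh(A.T @ A / n).max()        # step size 1/L
x = np.zeros(p)
for t in range(1, 201):
    x = x - eta * A.T @ (A @ x - y) / n
    assert f(x) - f_star <= np.sum(x_star ** 2) / (4 * t * eta) + 1e-12
print("O(1/t) bound of Proposition 3 verified")
```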

5 Convergence of gradient descent for nonconvex functions

Unfortunately, not every function is quadratic, and we need to dive deeper into the analysis of the gradient descent algorithm to understand its behavior on generic functions. We are going to explore several classes of functions that lead to different rates, either in norm or in objective values. We start with some "weak" results on functions with limited regularity and without convexity assumptions.

Coercive continuously differentiable function

The following proposition shows a very weak result, without any quantification of the rate of convergence of gradient descent.

  • Proposition 4. Convergence of GD for coercive $C^1$ functions

    Let $f : \mathbb{R}^d \to \mathbb{R}$ be a coercive function, and assume moreover that it is continuously differentiable. For any initial guess $x^{(0)}$, the oracle GD iterates $x^{(t)}$ converge, up to a subsequence, towards a critical point of $f$.

  • Proof. Step 1. The value sequence is (strictly) decreasing (unless it reaches a critical point). We recall that the oracle GD chooses the step size $\eta^{(t)}$ as

    $$\eta^{(t)} = \operatorname*{argmin}_{\eta \geq 0} \phi^{(t)}(\eta) := f\big(x^{(t)} - \eta\nabla f(x^{(t)})\big).$$

    Hence, evaluating $\phi^{(t)}$ at $\eta^{(t)}$ and at $0$ leads to

    $$f(x^{(t+1)}) = \phi^{(t)}(\eta^{(t)}) \leq \phi^{(t)}(0) = f(x^{(t)}).$$

    Suppose that $f(x^{(t+1)}) = f(x^{(t)})$. Then $0$ is a minimizer of $\phi^{(t)}$ and therefore its first-order condition reads $\frac{d}{d\eta}\phi^{(t)}(0) = 0$. Given $\eta \geq 0$, we have

    $$\frac{d}{d\eta}\phi^{(t)}(\eta) = -\left\langle \nabla f(x^{(t)}), \nabla f\big(x^{(t)} - \eta\nabla f(x^{(t)})\big) \right\rangle.$$

    Evaluating this expression at $\eta = 0$ gives $-\langle \nabla f(x^{(t)}), \nabla f(x^{(t)}) \rangle = 0$, hence $\nabla f(x^{(t)}) = 0$, which means that $x^{(t)}$ is a critical point.

    Step 2. The iterates sequence converges (up to a subsequence). Consider the sublevel set $S = \{x \in \mathbb{R}^d \mid f(x) \leq f(x^{(0)})\}$. Remark that:

    • $S$ is defined by a non-strict inequality on the continuous function $f$, thus $S$ is closed;

    • $f$ is coercive, thus $S$ is bounded.

    Hence, $S$ is compact. Observe that for all $t \geq 0$, $x^{(t)} \in S$. Using the Bolzano–Weierstrass theorem, we get that the sequence $(x^{(t)})$ has a subsequence converging towards some $x^\infty$. Let us denote it by $(x^{(\psi(t))})_{t \geq 0}$, where $\psi : \mathbb{N} \to \mathbb{N}$ is increasing.

    Step 3. The limit of the iterates sequence is a critical point. Suppose that $x^\infty$ is not a critical point. Using Step 1, we know that $x^+ := x^\infty - \eta^\star \nabla f(x^\infty) \neq x^\infty$, where $\eta^\star > 0$ is the oracle step at $x^\infty$, satisfies $f(x^+) < f(x^\infty)$. Consider the sequence $y^{(t)} = x^{(\psi(t))}$ converging to $x^\infty$. Remark that, using the continuity of the map $z \mapsto z - \eta^\star \nabla f(z)$ for a continuously differentiable function $f$, we have

    $$\lim_{t \to +\infty}\left(y^{(t)} - \eta^\star\nabla f(y^{(t)})\right) = x^\infty - \eta^\star\nabla f(x^\infty) = x^+. \tag{$\ast$}$$

    By definition of the oracle step, and since the value sequence $(f(x^{(t)}))$ is nonincreasing with limit $f(x^\infty)$, we also have for all $t \geq 0$ that

    $$\phi^{(\psi(t))}(\eta^\star) \geq \phi^{(\psi(t))}\big(\eta^{(\psi(t))}\big) \geq f(x^\infty).$$

    Hence,

    $$f\big(y^{(t)} - \eta^\star\nabla f(y^{(t)})\big) \geq f(x^\infty).$$

    Using ($\ast$) and the continuity of $f$, we get that $f(x^+) \geq f(x^\infty)$, which is a contradiction. In conclusion, $x^\infty$ is a critical point.

Remark that if, in addition, $f$ has a unique critical point, i.e., a unique global minimizer, then $x^{(t)}$ converges to this global minimizer.

  • Corollary 1. Let $f : \mathbb{R}^d \to \mathbb{R}$ be a coercive, continuously differentiable function with a unique critical point, its global minimizer $x^\star$. For any initial guess $x^{(0)}$, the oracle GD iterates $x^{(t)}$ converge towards $x^\star$.

Reminders on L-smooth functions

Recall that a (full-domain) differentiable function $f : \mathbb{R}^n \to \mathbb{R}$ is said to be $L$-smooth if $\nabla f$ is $L$-Lipschitz. $L$-smoothness is important in optimization because it ensures a quadratic upper bound on the function.

  • Proposition 5. Let $f$ be an $L$-smooth function. Then, for all $x, y \in \mathbb{R}^n$,

    $$\langle \nabla f(x) - \nabla f(y), x - y \rangle \leq L\|x - y\|^2. \tag{4}$$

    Moreover, (4) is equivalent to

    $$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|x - y\|^2, \qquad \forall x, y \in \mathbb{R}^n. \tag{5}$$
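A quick numerical check of the quadratic upper bound (5), on the least-squares objective whose smoothness constant is $L = \lambda_{\max}(A^\top A)/n$ (a sketch on a random instance):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 30, 8
A, b = rng.normal(size=(n, p)), rng.normal(size=n)
f = lambda x: 0.5 * np.sum((A @ x - b) ** 2) / n
grad = lambda x: A.T @ (A @ x - b) / n
L_smooth = np.linalg.eigvalsh(A.T @ A / n).max()

for _ in range(1000):
    x, z = rng.normal(size=p), rng.normal(size=p)
    upper = f(x) + grad(x) @ (z - x) + 0.5 * L_smooth * np.sum((z - x) ** 2)
    assert f(z) <= upper + 1e-10
print("quadratic upper bound (5) holds on all samples")
```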

Smooth function bounded from below

The previous result was rather weak. Consider now a function $f : \mathbb{R}^d \to \mathbb{R}$ that is $L$-smooth, but not necessarily convex. In addition, we suppose that $f$ has full domain and is bounded from below on $\mathbb{R}^d$. The following proposition gives a $O(1/\sqrt{T})$ rate on the first-order optimality using a constant step size. A similar proof is doable for the oracle or the backtracking gradient descent.

  • Proposition 6. Convergence of GD for $L$-smooth functions

    Let $f : \mathbb{R}^d \to \mathbb{R}$ be an $L$-smooth function, bounded from below by $\bar{f} \in \mathbb{R}$. For any initial guess $x^{(0)}$, the GD iterates $x^{(t)}$ with step size $\eta^{(t)} = \eta < \frac{2}{L}$ satisfy $\|\nabla f(x^{(t)})\| \to 0$, and moreover

    $$\min_{0 \leq t \leq T}\|\nabla f(x^{(t)})\| \leq \frac{1}{\sqrt{T+1}}\left(\omega^{-1} L\,\big(f(x^{(0)}) - \bar{f}\big)\right)^{1/2},$$

    where $\omega = 2\alpha(1-\alpha)$ and $\eta = \frac{2\alpha}{L}$.

  • Proof. From Nesterov, p. 29.

    Step 1. Bound on one step of gradient descent. For $x \in \mathbb{R}^d$, let $y = x - \eta\nabla f(x)$. Using Proposition 5, we have that

    $$f(y) \leq f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\|y - x\|^2 = f(x) - \eta\|\nabla f(x)\|^2 + \frac{L\eta^2}{2}\|\nabla f(x)\|^2 = f(x) - \eta\Big(1 - \frac{L\eta}{2}\Big)\|\nabla f(x)\|^2.$$ To ensure that $f(y) \leq f(x)$, we need to require that $\eta\big(1 - \frac{L\eta}{2}\big) \geq 0$, that is, $\eta \leq \frac{2}{L}$. We can parameterize $\eta$ as $\eta = \frac{2\alpha}{L}$ with $\alpha \in (0, 1)$. Hence, we get the descent

    $$f(x) - f(y) \geq \frac{2\alpha(1-\alpha)}{L}\|\nabla f(x)\|^2. \tag{6}$$

    Step 2. Convergence of the first-order optimality. Applying (6) to $x = x^{(t)}$ and $y = x^{(t+1)}$, we get that

    $$f(x^{(t)}) - f(x^{(t+1)}) \geq \frac{2\alpha(1-\alpha)}{L}\|\nabla f(x^{(t)})\|^2. \tag{7}$$

    Summing inequality (7) for $t = 0, \dots, T$, we obtain

    $$\sum_{t=0}^{T}\big(f(x^{(t)}) - f(x^{(t+1)})\big) \geq \frac{2\alpha(1-\alpha)}{L}\sum_{t=0}^{T}\|\nabla f(x^{(t)})\|^2.$$

    Observing that the left-hand side telescopes, we have

    $$\frac{2\alpha(1-\alpha)}{L}\sum_{t=0}^{T}\|\nabla f(x^{(t)})\|^2 \leq f(x^{(0)}) - f(x^{(T+1)}) \leq f(x^{(0)}) - \bar{f}, \tag{8}$$

    by definition of $\bar{f}$. Since (8) is true for all $T \in \mathbb{N}$, the series $\sum_t \|\nabla f(x^{(t)})\|^2$ converges, hence $\nabla f(x^{(t)})$ converges towards $0$.

    Step 3. Rate of convergence. Using the fact that for all $t = 0, \dots, T$ we have $\|\nabla f(x^{(t)})\| \geq \min_{0 \leq s \leq T}\|\nabla f(x^{(s)})\|$, we obtain from (8) that

    $$\frac{2\alpha(1-\alpha)}{L}(T+1)\min_{0 \leq t \leq T}\|\nabla f(x^{(t)})\|^2 \leq f(x^{(0)}) - \bar{f}.$$

    Hence, the claimed result.

Note that, in this context, we cannot say anything about the rate of convergence of $f(x^{(t)})$ or of the iterates $(x^{(t)})$! One can show that $\eta = \frac{1}{L}$ (i.e., $\alpha = \frac{1}{2}$) is the optimal choice with respect to the bound in (6).
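The $1/\sqrt{T}$ behaviour of the best gradient norm can be checked numerically on a nonconvex smooth toy function (a sketch; the test function $f(x) = \sum_i \big(x_i^2/2 + 2\sin x_i\big)$, which is $3$-smooth and bounded from below by $-2d$, is an arbitrary choice):

```python
import numpy as np

# Nonconvex, 3-smooth test function, bounded below by f_bar = -2 * d (sin >= -1).
f = lambda x: np.sum(x ** 2 / 2 + 2 * np.sin(x))
grad = lambda x: x + 2 * np.cos(x)
L_smooth, d = 3.0, 50

rng = np.random.default_rng(4)
x0 = 5.0 * rng.normal(size=d)
f_bar = -2.0 * d
eta = 1.0 / L_smooth                 # alpha = 1/2, so omega = 1/2
x, g_best = x0.copy(), np.inf
for T in range(200):
    g_best = min(g_best, np.linalg.norm(grad(x)))
    bound = np.sqrt(2 * L_smooth * (f(x0) - f_bar) / (T + 1))   # Proposition 6
    assert g_best <= bound
    x = x - eta * grad(x)
print("Proposition 6 bound verified for T = 0, ..., 199")
```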

6 Convergence of gradient descent for convex functions

Co-coercivity of the gradient of a convex L-smooth function

Convex $L$-smooth functions have co-coercive gradients. This result is sometimes called the Baillon–Haddad theorem [@baillon1977quelquesproprietes], but one should note that the original contribution is much more general than its application to the gradient.

  • Proposition 7. Let $f$ be a full-domain convex $L$-smooth function. Then, $\nabla f$ is $1/L$-co-coercive, i.e.,

    $$\forall x, y \in \mathbb{R}^n, \quad \langle \nabla f(x) - \nabla f(y), x - y \rangle \geq \frac{1}{L}\|\nabla f(x) - \nabla f(y)\|^2.$$
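As with (5), this inequality is easy to test numerically on the least-squares objective (a sketch on random pairs of points):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 30, 8
A, b = rng.normal(size=(n, p)), rng.normal(size=n)
grad = lambda x: A.T @ (A @ x - b) / n
L_smooth = np.linalg.eigvalsh(A.T @ A / n).max()

for _ in range(1000):
    x, y = rng.normal(size=p), rng.normal(size=p)
    g = grad(x) - grad(y)
    assert g @ (x - y) >= (g @ g) / L_smooth - 1e-10
print("co-coercivity verified on all samples")
```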

Convex L-smooth function

When $f$ is assumed to be convex, we can obtain a rate of convergence in objective values.

  • Proposition 8. Rate of GD for convex $L$-smooth functions

    Let $f : \mathbb{R}^d \to \mathbb{R}$ be a convex $L$-smooth function. For any initial guess $x^{(0)}$, the GD iterates $x^{(t)}$ with step size $\eta^{(t)} = \eta \in \left(0, \frac{2}{L}\right)$ converge towards an optimal point $x^\star$ of $f$, and moreover

    $$f(x^{(t)}) - f^\star \leq \frac{2\big(f(x^{(0)}) - f^\star\big)\,\|x^{(0)} - x^\star\|^2}{t\,\eta(2 - L\eta)\big(f(x^{(0)}) - f^\star\big) + 2\|x^{(0)} - x^\star\|^2}.$$

    Moreover, if $\eta = \frac{1}{L}$, then

    $$f(x^{(t)}) - f^\star \leq \frac{2L\|x^{(0)} - x^\star\|^2}{t + 4}.$$

  • Proof. From Nesterov, p. 80.

    Let $x^\star$ be a minimizer of $f$. We consider $r^{(t)} = \|x^{(t)} - x^\star\|$ the lack of optimality.¹ We have

    $$(r^{(t+1)})^2 = \|x^{(t)} - x^\star - \eta\nabla f(x^{(t)})\|^2 = (r^{(t)})^2 - 2\eta\langle \nabla f(x^{(t)}), x^{(t)} - x^\star \rangle + \eta^2\|\nabla f(x^{(t)})\|^2.$$ Using the co-coercivity of the gradient (Proposition 7), we have that

    $$\big\langle \nabla f(x^{(t)}) - \underbrace{\nabla f(x^\star)}_{=0}, x^{(t)} - x^\star \big\rangle \geq \frac{1}{L}\big\|\nabla f(x^{(t)}) - \underbrace{\nabla f(x^\star)}_{=0}\big\|^2.$$

    Thus,

    $$(r^{(t+1)})^2 \leq (r^{(t)})^2 - \eta\left(\frac{2}{L} - \eta\right)\|\nabla f(x^{(t)})\|^2.$$

    In particular, since $\eta < \frac{2}{L}$, we get that $r^{(t)} \leq r^{(0)}$ for all $t \geq 0$.

    Using the smoothness of f, Proposition 5 gives us the bound

    $$f(x^{(t+1)}) \leq f(x^{(t)}) + \langle \nabla f(x^{(t)}), x^{(t+1)} - x^{(t)} \rangle + \frac{L}{2}\|x^{(t+1)} - x^{(t)}\|^2.$$ Substituting $x^{(t+1)} - x^{(t)} = -\eta\nabla f(x^{(t)})$, we have that

    $$f(x^{(t+1)}) \leq f(x^{(t)}) - \eta\Big(1 - \frac{L\eta}{2}\Big)\|\nabla f(x^{(t)})\|^2.$$

    Now, let $\delta^{(t)} = f(x^{(t)}) - f(x^\star)$. Using the differential characterization of convexity,

    $$\forall x, \bar{x} \in \mathbb{R}^d, \quad f(x) \geq f(\bar{x}) + \langle \nabla f(\bar{x}), x - \bar{x} \rangle,$$

    we have

    $$\delta^{(t)} \leq \langle \nabla f(x^{(t)}), x^{(t)} - x^\star \rangle.$$

    Using the Cauchy–Schwarz inequality, we have

    $$\delta^{(t)} \leq \|\nabla f(x^{(t)})\|\,\|x^{(t)} - x^\star\| = r^{(t)}\|\nabla f(x^{(t)})\| \leq r^{(0)}\|\nabla f(x^{(t)})\|.$$

    Hence, we have

    $$\delta^{(t+1)} \leq \delta^{(t)} - \frac{q}{(r^{(0)})^2}\,(\delta^{(t)})^2,$$

    where $q = \eta\big(1 - \frac{L\eta}{2}\big)$. Multiplying both sides by $\frac{1}{\delta^{(t)}\delta^{(t+1)}} > 0$, we get that

    $$\frac{1}{\delta^{(t)}} \leq \frac{1}{\delta^{(t+1)}} - \frac{q}{(r^{(0)})^2}\,\frac{\delta^{(t)}}{\delta^{(t+1)}}.$$

    Since $(f(x^{(t)}))$ is nonincreasing, we have $\frac{\delta^{(t)}}{\delta^{(t+1)}} \geq 1$, hence

    $$\frac{1}{\delta^{(t+1)}} \geq \frac{1}{\delta^{(t)}} + \frac{q}{(r^{(0)})^2}.$$

    Summing these inequalities for $t = 0, \dots, T-1$, we obtain

    $$\frac{1}{\delta^{(T)}} \geq \frac{1}{\delta^{(0)}} + \frac{Tq}{(r^{(0)})^2}.$$

    We conclude using a bit of computation:

    $$\delta^{(T)} \leq \left(\frac{1}{\delta^{(0)}} + \frac{Tq}{(r^{(0)})^2}\right)^{-1} = \left(\frac{(r^{(0)})^2 + Tq\,\delta^{(0)}}{\delta^{(0)}(r^{(0)})^2}\right)^{-1} = \frac{\delta^{(0)}(r^{(0)})^2}{(r^{(0)})^2 + Tq\,\delta^{(0)}}.$$

    Replacing $q$, $r^{(0)}$ and $\delta^{(0)}$ by their expressions gives the result.

    Maximizing the descent factor $\eta(2 - L\eta)$ gives the optimal step size $\eta = \frac{1}{L}$. Injecting this value (so that $q = \frac{1}{2L}$), we get that

    $$\delta^{(T)} \leq \frac{2L\,\delta^{(0)}(r^{(0)})^2}{2L(r^{(0)})^2 + T\delta^{(0)}}.$$

    Using the smoothness bound of Proposition 5 at $x^\star$, we have

    $$f(x^{(0)}) \leq f^\star + \langle \nabla f(x^\star), x^{(0)} - x^\star \rangle + \frac{L}{2}(r^{(0)})^2 = f^\star + \frac{L}{2}(r^{(0)})^2, \quad \text{i.e.,} \quad \delta^{(0)} \leq \frac{L}{2}(r^{(0)})^2.$$

    Thus,

    $$2L(r^{(0)})^2 + T\delta^{(0)} \geq (4 + T)\,\delta^{(0)},$$

    which is the claimed result.

¹ Note that we do not assume that $x^{(t)}$ converges towards this specific $x^\star$.
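The $\frac{2L\|x^{(0)} - x^\star\|^2}{t+4}$ bound can also be checked on a non-quadratic convex smooth function, e.g. $f(x) = \sum_i \sqrt{1 + x_i^2}$, which is convex, $1$-smooth, and minimized at $x^\star = 0$ (a sketch; the function and constants are arbitrary choices):

```python
import numpy as np

f = lambda x: np.sum(np.sqrt(1.0 + x ** 2))     # convex, 1-smooth, minimum d at x = 0
grad = lambda x: x / np.sqrt(1.0 + x ** 2)
d, L_smooth = 20, 1.0

rng = np.random.default_rng(6)
x0 = 3.0 * rng.normal(size=d)
f_star, r0_sq = float(d), np.sum(x0 ** 2)

x, eta = x0.copy(), 1.0 / L_smooth
for t in range(1, 501):
    x = x - eta * grad(x)
    assert f(x) - f_star <= 2 * L_smooth * r0_sq / (t + 4)   # Proposition 8
print("Proposition 8 bound verified")
```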

Strongly convex L-smooth function

When $f$ is both $L$-smooth and $\mu$-strongly convex, we have the following "strengthened" co-coercivity.

  • Proposition 9. Let $f$ be an $L$-smooth and $\mu$-strongly convex function. For any $x, y \in \mathbb{R}^n$, we have

    $$\langle \nabla f(x) - \nabla f(y), x - y \rangle \geq \frac{\mu L}{\mu + L}\|x - y\|^2 + \frac{1}{\mu + L}\|\nabla f(x) - \nabla f(y)\|^2.$$
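Again, this can be sanity-checked numerically, here on a ridge-regularized least-squares objective whose strong convexity and smoothness constants are the extreme eigenvalues of its Hessian (a sketch on a random instance):

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, lam = 30, 8, 0.5
A, b = rng.normal(size=(n, p)), rng.normal(size=n)
H = A.T @ A / n + lam * np.eye(p)                  # Hessian of the ridge objective
grad = lambda x: A.T @ (A @ x - b) / n + lam * x
mu, L_smooth = np.linalg.eigvalsh(H).min(), np.linalg.eigvalsh(H).max()

for _ in range(1000):
    x, y = rng.normal(size=p), rng.normal(size=p)
    g, d = grad(x) - grad(y), x - y
    rhs = mu * L_smooth / (mu + L_smooth) * (d @ d) + (g @ g) / (mu + L_smooth)
    assert g @ d >= rhs - 1e-10
print("strengthened co-coercivity verified on all samples")
```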

We now prove the linear convergence of gradient descent when dealing with strongly convex functions.

  • Proposition 10. Rate of GD for strongly convex smooth functions

    Let $f : \mathbb{R}^d \to \mathbb{R}$ be a $\mu$-strongly convex $L$-smooth function. For any initial guess $x^{(0)}$, the GD iterates $x^{(t)}$ with step size $\eta^{(t)} = \eta \in \left(0, \frac{2}{\mu + L}\right]$ converge towards the optimal point $x^\star$ of $f$, and moreover

    $$\|x^{(t)} - x^\star\|^2 \leq \left(1 - \frac{2\eta\mu L}{\mu + L}\right)^{t}\|x^{(0)} - x^\star\|^2.$$

    Moreover, if $\eta = \frac{2}{\mu + L}$, then

    $$\|x^{(t)} - x^\star\|^2 \leq \left(\frac{K_f - 1}{K_f + 1}\right)^{2t}\|x^{(0)} - x^\star\|^2 \quad \text{and} \quad f(x^{(t)}) - f^\star \leq \frac{L}{2}\left(\frac{K_f - 1}{K_f + 1}\right)^{2t}\|x^{(0)} - x^\star\|^2,$$ where $K_f = L/\mu$.

  • Proof. From Nesterov, p. 101. Let $r^{(t)} = \|x^{(t)} - x^\star\|$. We start like the proof of Proposition 8:

    $$(r^{(t+1)})^2 = \|x^{(t)} - x^\star - \eta\nabla f(x^{(t)})\|^2 = (r^{(t)})^2 - 2\eta\langle \nabla f(x^{(t)}), x^{(t)} - x^\star \rangle + \eta^2\|\nabla f(x^{(t)})\|^2.$$ Using Proposition 9, we have

    $$\langle \nabla f(x^{(t)}) - \nabla f(x^\star), x^{(t)} - x^\star \rangle \geq \frac{\mu L}{\mu + L}\|x^{(t)} - x^\star\|^2 + \frac{1}{\mu + L}\|\nabla f(x^{(t)}) - \nabla f(x^\star)\|^2,$$

    that is, since $\nabla f(x^\star) = 0$,

    $$\langle \nabla f(x^{(t)}), x^{(t)} - x^\star \rangle \geq \frac{\mu L}{\mu + L}(r^{(t)})^2 + \frac{1}{\mu + L}\|\nabla f(x^{(t)})\|^2.$$

    Hence,

    $$(r^{(t+1)})^2 \leq \left(1 - \frac{2\eta\mu L}{\mu + L}\right)(r^{(t)})^2 + \eta\left(\eta - \frac{2}{\mu + L}\right)\|\nabla f(x^{(t)})\|^2.$$ Since $\eta - \frac{2}{\mu + L} \leq 0$, we can drop the last term, and we obtain the recursion $(r^{(t+1)})^2 \leq (1 - q)(r^{(t)})^2$ with $q = \frac{2\eta\mu L}{\mu + L}$, which proves the first claim (linear rate).

    Now let $\eta = \frac{2}{\mu + L}$. We have

    $$1 - q = 1 - \frac{4\mu L}{(\mu + L)^2} = \frac{(L - \mu)^2}{(L + \mu)^2} = \left(\frac{K_f - 1}{K_f + 1}\right)^2.$$

    The last inequality is yet another use of Proposition 5.

Bach, Francis (2021). Learning Theory from First Principles.

Bellman, Richard E. (1961). Adaptive Control Processes, Princeton University Press.

Donoho, David L (2000). High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality, Citeseer.