Following the steepest direction
February 8th, 2024
This content is available as a PDF here.
Suppose that you want to solve the problem
$$\min_{x \in [0,1]^d} f(x),$$
and you want to achieve a given accuracy $\varepsilon > 0$ on the value, i.e., find $\hat{x}$ such that $f(\hat{x}) - \min f \leq \varepsilon$.
A naive way to do so is to consider a discretization of $[0,1]^d$ with a grid of mesh size proportional to $\varepsilon$ and to evaluate $f$ at every node of the grid. The issue is that the size of this grid scales as $(1/\varepsilon)^d$: it blows up exponentially with the dimension $d$.
This is known as the curse of dimensionality, a term introduced by (Bellman, 1961, Chapter V) for optimal control and (over)used since then in the optimization, statistics and machine learning communities (Donoho, 2000). Summary: we need to be (slightly) more clever.
1 Descent direction
Consider the class of algorithms of the form
$$x_{k+1} = x_k + \gamma_k d_k,$$
where $\gamma_k > 0$ is a step-size and $d_k \in \mathbb{R}^d$ a direction. A natural question is: how to “optimally choose” the direction $d_k$? Observe that if $\langle \nabla f(x_k), d_k \rangle < 0$, then $d_k$ is a descent direction: for $\gamma_k$ small enough, $f(x_{k+1}) < f(x_k)$. The following proposition identifies the steepest of these directions.

Proposition 1. Let $f$ be differentiable at $x$ with $\nabla f(x) \neq 0$. Then the minimizer of $d \mapsto \langle \nabla f(x), d \rangle$ over the unit ball is $d = -\nabla f(x) / \|\nabla f(x)\|$.

Proof. Let $x$ be such that $\nabla f(x) \neq 0$. Consider the function $\varphi$ defined by $\varphi(t) = f(x + t d)$ for a unit-norm direction $d$. The function $\varphi$ is differentiable at $0$ and its derivative reads $\varphi'(0) = \langle \nabla f(x), d \rangle$. Denoting $\theta$ the angle between $\nabla f(x)$ and $d$, we have that $\langle \nabla f(x), d \rangle = \|\nabla f(x)\| \cos \theta$. Hence, minimizing over unit-norm directions $d$ is equivalent to finding the minimum of $\cos \theta$, that is $\theta = \pi$, i.e., $d = -\lambda \nabla f(x)$ for some $\lambda > 0$. Thus, we have $d = -\nabla f(x)/\|\nabla f(x)\|$. □
2 The gradient descent (GD) algorithm
Proposition 1 gives birth to what is known as the gradient descent algorithm or gradient descent method.
Require: Initialization $x_0 \in \mathbb{R}^d$, step-size policy $(\gamma_k)_{k \in \mathbb{N}}$.
for $k = 0, 1, 2, \ldots$ do
  $x_{k+1} = x_k - \gamma_k \nabla f(x_k)$
end for
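To make the scheme concrete, here is a minimal Python sketch of the loop above (the function names and the quadratic test function are my own illustration, not part of the original text):

```python
import numpy as np

def gradient_descent(grad, x0, step_size, n_iters=100):
    """Run x_{k+1} = x_k - gamma_k * grad(x_k).

    `step_size` maps the iteration counter k to gamma_k, covering both
    constant and predetermined policies."""
    x = np.asarray(x0, dtype=float)
    for k in range(n_iters):
        x = x - step_size(k) * grad(x)
    return x

# Sanity check on f(x) = 0.5 * ||x||^2, whose gradient is x and minimizer is 0:
# with gamma = 0.5, each step halves the iterate (linear convergence).
x_final = gradient_descent(grad=lambda x: x, x0=np.ones(3),
                           step_size=lambda k: 0.5, n_iters=20)
print(np.linalg.norm(x_final))  # ~ sqrt(3) * 0.5**20
```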
The choice of the step-sizes $(\gamma_k)$ is crucial; several strategies exist (a small numerical sketch follows the list).

• Predetermined. In this case, the sequence $(\gamma_k)$ is chosen beforehand, either with a constant step-size $\gamma_k \equiv \gamma$ or with a given function of $k$, e.g., $\gamma_k = C/\sqrt{k+1}$ for some $C > 0$. For some classes of optimization problems, it is possible to provide guarantees depending on the choice of $(\gamma_k)$.

• Oracle. This is the “optimal” descent that a gradient descent method can produce, defined by
$$\gamma_k \in \operatorname*{arg\,min}_{\gamma \geq 0} f(x_k - \gamma \nabla f(x_k)).$$
Remark that this choice is only theoretical since it involves a new optimization problem that may not be solvable in closed form.

• Backtracking rule. Another time…
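For a quadratic objective, the oracle step-size is in fact computable in closed form: minimizing $\gamma \mapsto f(x - \gamma g)$ for $f(x) = \frac{1}{2} x^\top H x - b^\top x$ and $g = \nabla f(x)$ gives $\gamma^\star = \frac{g^\top g}{g^\top H g}$. A sketch, valid only in this assumed quadratic setting:

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
H = M @ M.T + np.eye(5)            # symmetric positive definite Hessian
b = rng.standard_normal(5)

x = np.zeros(5)
for _ in range(30):
    g = H @ x - b                  # gradient of 0.5 x^T H x - b^T x
    gamma = (g @ g) / (g @ H @ g)  # exact line search (oracle step)
    x = x - gamma * g
print(np.linalg.norm(H @ x - b))   # gradient norm, close to 0
```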
3 Rates of convergence
Means of convergence
In optimization, we consider several types of (worst-case) convergences:
• Convergence in norms of the iterates.
Assuming that there exists a minimizer $x^\star$, we look at the convergence of $\|x_k - x^\star\|$ towards 0.
Note that there may be several minimizers, but we only consider one given minimizer.

• Convergence to the set of minimizers.
Here, we look at the distance to any minimizer:
$$\operatorname{dist}(x_k, \operatorname*{arg\,min} f) = \inf_{x^\star \in \operatorname*{arg\,min} f} \|x_k - x^\star\|,$$
where $\operatorname*{arg\,min} f$ denotes the set of minimizers of $f$.

• Convergence in value (0-order).
Assuming that the infimum of the problem is given by some finite $f^\star$ and assuming it is attained, we look at the convergence of $f(x_k) - f^\star$ towards 0.

• First-order convergence.
If $f$ is differentiable, we look at the convergence of $\|\nabla f(x_k)\|$ towards 0.
The idea is to use the first-order condition, and say that the smaller $\|\nabla f(x_k)\|$ is, the closer $x_k$ is to a critical point.
Rates versus complexity
To be more precise, we consider a nonnegative sequence $(u_k)_{k \in \mathbb{N}}$ converging to 0, e.g., $u_k = \|x_k - x^\star\|$ or $u_k = f(x_k) - f^\star$, and we quantify its speed of convergence (illustrated numerically after the list).

• Sublinear rates.
We say that $(u_k)$ converges (at least) sublinearly if there exist $C > 0$ and $\alpha > 0$ such that $u_k \leq C / k^\alpha$.

• Linear rates.
We say that $(u_k)$ converges (at least) linearly if there exist $C > 0$ and $\rho \in (0, 1)$ such that $u_k \leq C \rho^k$.

• Quadratic rates.
We say that $(u_k)$ converges (at least) quadratically if there exists $C > 0$ such that $u_{k+1} \leq C u_k^2$.
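To get a feel for how different these regimes are, here is a small numerical illustration (the three sequences are hypothetical examples, chosen for the demo):

```python
# Toy sequences: sublinear 1/k, linear (1/2)^k, and quadratic
# u_{k+1} = u_k^2 started at 1/2 (its digits of accuracy double each step).
u_lin, u_quad = 1.0, 0.5
for k in range(1, 11):
    u_sub = 1.0 / k
    u_lin *= 0.5
    u_quad = u_quad ** 2
print(u_sub, u_lin, u_quad)  # 0.1 vs ~9.8e-4 vs ~5.6e-309
```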
4 Explicit analysis of the gradient descent: ordinary least-squares
This part is heavily inspired from (Bach, 2021, Chapter V). Let $A \in \mathbb{R}^{n \times d}$ and $b \in \mathbb{R}^n$, and consider the ordinary least-squares problem
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{2n} \|Ax - b\|^2. \quad (1)$$
The objective $f$ is a quadratic function; its gradient reads
$$\nabla f(x) = \frac{1}{n} A^\top (Ax - b),$$
and its Hessian is constant on the whole space:
$$\nabla^2 f(x) = \frac{1}{n} A^\top A =: H.$$
Note that there is at least one minimizer to the problem (1). The first-order condition for a potential minimizer $x^\star$ reads
$$A^\top A x^\star = A^\top b \quad \text{(the normal equations)}.$$
A necessary and sufficient condition to have exactly one minimizer is that $H$ is invertible, i.e., that $A$ has full column rank.
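As a sanity check of these formulas, one can compare the analytical gradient with finite differences (the $\frac{1}{2n}$ normalization above is the one assumed in this sketch):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)

f = lambda x: 0.5 * np.sum((A @ x - b) ** 2) / n
grad = lambda x: A.T @ (A @ x - b) / n       # analytical gradient

# Central finite differences along each canonical direction.
x, eps = rng.standard_normal(d), 1e-6
fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
               for e in np.eye(d)])
print(np.max(np.abs(fd - grad(x))))          # ~ 1e-10: the formulas agree
```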
The following lemma shows that, given the knowledge of a minimizer $x^\star$, the gradient descent iterates with constant step-size $\gamma$ admit a closed form.

Lemma. For any minimizer $x^\star$ of (1) and any $k \in \mathbb{N}$,
$$x_k - x^\star = (I - \gamma H)^k (x_0 - x^\star). \quad (2)$$

Proof. Using the expression of the gradient, we have for any minimizer $x^\star$ that $\nabla f(x_k) = H (x_k - x^\star)$, hence $x_{k+1} = x_k - \gamma H (x_k - x^\star)$. Subtracting $x^\star$ from both sides gives
$$x_{k+1} - x^\star = (I - \gamma H)(x_k - x^\star).$$
This identity is a linear recursion, and unrolling it leads to the result. □
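The identity (2) is easy to verify numerically; a sketch under the same normalization as above:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 30, 4
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)
H, g = A.T @ A / n, A.T @ b / n
x_star = np.linalg.solve(H, g)        # unique minimizer (A has full rank here)

gamma, k, x = 0.1, 15, np.zeros(d)
for _ in range(k):
    x = x - gamma * (H @ x - g)       # gradient step on f = ||Ax - b||^2 / (2n)

# Closed form: x_k = x* + (I - gamma H)^k (x_0 - x*), with x_0 = 0 here.
closed = x_star + np.linalg.matrix_power(np.eye(d) - gamma * H, k) @ (-x_star)
print(np.max(np.abs(x - closed)))     # ~ 1e-15: the recursion unrolls exactly
```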
Convergence in norm
We denote by $0 \leq \mu = \lambda_1 \leq \cdots \leq \lambda_d = L$ the (nonnegative) eigenvalues of the Hessian $H$.
Using (2), we have
$$\|x_k - x^\star\| \leq \|(I - \gamma H)^k\|_2 \, \|x_0 - x^\star\| = \max_{1 \leq i \leq d} |1 - \gamma \lambda_i|^k \, \|x_0 - x^\star\|.$$
Remark that the eigenvalues of $I - \gamma H$ are exactly the $1 - \gamma \lambda_i$.
Suppose that there are multiple minimizers. In this case, $\mu = 0$ and $I - \gamma H$ has 1 as an eigenvalue: the component of $x_0 - x^\star$ lying in $\ker H$ is left untouched by the iterations.
Hence, the method is not necessarily convergent towards the chosen $x^\star$, depending on the initialization.
The situation is different when we assume that there is a unique minimizer.
Proposition 2. Assume that (1) has a unique minimizer $x^\star$, i.e., $\mu > 0$. Then, the gradient descent iterates with constant step-size $\gamma = \frac{2}{\mu + L}$ converge linearly in norm:
$$\|x_k - x^\star\| \leq \left( \frac{L - \mu}{L + \mu} \right)^k \|x_0 - x^\star\|.$$

Proof. Since there is a unique minimizer $x^\star$, we have $\mu > 0$. Observe that (using a spectral norm bound)
$$\|x_k - x^\star\| \leq \max_{1 \leq i \leq d} |1 - \gamma \lambda_i|^k \, \|x_0 - x^\star\| \leq \max(|1 - \gamma \mu|, |1 - \gamma L|)^k \, \|x_0 - x^\star\|.$$
Observe that $\gamma \mapsto \max(|1 - \gamma \mu|, |1 - \gamma L|)$ is minimized for $\gamma = \frac{2}{\mu + L}$, and its value there is $\frac{L - \mu}{L + \mu}$. Indeed, $\gamma \mapsto |1 - \gamma \mu|$ is decreasing on $[0, 1/\mu]$ while $\gamma \mapsto |1 - \gamma L|$ is increasing on $[1/L, +\infty)$. Moreover, we have the following set of equivalences:
$$1 - \gamma \mu = \gamma L - 1 \iff \gamma = \frac{2}{\mu + L} \iff \max(|1 - \gamma \mu|, |1 - \gamma L|) = \frac{L - \mu}{L + \mu}.$$
This minimum is achieved when the two curves $\gamma \mapsto |1 - \gamma \mu|$ and $\gamma \mapsto |1 - \gamma L|$ intersect. □
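Numerically, the contraction factor $\max(|1 - \gamma \mu|, |1 - \gamma L|)$ behaves as announced; for instance, with $\mu = 1$ and $L = 10$ (values chosen only for illustration):

```python
# Contraction factor for the two classical step-size choices.
mu, L = 1.0, 10.0
for gamma in (1 / L, 2 / (mu + L)):
    rho = max(abs(1 - gamma * mu), abs(1 - gamma * L))
    print(f"gamma = {gamma:.4f} -> rho = {rho:.4f}")
# gamma = 0.1000 -> rho = 0.9000  (= 1 - mu/L)
# gamma = 0.1818 -> rho = 0.8182  (= (L - mu)/(L + mu))
```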
Choice independent of the strong-convexity constant
In both cases, we obtained a linear (or geometric, or exponential, depending on the community) rate
$$\|x_k - x^\star\| \leq \rho^k \|x_0 - x^\star\|,$$
where $\rho = \frac{L - \mu}{L + \mu}$ for the tuned step-size $\gamma = \frac{2}{\mu + L}$, and $\rho = 1 - \frac{\mu}{L}$ for the choice $\gamma = \frac{1}{L}$, which has the advantage of not requiring the knowledge of $\mu$.
It is also possible to speak in terms of iteration complexity. Assuming the choice $\gamma = \frac{1}{L}$, we have
$$\|x_k - x^\star\| \leq \left(1 - \frac{\mu}{L}\right)^k \|x_0 - x^\star\| \leq e^{-k \mu / L} \|x_0 - x^\star\|.$$
Thus, to obtain a fraction $\varepsilon$ of the initial distance, it is sufficient to perform $k \geq \frac{L}{\mu} \log(1/\varepsilon)$ iterations.
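For instance, with condition number $\kappa = L/\mu = 10$ and a target fraction $\varepsilon = 10^{-6}$ (numbers chosen for illustration):

```python
import math

kappa, eps = 10.0, 1e-6
k = math.ceil(kappa * math.log(1 / eps))  # sufficient number of iterations
print(k)                                  # 139
print((1 - 1 / kappa) ** k <= eps)        # True: the bound is met
```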
Convergence in value
As we saw, if there are multiple minimizers (i.e., $\mu = 0$), convergence in norm towards a given minimizer may fail. Convergence in value, however, still holds, albeit at a slower, sublinear rate.

Proposition 3. Consider the gradient descent iterates on (1) with constant step-size $\gamma \leq \frac{1}{L}$. If $\mu > 0$, then
$$f(x_k) - f(x^\star) \leq (1 - \gamma \mu)^{2k} \left( f(x_0) - f(x^\star) \right).$$
If $\mu = 0$, then
$$f(x_k) - f(x^\star) \leq \frac{\|x_0 - x^\star\|^2}{4 e k \gamma}.$$
Proof. Observe that $f$ has an exact Taylor expansion of order 2. Indeed, for any $x$ and any minimizer $x^\star$:
$$f(x) = f(x^\star) + \frac{1}{2} \langle x - x^\star, H (x - x^\star) \rangle. \quad (3)$$
Using the (exact) Taylor expansion (3) of $f$ at $x_k$, we have
$$f(x_k) - f(x^\star) = \frac{1}{2} \langle x_k - x^\star, H (x_k - x^\star) \rangle.$$
Using (2), we have
$$f(x_k) - f(x^\star) = \frac{1}{2} \langle (I - \gamma H)^k (x_0 - x^\star), H (I - \gamma H)^k (x_0 - x^\star) \rangle.$$
Since $H$ is symmetric, we have
$$f(x_k) - f(x^\star) = \frac{1}{2} \langle x_0 - x^\star, H (I - \gamma H)^{2k} (x_0 - x^\star) \rangle.$$
Using the fact that for a symmetric matrix $M$ we have $\langle z, M z \rangle \leq \lambda_{\max}(M) \|z\|^2$, we have
$$f(x_k) - f(x^\star) \leq \frac{1}{2} \max_{1 \leq i \leq d} \lambda_i (1 - \gamma \lambda_i)^{2k} \, \|x_0 - x^\star\|^2.$$
Case $\mu > 0$. Since $H$ and $(I - \gamma H)^{2k}$ are simultaneously diagonalizable, we also have $\langle z, H (I - \gamma H)^{2k} z \rangle \leq \max_i (1 - \gamma \lambda_i)^{2k} \langle z, H z \rangle$. Using again (3), we have, according to the previous section, the following linear rate:
$$f(x_k) - f(x^\star) \leq (1 - \gamma \mu)^{2k} \left( f(x_0) - f(x^\star) \right).$$
Case $\mu = 0$. Let us study the eigenvalues of $H (I - \gamma H)^{2k}$ [gradientdescent-1], that is, the quantities $\lambda_i (1 - \gamma \lambda_i)^{2k}$. We keep our analysis restricted to $\gamma \leq 1/L$, so that $1 - \gamma \lambda_i \in [0, 1]$. Using $1 - u \leq e^{-u}$ and $\sup_{t \geq 0} t e^{-2 k \gamma t} = \frac{1}{2 e k \gamma}$, we get
$$\lambda (1 - \gamma \lambda)^{2k} \leq \lambda e^{-2 k \gamma \lambda} \leq \frac{1}{2 e k \gamma}.$$
Hence, we obtain the claimed rate. □
[gradientdescent-1]: Note the presence of the extra factor $H$ in $H (I - \gamma H)^{2k}$, compared to the matrix $(I - \gamma H)^k$ appearing in the analysis of convergence in norm.
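The $\mu = 0$ bound can be checked numerically on a quadratic whose spectrum fills $(0, L]$; the diagonal construction below and the constant $\frac{\|x_0 - x^\star\|^2}{4 e k \gamma}$ follow the reconstruction above and are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
d = 50
H = np.diag(np.linspace(1e-4, 1.0, d))  # eigenvalues spread in (0, L], L = 1
L, gamma = 1.0, 1.0                     # step-size gamma = 1/L

x_star = rng.standard_normal(d)
x = x_star + rng.standard_normal(d)     # x_0
r0_sq = np.sum((x - x_star) ** 2)
f_gap = lambda x: 0.5 * (x - x_star) @ H @ (x - x_star)

for k in range(1, 201):
    x = x - gamma * H @ (x - x_star)    # grad f(x) = H (x - x*)
    assert f_gap(x) <= r0_sq / (4 * np.e * k * gamma)
print("value gap below ||x0 - x*||^2 / (4 e k gamma) for k = 1..200")
```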
5 Convergence of gradient descent for nonconvex functions
Unfortunately, not every function is quadratic, and we need to dive deeper into the analysis of the gradient descent algorithm to understand its behavior on generic functions. We are going to explore several classes of functions that lead to different rates, either in norm or in objective value. We start with some “weak” results on functions with limited regularity and without convexity assumptions.
Coercive continuously differentiable function
The following proposition shows a very weak result, without any quantification of the rate of convergence of gradient descent.

Proposition 4. Let $f$ be continuously differentiable and coercive (i.e., $f(x) \to +\infty$ as $\|x\| \to +\infty$). Then the sequence of gradient descent iterates with the oracle step-size policy admits a subsequence converging to a critical point of $f$.
Proof. Step 1. The value sequence is (strictly) decreasing (unless it reaches a critical point). We recall that the oracle GD chooses the step-size $\gamma_k$ as
$$\gamma_k \in \operatorname*{arg\,min}_{\gamma \geq 0} q_k(\gamma), \quad \text{where } q_k(\gamma) = f(x_k - \gamma \nabla f(x_k)).$$
Hence, evaluating $q_k$ at $\gamma_k$ and at 0 leads to $f(x_{k+1}) \leq f(x_k)$. Suppose that $f(x_{k+1}) = f(x_k)$. Thus, 0 is a minimizer of $q_k$ and therefore its first-order condition reads $q_k'(0) = 0$. Given
$$q_k'(\gamma) = -\langle \nabla f(x_k - \gamma \nabla f(x_k)), \nabla f(x_k) \rangle,$$
evaluating this expression at $\gamma = 0$ gives us $q_k'(0) = -\|\nabla f(x_k)\|^2 = 0$, hence $\nabla f(x_k) = 0$, that means $x_k$ is a critical point.

Step 2. The iterates sequence is converging (up to a subsequence). Consider the sublevel set $S = \{x \in \mathbb{R}^d : f(x) \leq f(x_0)\}$. Remark that:

• $S$ is defined by an inequality involving the continuous function $f$, thus is closed;

• $f$ is coercive, thus $S$ is bounded.

Hence, $S$ is compact. Observe that, by Step 1, $x_k \in S$ for all $k$. Using the Bolzano–Weierstrass theorem, we get that the sequence has a subsequence converging towards some $x_\infty \in S$. Let us denote it by $(x_{\varphi(k)})_k$ where $\varphi$ is increasing.

Step 3. The limit of the iterates sequence is a critical point. Suppose that $x_\infty$ is not a critical point. Using Step 1, we know that $y_\infty$, defined as $y_\infty = x_\infty - \gamma_\infty \nabla f(x_\infty)$ where $\gamma_\infty$ is the oracle step at $x_\infty$, is such that $f(y_\infty) < f(x_\infty)$. Consider the sequence $y_{\varphi(k)} = x_{\varphi(k)} - \gamma_\infty \nabla f(x_{\varphi(k)})$, converging to $y_\infty$. Remark that, using the continuity of the map $x \mapsto x - \gamma_\infty \nabla f(x)$ for a continuously differentiable function $f$, we have
$$f(y_{\varphi(k)}) \to f(y_\infty).$$
By definition of the oracle step, we also have for all $k$ that
$$f(x_{\varphi(k)+1}) \leq f(y_{\varphi(k)}).$$
Hence, passing to the limit and using that $(f(x_k))_k$ is nonincreasing with limit $f(x_\infty)$, we get that
$$f(x_\infty) \leq f(y_\infty) < f(x_\infty),$$
which is a contradiction. In conclusion, $x_\infty$ is a critical point. □
Remark that if in addition $f$ is convex, any critical point is a global minimizer of $f$, so the limit point is a solution of the problem.
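As an illustration of the oracle scheme on a coercive nonconvex function, consider $f(x, y) = x^4 + y^4 - 4xy$ (my example, with critical points $(0,0)$, $(1,1)$ and $(-1,-1)$); scipy's scalar minimizer plays the role of the oracle step:

```python
import numpy as np
from scipy.optimize import minimize_scalar

f = lambda z: z[0] ** 4 + z[1] ** 4 - 4 * z[0] * z[1]
grad = lambda z: np.array([4 * z[0] ** 3 - 4 * z[1],
                           4 * z[1] ** 3 - 4 * z[0]])

z = np.array([2.0, -1.0])
for _ in range(50):
    g = grad(z)
    # Oracle step: one-dimensional minimization along -g (bounded search).
    t = minimize_scalar(lambda t: f(z - t * g), bounds=(0.0, 1.0),
                        method="bounded").x
    z = z - t * g
print(z, np.linalg.norm(grad(z)))  # near a critical point: gradient ~ 0
```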
Reminders on $L$-smooth functions
Recall that a (full-domain) differentiable function $f : \mathbb{R}^d \to \mathbb{R}$ is said to be $L$-smooth if its gradient is $L$-Lipschitz continuous, i.e., for all $x, y \in \mathbb{R}^d$, $\|\nabla f(x) - \nabla f(y)\| \leq L \|x - y\|$.
Proposition 5. Let $f$ be an $L$-smooth function. Then, for all $x, y \in \mathbb{R}^d$,
$$\left| f(y) - f(x) - \langle \nabla f(x), y - x \rangle \right| \leq \frac{L}{2} \|y - x\|^2. \quad (4)$$
Moreover, when $f$ is twice differentiable, (4) is equivalent to
$$-L \, \mathrm{Id} \preceq \nabla^2 f(x) \preceq L \, \mathrm{Id} \quad \text{for all } x. \quad (5)$$
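The inequality (4) can be probed numerically; here on the convex function $f(x) = \log(1 + e^{w^\top x})$, which is $L$-smooth with $L = \|w\|^2 / 4$ (a standard fact; the example is mine):

```python
import numpy as np

rng = np.random.default_rng(5)
w = rng.standard_normal(4)
L = w @ w / 4                                # Lipschitz constant of the gradient

f = lambda x: np.logaddexp(0.0, w @ x)       # log(1 + exp(w.x)), stable form
grad = lambda x: w / (1 + np.exp(-(w @ x)))  # sigmoid(w.x) * w

for _ in range(1000):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    lhs = abs(f(y) - f(x) - grad(x) @ (y - x))
    assert lhs <= L / 2 * np.sum((y - x) ** 2) + 1e-12
print("descent lemma (4) verified on 1000 random pairs")
```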
Smooth function bounded from below
The previous result was rather weak. Consider now a function $f$ that is $L$-smooth and bounded from below, i.e., $f^\star := \inf f > -\infty$ (the infimum need not be attained).

Proposition 6. Let $f$ be $L$-smooth and bounded from below by $f^\star$. Then the gradient descent iterates with constant step-size $\gamma = \frac{2\delta}{L}$, $\delta \in (0, 1)$, satisfy
$$\min_{0 \leq i \leq k} \|\nabla f(x_i)\|^2 \leq \frac{L \left( f(x_0) - f^\star \right)}{2 \delta (1 - \delta)(k + 1)}.$$
Proof. (From Nesterov, p. 29.)
Step 1. Bound on one step of gradient descent. For $\gamma > 0$, let $x^+ = x - \gamma \nabla f(x)$. Using Proposition 5, we have that
$$f(x^+) \leq f(x) + \langle \nabla f(x), x^+ - x \rangle + \frac{L}{2} \|x^+ - x\|^2 = f(x) - \gamma \left( 1 - \frac{L \gamma}{2} \right) \|\nabla f(x)\|^2.$$
To ensure that $f(x^+) < f(x)$, we need to require that $\gamma (1 - L\gamma/2) > 0$, that is $\gamma < 2/L$. We can parameterize $\gamma$ as $\gamma = \frac{2\delta}{L}$ with $\delta \in (0, 1)$. Hence, we get the descent
$$f(x^+) \leq f(x) - \frac{2 \delta (1 - \delta)}{L} \|\nabla f(x)\|^2. \quad (6)$$
Step 2. Convergence of 1st-order optimality. Applying (6) to $x = x_i$ and $x^+ = x_{i+1}$, we get that
$$f(x_{i+1}) \leq f(x_i) - \frac{2 \delta (1 - \delta)}{L} \|\nabla f(x_i)\|^2. \quad (7)$$
Summing inequality (7) for $i = 0, \ldots, k$, we obtain
$$\frac{2 \delta (1 - \delta)}{L} \sum_{i=0}^{k} \|\nabla f(x_i)\|^2 \leq \sum_{i=0}^{k} \left( f(x_i) - f(x_{i+1}) \right).$$
Observing that the r.h.s. telescopes, we have
$$\frac{2 \delta (1 - \delta)}{L} \sum_{i=0}^{k} \|\nabla f(x_i)\|^2 \leq f(x_0) - f(x_{k+1}) \leq f(x_0) - f^\star \quad (8)$$
by definition of $f^\star$. Since (8) is true for all $k$, we get that $\|\nabla f(x_k)\|$ converges towards 0.
Step 3. Rate of convergence. Using the fact that for all $i \leq k$, we have $\min_{0 \leq j \leq k} \|\nabla f(x_j)\|^2 \leq \|\nabla f(x_i)\|^2$, we have from (8) that
$$(k + 1) \min_{0 \leq i \leq k} \|\nabla f(x_i)\|^2 \leq \frac{L \left( f(x_0) - f^\star \right)}{2 \delta (1 - \delta)}.$$
Hence, the claimed result. □
Note that we cannot say anything in this context about the rate of convergence of the iterates $(x_k)$ or of the values $(f(x_k))$: without further structure, only the gradient norms are controlled.
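A quick illustration of Proposition 6 on a smooth nonconvex function, $f(x) = \sum_i \cos(x_i)$, which is bounded below and 1-smooth (the example and the starting point are mine):

```python
import numpy as np

grad = lambda x: -np.sin(x)   # f(x) = sum(cos(x)) is 1-smooth and >= -d
L = 1.0

x, best = np.array([0.5, 1.7, -2.3]), np.inf
for k in range(100):
    g = grad(x)
    best = min(best, g @ g)   # tracks min_{i <= k} ||grad f(x_i)||^2
    x = x - (1 / L) * g       # gamma = 1/L, i.e., delta = 1/2
print(best)                   # tends to 0, at worst like O(1/k)
```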
6 Convergence of gradient descent for convex functions
Co-coercivity of the gradient of a convex $L$-smooth function
Convex $L$-smooth functions enjoy a stronger property than the Lipschitz continuity of their gradient, known as co-coercivity.

Proposition 7. Let $f$ be convex and $L$-smooth. Then, for all $x, y \in \mathbb{R}^d$,
$$\langle \nabla f(x) - \nabla f(y), x - y \rangle \geq \frac{1}{L} \|\nabla f(x) - \nabla f(y)\|^2.$$
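Co-coercivity can also be checked numerically, on the same logistic-type function as before (convex and $L$-smooth with $L = \|w\|^2 / 4$):

```python
import numpy as np

rng = np.random.default_rng(6)
w = rng.standard_normal(3)
L = w @ w / 4
grad = lambda x: w / (1 + np.exp(-(w @ x)))  # gradient of log(1 + e^{w.x})

for _ in range(1000):
    x, y = rng.standard_normal(3), rng.standard_normal(3)
    d = grad(x) - grad(y)
    assert d @ (x - y) >= (d @ d) / L - 1e-12
print("co-coercivity verified on 1000 random pairs")
```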
Convex $L$-smooth functions
When $f$ is convex and $L$-smooth, the gradient descent method converges in value at a sublinear rate $O(1/k)$, as the following proposition shows (a numerical check follows the proof).

Proposition 8. Let $f$ be convex and $L$-smooth, with at least one minimizer $x^\star$. Then the gradient descent iterates with constant step-size $\gamma = \frac{1}{L}$ satisfy
$$f(x_k) - f(x^\star) \leq \frac{2 L \|x_0 - x^\star\|^2}{k + 4}.$$
Proof. (From Nesterov, p. 80.)
Let $x^\star$ be a minimizer of $f$. We consider the lack of optimality through $r_k = \|x_k - x^\star\|$. We have
$$r_{k+1}^2 = \|x_k - \gamma \nabla f(x_k) - x^\star\|^2 = r_k^2 - 2 \gamma \langle \nabla f(x_k), x_k - x^\star \rangle + \gamma^2 \|\nabla f(x_k)\|^2.$$
Using the co-coercivity of the gradient (Proposition 7) between $x_k$ and $x^\star$ (recall that $\nabla f(x^\star) = 0$), we have that
$$\langle \nabla f(x_k), x_k - x^\star \rangle \geq \frac{1}{L} \|\nabla f(x_k)\|^2.$$
Thus,
$$r_{k+1}^2 \leq r_k^2 - \gamma \left( \frac{2}{L} - \gamma \right) \|\nabla f(x_k)\|^2.$$
In particular, we get that $r_{k+1} \leq r_k$ for all $k$ as soon as $\gamma \leq 2/L$.
Using the smoothness of $f$, Proposition 5 gives us the bound (as in the proof of Proposition 6)
$$\Delta_{k+1} \leq \Delta_k - \omega \|\nabla f(x_k)\|^2, \quad \text{where } \Delta_k = f(x_k) - f(x^\star) \text{ and } \omega = \gamma \left( 1 - \frac{L \gamma}{2} \right).$$
Now, using the differential characterization of convexity, we have
$$\Delta_k \leq \langle \nabla f(x_k), x_k - x^\star \rangle.$$
Using the Cauchy–Schwarz inequality, we have
$$\Delta_k \leq \|\nabla f(x_k)\| \, r_k \leq \|\nabla f(x_k)\| \, r_0.$$
Hence, we have
$$\Delta_{k+1} \leq \Delta_k - \frac{\omega}{r_0^2} \Delta_k^2,$$
where $\omega = \gamma (1 - L\gamma/2)$. Multiplying both sides by $\frac{1}{\Delta_k \Delta_{k+1}}$, we get that
$$\frac{1}{\Delta_{k+1}} \geq \frac{1}{\Delta_k} + \frac{\omega}{r_0^2} \cdot \frac{\Delta_k}{\Delta_{k+1}}.$$
Since $(\Delta_k)$ is nonincreasing, we have $\frac{\Delta_k}{\Delta_{k+1}} \geq 1$, hence
$$\frac{1}{\Delta_{k+1}} \geq \frac{1}{\Delta_k} + \frac{\omega}{r_0^2}.$$
Summing $k$ inequalities of this type, we obtain
$$\frac{1}{\Delta_k} \geq \frac{1}{\Delta_0} + \frac{k \omega}{r_0^2}.$$
We conclude using a bit of computation. Maximizing the descent factor $\omega = \gamma (1 - L\gamma/2)$ gives the optimal step-size $\gamma = \frac{1}{L}$, for which $\omega = \frac{1}{2L}$. Injecting this value, we get that
$$\frac{1}{\Delta_k} \geq \frac{1}{\Delta_0} + \frac{k}{2 L r_0^2}.$$
Using the smoothness Proposition 5 at $x_0$ and $x^\star$ (where the gradient vanishes), we have $\Delta_0 \leq \frac{L}{2} r_0^2$, i.e., $\frac{1}{\Delta_0} \geq \frac{2}{L r_0^2}$. Thus,
$$\frac{1}{\Delta_k} \geq \frac{2}{L r_0^2} + \frac{k}{2 L r_0^2} = \frac{k + 4}{2 L r_0^2}, \quad \text{i.e.,} \quad \Delta_k \leq \frac{2 L \|x_0 - x^\star\|^2}{k + 4},$$
which is the claimed result. □
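A numerical check of Proposition 8 on a convex quadratic with a singular Hessian, so that the linear-rate analysis does not apply but the $O(1/k)$ bound does (the construction is illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)
d = 20
H = np.diag(np.linspace(0.0, 1.0, d))   # convex, L = 1, singular Hessian
L = 1.0
x_star = rng.standard_normal(d)         # one minimizer among many
f_gap = lambda x: 0.5 * (x - x_star) @ H @ (x - x_star)

x = x_star + rng.standard_normal(d)     # x_0
r0_sq = np.sum((x - x_star) ** 2)
for k in range(1, 101):
    x = x - (1 / L) * H @ (x - x_star)  # gradient step, gamma = 1/L
    assert f_gap(x) <= 2 * L * r0_sq / (k + 4)
print("f(x_k) - f* <= 2 L ||x0 - x*||^2 / (k + 4) for k = 1..100")
```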
Strongly convex $L$-smooth functions
When $f$ is, in addition, $\mu$-strongly convex, the sublinear rate of the previous section improves to a linear one.
We now prove the linear convergence of gradient descent when dealing with strongly convex functions.
Bach, Francis (2021). Learning Theory from First Principles.
Bellman, Richard E. (1961). Adaptive Control Processes, Princeton University Press.
Donoho, David L. (2000). High-Dimensional Data Analysis: The Curses and Blessings of Dimensionality, Citeseer.
Nesterov, Yurii (2004). Introductory Lectures on Convex Optimization: A Basic Course, Kluwer Academic Publishers.