Stochastic gradient descent (SGD) is the workhorse of modern machine learning. Almost all recent achievements (since the 2010s) in computer vision and natural language processing are somehow connected to the use of SGD (or one of its many generalizations). Despite this modern aspect, SGD takes its roots in the seminal paper of (Robbins and Monro, 1951), more than 70 years ago. The key insight of Robbins and Monro is that first-order methods are robust. What does this mean? Gradient descent does not need an exact computation of the gradient: it only needs the computation to be correct on average, up to some “stochastic” errors. We will formalize this idea in this chapter, again inspired by the lecture notes of (Bach, 2021). A key insight from the seminal paper of (Bottou and Bousquet, 2007) is that in machine learning, high precision is not required, since it is meaningless to optimize below the statistical error. This paper considerably revived the interest in SGD, especially for empirical risk minimization.
1 Stochastic approximation
Let $(\Xi, \mathcal{A}, P)$ be a probability space. We consider the following stochastic approximation problem
$$\min_{x \in \mathbb{R}^d} \; F(x) := \mathbb{E}_{\xi \sim P}\big[f(x, \xi)\big]. \qquad (1)$$
Observe that (1) encompasses two important problems of supervised machine learning: expected risk minimization and empirical risk minimization. Let us recall basic facts on risk minimization. We are given observations $(a_i, b_i)$, for $i = 1, \dots, n$, couples of features and responses (these variables are typically denoted $x$ and $y$ in ML courses, but we want to keep those letters free for the optimization variable), and the goal is to predict a new response $b$ given a new feature $a$. Consider now the test distribution $P$ of the pairs $(a, b)$ and a loss function $\ell$.
• Expected risk minimization. The expected risk (or simply risk) of a predictor $\varphi$ is the quantity
$$\mathcal{R}(\varphi) := \mathbb{E}_{(a, b) \sim P}\big[\ell(b, \varphi(a))\big].$$
This minimization problem fits the framework of (1) with $\xi = (a, b)$, $f(x, (a, b)) = \ell(b, \varphi_x(a))$ for a family of predictors $(\varphi_x)_x$ parameterized by the optimization variable $x$, and $P$ is the joint probability on the pairs $(a, b)$.
• Empirical risk minimization. The practical counterpart of the expected risk is the empirical risk, defined as
$$\widehat{\mathcal{R}}(\varphi) := \frac{1}{n} \sum_{i=1}^{n} \ell(b_i, \varphi(a_i)).$$
Again, this minimization problem fits the framework of (1) with $\xi = (a, b)$ and $f(x, (a, b)) = \ell(b, \varphi_x(a))$, but now $P$ is the empirical distribution based on the observed data $(a_i, b_i)_{i=1}^{n}$. A toy numerical illustration of these two settings is given below.
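To make the parallel concrete, here is a minimal numerical sketch, assuming a linear least-squares model with loss $\ell(b, \varphi_x(a)) = \frac{1}{2}(\langle a, x\rangle - b)^2$ (a hypothetical choice): the empirical risk is nothing but the expectation of the per-sample loss under the empirical distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: linear least squares, f(x, (a, b)) = 0.5 * (<a, x> - b)^2.
d, n = 5, 1000
x_true = rng.normal(size=d)
A = rng.normal(size=(n, d))                 # features a_i
b = A @ x_true + 0.1 * rng.normal(size=n)   # responses b_i

def f(x, a, b_val):
    """Per-sample loss f(x, xi) with xi = (a, b)."""
    return 0.5 * (a @ x - b_val) ** 2

def empirical_risk(x):
    """Empirical risk: average of f over the n observed samples."""
    return np.mean([f(x, A[i], b[i]) for i in range(n)])

# The empirical risk is an expectation over xi drawn uniformly among the data,
# so a single random sample already gives an unbiased (if noisy) estimate of it.
x0 = np.zeros(d)
i = rng.integers(n)
print(empirical_risk(x0), f(x0, A[i], b[i]))
```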
2 Stochastic gradient descent
We suppose that we have access to a stochastic estimation of the gradient $\nabla F$: at iteration $k$, we can draw $\xi_k \sim P$ and compute $g(x_k, \xi_k)$, an estimate of $\nabla F(x_k)$. Starting from some $x_1 \in \mathbb{R}^d$, the stochastic gradient algorithm iterates
$$x_{k+1} = x_k - \gamma_k \, g(x_k, \xi_k), \qquad k \geq 1, \qquad (2)$$
where $(\gamma_k)_{k \geq 1}$ is a sequence of positive step sizes.
The stochastic gradient algorithm is NOT a descent method, in the sense that the direction $-g(x_k, \xi_k)$ provided by the oracle is not necessarily a descent direction as defined, for instance, in our lecture on gradient descent.
Two situations:
• Stochastic approximation. Here we have access to an additive noisy version of the “true” gradient:
$$g(x, \xi) = \nabla F(x) + \xi,$$
where $\xi$ is a zero-mean random perturbation.
• Learning from samples. On the contrary, here we have for instance that $\xi$ is a uniform choice of an index $i$ between $1$ and $n$:
$$g(x, i) = \nabla f_i(x), \qquad i \sim \mathcal{U}(\{1, \dots, n\}),$$
where $F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x)$. A minimal sketch of these two oracles is given below.
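Here is a minimal sketch of both oracles, assuming (as a hypothetical example) a least-squares empirical risk and Gaussian additive noise; averaging many oracle calls recovers the full gradient, which is the unbiasedness property formalized next.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 5, 1000
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def full_gradient(x):
    """Exact gradient of the least-squares empirical risk F."""
    return A.T @ (A @ x - b) / n

def oracle_additive_noise(x, sigma=0.1):
    """Stochastic approximation: true gradient plus zero-mean noise."""
    return full_gradient(x) + sigma * rng.normal(size=x.shape)

def oracle_sampling(x):
    """Learning from samples: gradient of a single term f_i, i uniform in {1, ..., n}."""
    i = rng.integers(n)
    return A[i] * (A[i] @ x - b[i])

# Both oracles are unbiased: averaging many calls recovers the full gradient.
x0 = np.zeros(d)
est = np.mean([oracle_sampling(x0) for _ in range(20000)], axis=0)
print(np.linalg.norm(est - full_gradient(x0)))  # small compared to the gradient norm
```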
The standard assumption when dealing with SGD is access to an unbiased estimator of the gradient.
Assumption 1.
Unbiasedness of the stochastic gradient estimate
$$\mathbb{E}_{\xi}\big[g(x, \xi)\big] = \nabla F(x) \quad \text{for all } x \in \mathbb{R}^d. \qquad (3)$$
Note that it is possible to deal with biased oracles, but this goes beyond the scope of this course.
We are going to study the convergence of SGD in expectation, which is the main way stochastic convergence is analyzed in the machine learning community. Nevertheless, the stochastic approximation literature, starting from (Robbins and Siegmund, 1971), is often concerned with almost sure convergence based on martingale tools.
For convex functions, we will assume that the noisy gradients are bounded almost surely.
Assumption 2.
Bounded gradient
$$\|g(x, \xi)\| \leq B \quad \text{almost surely, for all } x \in \mathbb{R}^d. \qquad (4)$$
The simplest strategy to choose the step size, as for batch gradient descent, is to fix some constant $\gamma_k \equiv \gamma > 0$. Unfortunately, this method does not converge towards a minimizer of (1)! It is nevertheless the most frequent use of SGD in practice. Why? Because, despite the nonconvergence, it is possible to show that the iterates are “close enough” to the true minimizer in the following sense (even if $F$ is not convex).
Theorem 1.
Convergence up to a ball of SGD
A consequence is that, as $\gamma$ goes to $0$, the expected squared norm of the gradient at a point uniformly chosen from the trajectory of SGD lies in a ball centered at $0$ whose radius goes to $0$ with $\gamma$.
This is illustrated in the video below
<video autoplay muted src="adding-some-noise-anim_iterates.mp4"></video>
The situation is radically different from the batch mode, where the asymptotic regime produces a true solution.
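The phenomenon can be reproduced numerically. Below is a minimal sketch, assuming a least-squares empirical risk (a hypothetical choice): with a constant step size, the final iterate stalls at a distance from the minimizer that shrinks as the step size decreases, but never converges to it.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 5, 1000
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer of the empirical risk

def sgd_constant_step(gamma, iters=20000):
    """SGD with a constant step size on the least-squares empirical risk."""
    x = np.zeros(d)
    for _ in range(iters):
        i = rng.integers(n)
        g = A[i] * (A[i] @ x - b[i])   # stochastic gradient of one term
        x -= gamma * g
    return np.linalg.norm(x - x_star)

# Distance to the minimizer after many iterations, for several constant step sizes:
for gamma in [1e-1, 1e-2, 1e-3]:
    print(gamma, sgd_constant_step(gamma))
```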
Proof.
Proof of Theorem 1
3 Convergence rate in the convex setting
We now move away from the constant step-size framework and study what happens when $\gamma_k$ varies across iterations. The following theorem gives the rate of SGD in the convex case.
Theorem 2.
Rate of (2) for convex functions
Assume that $F$ is convex, that Assumptions 1 and 2 hold, and let $x_\star$ be a minimizer of (1) with $\|x_1 - x_\star\| \leq R$. Then the iterates of (2) with step sizes $\gamma_k = \gamma / \sqrt{k}$ and $\gamma = R / B$ satisfy
$$\mathbb{E}\big[F(\bar{x}_K)\big] - F(x_\star) \leq \frac{R B \,(2 + \log K)}{2 \sqrt{K}}, \qquad \text{where } \bar{x}_K := \frac{\sum_{k=1}^{K} \gamma_k x_k}{\sum_{k=1}^{K} \gamma_k}.$$
Proof.
Let $x_\star$ be a minimizer of $F$ and write $g_k := g(x_k, \xi_k)$. We have the following decomposition
$$\|x_{k+1} - x_\star\|^2 = \|x_k - \gamma_k g_k - x_\star\|^2 = \|x_k - x_\star\|^2 - 2 \gamma_k \langle g_k, x_k - x_\star \rangle + \gamma_k^2 \|g_k\|^2.$$
Using the linearity of the expectation, we get
$$\mathbb{E}\|x_{k+1} - x_\star\|^2 = \mathbb{E}\|x_k - x_\star\|^2 - 2 \gamma_k \mathbb{E}\langle g_k, x_k - x_\star \rangle + \gamma_k^2 \mathbb{E}\|g_k\|^2.$$
We keep the first term as it is; the last term is bounded by
$$\gamma_k^2 \mathbb{E}\|g_k\|^2 \leq \gamma_k^2 B^2. \qquad \text{(bounded gradient)}$$
To bound the middle term, we use the total law of conditional expectation:
$$\begin{aligned}
\mathbb{E}\langle g_k, x_k - x_\star \rangle &= \mathbb{E}\big[\mathbb{E}[\langle g_k, x_k - x_\star \rangle \mid x_k]\big] && \text{(total law)} \\
&= \mathbb{E}\big[\langle \mathbb{E}[g_k \mid x_k], x_k - x_\star \rangle\big] && \text{(linearity of cond. expectation)} \\
&= \mathbb{E}\big[\langle \nabla F(x_k), x_k - x_\star \rangle\big] && \text{(unbiasedness)} \\
&\geq \mathbb{E}\big[F(x_k) - F(x_\star)\big],
\end{aligned}$$
where the last inequality comes from the differential characterization of a convex function. Combining these three terms gives
$$\mathbb{E}\|x_{k+1} - x_\star\|^2 \leq \mathbb{E}\|x_k - x_\star\|^2 - 2 \gamma_k \mathbb{E}\big[F(x_k) - F(x_\star)\big] + \gamma_k^2 B^2. \qquad (5)$$
Reorganizing the terms, we have
$$2 \gamma_k \mathbb{E}\big[F(x_k) - F(x_\star)\big] \leq \mathbb{E}\|x_k - x_\star\|^2 - \mathbb{E}\|x_{k+1} - x_\star\|^2 + \gamma_k^2 B^2.$$
We observe that summing over the iterations leads to a natural telescoping sum on the first term of the right-hand side:
$$2 \sum_{k=1}^{K} \gamma_k \mathbb{E}\big[F(x_k) - F(x_\star)\big] \leq \|x_1 - x_\star\|^2 - \mathbb{E}\|x_{K+1} - x_\star\|^2 + B^2 \sum_{k=1}^{K} \gamma_k^2.$$
Since $\mathbb{E}\|x_{K+1} - x_\star\|^2$ is nonnegative, we have the cruder bound
$$2 \sum_{k=1}^{K} \gamma_k \mathbb{E}\big[F(x_k) - F(x_\star)\big] \leq \|x_1 - x_\star\|^2 + B^2 \sum_{k=1}^{K} \gamma_k^2.$$
Dividing by $2 \sum_{k=1}^{K} \gamma_k$, we obtain
$$\frac{\sum_{k=1}^{K} \gamma_k \mathbb{E}\big[F(x_k) - F(x_\star)\big]}{\sum_{k=1}^{K} \gamma_k} \leq \frac{\|x_1 - x_\star\|^2 + B^2 \sum_{k=1}^{K} \gamma_k^2}{2 \sum_{k=1}^{K} \gamma_k}. \qquad (6)$$
The left-hand side of (6) is an upper bound of $\mathbb{E}\big[F(\bar{x}_K)\big] - F(x_\star)$, where $\bar{x}_K := \frac{\sum_{k=1}^{K} \gamma_k x_k}{\sum_{k=1}^{K} \gamma_k}$ is the weighted average of the iterates. Indeed,
$$\mathbb{E}\big[F(\bar{x}_K)\big] - F(x_\star) \leq \frac{\sum_{k=1}^{K} \gamma_k \mathbb{E}\big[F(x_k) - F(x_\star)\big]}{\sum_{k=1}^{K} \gamma_k},$$
where the inequality comes from Jensen’s inequality applied to the convex function $F$.
We now look at an upper bound of the right-hand side. We need two things:
• The sum $\sum_{k=1}^{K} \gamma_k$ needs to diverge to $+\infty$ so that the first term goes to 0.
• The sum $\sum_{k=1}^{K} \gamma_k$ needs to diverge more “quickly” than the sum of squares $\sum_{k=1}^{K} \gamma_k^2$ so that the second term also goes to 0.
A typical example of such a sequence is $\gamma_k = \gamma / \sqrt{k}$ for some $\gamma > 0$. Indeed, we have
$$\sum_{k=1}^{K} \gamma_k = \gamma \sum_{k=1}^{K} \frac{1}{\sqrt{k}} \qquad \text{and} \qquad \sum_{k=1}^{K} \gamma_k^2 = \gamma^2 \sum_{k=1}^{K} \frac{1}{k}.$$
Using the fact that
$$\sum_{k=1}^{K} \frac{1}{\sqrt{k}} \geq \sum_{k=1}^{K} \frac{1}{\sqrt{K}} = \sqrt{K},$$
we have that
$$\sum_{k=1}^{K} \gamma_k \geq \gamma \sqrt{K}.$$
We bound the harmonic series by $1 + \log K$, hence
$$\mathbb{E}\big[F(\bar{x}_K)\big] - F(x_\star) \leq \frac{\|x_1 - x_\star\|^2 + \gamma^2 B^2 (1 + \log K)}{2 \gamma \sqrt{K}}.$$
We obtain the claimed bound by setting $\gamma = R / B$. ∎
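The $1/\sqrt{K}$ behaviour (up to the logarithmic factor) can be observed numerically. Below is a minimal sketch, assuming a least-squares empirical risk (a hypothetical choice), running (2) with $\gamma_k = \gamma/\sqrt{k}$ and tracking the weighted average of the iterates as in Theorem 2.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 5, 1000
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

def F(x):
    """Empirical risk (least squares)."""
    return 0.5 * np.mean((A @ x - b) ** 2)

def averaged_sgd(K, gamma=0.1):
    """SGD with step gamma_k = gamma / sqrt(k), returning the weighted average of the iterates."""
    x = np.zeros(d)
    x_avg, weight_sum = np.zeros(d), 0.0
    for k in range(1, K + 1):
        gamma_k = gamma / np.sqrt(k)
        weight_sum += gamma_k
        x_avg += gamma_k * (x - x_avg) / weight_sum   # running weighted average of x_1, ..., x_K
        i = rng.integers(n)
        x -= gamma_k * A[i] * (A[i] @ x - b[i])        # one SGD step
    return x_avg

# The suboptimality of the averaged iterate decreases roughly like 1/sqrt(K):
for K in [100, 1000, 10000]:
    print(K, F(averaged_sgd(K)) - F(x_star))
```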
Several remarks need to be stated:
• As claimed before, SGD is not a descent method. Moreover, we proved here the convergence in expectation of the function values at the averaged iterates $\bar{x}_K$. Note that almost-sure convergence can be proved, but with more involved arguments; see for instance (Sebbouh, Gower and Defazio, 2021) and references therein.
• It is possible to replace the noisy gradient step by a projected noisy gradient step, i.e., using the algorithm defined by
$$x_{k+1} = \operatorname{proj}_C\big(x_k - \gamma_k \, g(x_k, \xi_k)\big),$$
where $C$ is a closed convex constraint set. The obtained results are exactly the same (a minimal sketch of the projected step is given after these remarks).
• It is also possible to remove the assumption that $x_1$ lives in a neighborhood of an optimal solution and require instead that $F$ is $L$-smooth. Note that in this case, we do not obtain better results (contrary to the nonstochastic setting).
• The $O(1/\sqrt{K})$ bound can be shown to be optimal (Agarwal, Bartlett, Ravikumar and Wainwright, 2012, Theorem 1), at least in the Lipschitz setting.
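As announced, here is a minimal sketch of the projected noisy gradient step, assuming (as a hypothetical choice) that the constraint set $C$ is a Euclidean ball, for which the projection has a closed form.

```python
import numpy as np

def project_onto_ball(x, radius=1.0):
    """Euclidean projection onto the ball {x : ||x|| <= radius}."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def projected_sgd_step(x, stochastic_grad, gamma, radius=1.0):
    """One projected noisy gradient step: gradient move, then projection onto C."""
    return project_onto_ball(x - gamma * stochastic_grad(x), radius)
```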
4 Minibatch Stochastic Gradient Descent
The minibatch idea is to make a tradeoff between the fast rate of gradient descent (whose iteration cost grows linearly with $n$) and the slow rate of SGD (whose iteration cost is independent of $n$): we reduce the variance of the noise, but have a higher running time per iteration. Minibatching can also take advantage of parallel architectures such as GPUs or SIMD instructions on CPUs. A minibatch is built upon $m$ calls to the stochastic oracle, and its associated SGD step is written as
$$x_{k+1} = x_k - \frac{\gamma_k}{m} \sum_{j=1}^{m} g(x_k, \xi_{k,j}),$$
where $\xi_{k,1}, \dots, \xi_{k,m}$ are i.i.d. draws from $P$.
Observe that $\frac{1}{m} \sum_{j=1}^{m} g(x, \xi_j)$ is an unbiased estimator of the gradient of $F$; indeed,
$$\begin{aligned}
\mathbb{E}\Big[\frac{1}{m} \sum_{j=1}^{m} g(x, \xi_j)\Big] &= \frac{1}{m} \sum_{j=1}^{m} \mathbb{E}\big[g(x, \xi_j)\big] && \text{by linearity} \\
&= \nabla F(x) && \text{by unbiasedness.}
\end{aligned}$$
The minibatch gradient is also bounded; indeed, by the triangle inequality,
$$\Big\|\frac{1}{m} \sum_{j=1}^{m} g(x, \xi_j)\Big\| \leq \frac{1}{m} \sum_{j=1}^{m} \big\|g(x, \xi_j)\big\| \leq B.$$
We can hence apply Theorem 2. A lot of effort has been devoted in recent years to determining the optimal minibatch size, see e.g. (Gower, Loizou, Qian, Sailanbayev, Shulgin and Richtárik, 2019), or to deriving schedule policies with an increasing minibatch size.
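Below is a minimal sketch of the minibatch step, again assuming a least-squares empirical risk (a hypothetical choice); each iteration averages $m$ stochastic gradients before taking the decreasing step of Theorem 2.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 5, 1000
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.5 * rng.normal(size=n)
x_star = np.linalg.lstsq(A, b, rcond=None)[0]

def minibatch_sgd(m, iters=2000, gamma=0.1):
    """SGD where each step averages m stochastic gradients (m oracle calls per iteration)."""
    x = np.zeros(d)
    for k in range(1, iters + 1):
        idx = rng.integers(n, size=m)                 # m oracle calls
        g = A[idx].T @ (A[idx] @ x - b[idx]) / m      # averaged stochastic gradient
        x -= gamma / np.sqrt(k) * g                   # decreasing step, as in Theorem 2
    return x

# Larger minibatches reduce the variance of the gradient estimate, at m times the cost per step:
for m in [1, 10, 100]:
    print(m, np.linalg.norm(minibatch_sgd(m) - x_star))
```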
Agarwal, Alekh and Bartlett, Peter L. and Ravikumar, Pradeep and Wainwright, Martin J. (2012). Information-Theoretic Lower Bounds on the Oracle Complexity of Stochastic Convex Optimization.
Bach, Francis (2021). Learning Theory from First Principles.
Bottou, Léon and Bousquet, Olivier (2007). The Tradeoffs of Large Scale Learning, Curran Associates, Inc..
Gower, Robert Mansel and Loizou, Nicolas and Qian, Xun and Sailanbayev, Alibek and Shulgin, Egor and Richtárik, Peter (2019). SGD: General Analysis and Improved Rates, PMLR.
Robbins, H. and Siegmund, D. (1971). A Convergence Theorem for Non Negative Almost Supermartingales and Some Applications, Academic Press.
Robbins, Herbert and Monro, Sutton (1951). A Stochastic Approximation Method, Institute of Mathematical Statistics.
Sebbouh, Othmane and Gower, Robert M. and Defazio, Aaron (2021). Almost Sure Convergence Rates for Stochastic Gradient Descent and Stochastic Heavy Ball, PMLR.