Adding some noise
February 9th, 2024
Stochastic gradient descent (SGD) is the workhorse of modern machine learning: almost all recent achievements in the field rely on it or one of its variants.
1 Stochastic approximation
Let $F : \mathbb{R}^d \to \mathbb{R}$ be a differentiable function. We consider the problem
\[ \min_{x \in \mathbb{R}^d} F(x), \tag{1} \]
where we only have access to noisy estimates of the gradient of $F$.
Observe that (1) encompasses two important problems of supervised machine learning: expected risk minimization and empirical risk minimization. Let us recall some basic facts on risk minimization. Given observations $(a_i, b_i)$ (couples of features and responses), our goal is to learn a predictor $h_x$, parametrized by $x \in \mathbb{R}^d$, such that $h_x(a)$ is a good prediction of the response $b$ associated to the features $a$, as measured by a loss function $\ell$.
• Expected risk minimization. The expected risk (or simply risk) of a predictor $h_x$ is the quantity
\[ \mathcal{R}(x) = \mathbb{E}_{(a,b)}\big[ \ell(b, h_x(a)) \big] . \]
This minimization problem fits the framework of (1) with $F = \mathcal{R}$.
• Empirical risk minimization. The practical counterpart of the expected risk is the empirical risk defined as
\[ \hat{\mathcal{R}}_n(x) = \frac{1}{n} \sum_{i=1}^{n} \ell(b_i, h_x(a_i)) . \]
Again, this minimization problem fits the framework of (1) with $F = \hat{\mathcal{R}}_n$.
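To make these two objects concrete, here is a minimal Python sketch, assuming a linear predictor $h_x(a) = \langle a, x \rangle$ and the squared loss (both are illustrative choices, as are all names and parameter values):

```python
import numpy as np

# Illustrative setting: linear predictor h_x(a) = <a, x> and squared loss
# l(b, p) = (p - b)^2 / 2, so that f_i(x) = (<a_i, x> - b_i)^2 / 2.

def empirical_risk(x, A, b):
    """Empirical risk (1/n) sum_i l(b_i, h_x(a_i)) for the squared loss."""
    residuals = A @ x - b
    return 0.5 * np.mean(residuals ** 2)

def empirical_risk_grad(x, A, b):
    """Batch gradient of the empirical risk: (1/n) A^T (A x - b)."""
    return A.T @ (A @ x - b) / len(b)

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))                                # features a_i (rows)
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)  # responses b_i
print(empirical_risk(np.zeros(d), A, b), empirical_risk_grad(np.zeros(d), A, b))
```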
2 Stochastic gradient descent
We suppose that we have access to a stochastic estimation $g(x, \xi)$ of the gradient $\nabla F(x)$, where $\xi$ is a random variable. The stochastic gradient algorithm then reads
\[ x_{k+1} = x_k - \gamma_k\, g(x_k, \xi_k), \tag{2} \]
where $(\gamma_k)$ is a sequence of step sizes and the $\xi_k$ are i.i.d. copies of $\xi$. The stochastic gradient algorithm is NOT a descent method: because of the randomness of the oracle, $F(x_{k+1})$ may be larger than $F(x_k)$.
Two situations:
• Stochastic approximation. Here we have access to an additive noisy version of the “true” gradient:
\[ g(x_k, \xi_k) = \nabla F(x_k) + \xi_k , \]
where $\xi_k$ is, e.g., a zero-mean random vector.
• Learning from samples. On the contrary, here we have for instance $F = \frac{1}{n} \sum_{i=1}^{n} f_i$ and
\[ g(x_k, \xi_k) = \nabla f_{\xi_k}(x_k), \]
that is, a uniform choice between $\nabla f_1(x_k), \dots, \nabla f_n(x_k)$, where $\xi_k$ is uniform on $\{1, \dots, n\}$.
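Both situations can be illustrated on the least-squares example of the previous section; the following Python sketch is illustrative (the noise level and the names grad_F, oracle_additive, oracle_sampling are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_F(x):
    """True gradient of F(x) = (1/(2n)) ||A x - b||^2."""
    return A.T @ (A @ x - b) / n

def oracle_additive(x, sigma=0.1):
    """Stochastic approximation: the true gradient plus zero-mean noise."""
    return grad_F(x) + sigma * rng.standard_normal(d)

def oracle_sampling(x):
    """Learning from samples: g(x, xi) = grad f_xi(x) with xi uniform on
    {1, ..., n} and f_i(x) = (<a_i, x> - b_i)^2 / 2."""
    i = rng.integers(n)
    return A[i] * (A[i] @ x - b[i])

x = np.zeros(d)
print(grad_F(x), oracle_additive(x), oracle_sampling(x), sep="\n")
```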
The standard assumption when dealing with SGD is the access to an unbiased estimator of the gradient:
\[ \mathbb{E}_{\xi}\big[ g(x, \xi) \big] = \nabla F(x) \quad \text{for all } x . \tag{3} \]
Note that it is possible to deal with biased oracles, but this goes beyond the scope of this course.
We are going to study the convergence of SGD in expectation, which is the main way stochastic convergence is analyzed in the machine learning community. Nevertheless, the stochastic approximation literature, starting from (Robbins and Siegmund, 1971), is often concerned with almost sure convergence, based on martingale tools.
We will also assume that the noisy gradient has a bounded second moment: there exists $B > 0$ such that
\[ \mathbb{E}_{\xi}\big[ \|g(x, \xi)\|^2 \big] \le B^2 \quad \text{for all } x . \tag{4} \]
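For the sampling oracle, assumption (3) can be checked by a quick Monte-Carlo experiment; here is a sketch under illustrative parameters, where the empirical second moment at a fixed point serves only as a rough proxy for $B^2$:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)
x = np.ones(d)

def oracle(x):
    """Sampling oracle: g(x, xi) = grad f_xi(x), xi uniform on {1, ..., n}."""
    i = rng.integers(n)
    return A[i] * (A[i] @ x - b[i])

true_grad = A.T @ (A @ x - b) / n
draws = np.stack([oracle(x) for _ in range(200_000)])
print(np.linalg.norm(draws.mean(axis=0) - true_grad))  # ~0: (3) holds
print(np.mean(np.sum(draws ** 2, axis=1)))             # rough proxy for B^2 at x
```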
The simplest strategy to choose the step size, as for the batch gradient descent, is to fix some constant value $\gamma_k \equiv \gamma > 0$ for all iterations.
A consequence is that when the step size $\gamma$ is constant, the iterates do not converge to a solution: they only reach a neighborhood of it, whose size grows with $\gamma$.
The situation is radically different from the batch mode, where the asymptotic regime produces a true solution.
Theorem 1. SGD with constant step size
Assume that $F$ is $L$-smooth and bounded below, and that (3), (4) are satisfied. Then, the SGD iterates (2) with constant step size $\gamma_k \equiv \gamma$ satisfy
\[ \min_{0 \le k < K} \mathbb{E}\big[ \|\nabla F(x_k)\|^2 \big] \le \frac{F(x_0) - \inf F}{\gamma K} + \frac{L \gamma B^2}{2} . \tag{5} \]
Proof. This is a slight adaptation of the descent lemma from the batch gradient descent proof. We start from
\[ F(x_{k+1}) \le F(x_k) + \langle \nabla F(x_k), x_{k+1} - x_k \rangle + \frac{L}{2} \|x_{k+1} - x_k\|^2 = F(x_k) - \gamma \langle \nabla F(x_k), g(x_k, \xi_k) \rangle + \frac{L \gamma^2}{2} \|g(x_k, \xi_k)\|^2 . \]
In the light of (3), (4), we have thus, taking the expectation,
\[ \mathbb{E}[F(x_{k+1})] \le \mathbb{E}[F(x_k)] - \gamma\, \mathbb{E}\big[ \|\nabla F(x_k)\|^2 \big] + \frac{L \gamma^2 B^2}{2} . \]
Telescoping the difference between $k = 0$ and $k = K - 1$, we have
\[ \mathbb{E}[F(x_K)] - F(x_0) \le - \gamma \sum_{k=0}^{K-1} \mathbb{E}\big[ \|\nabla F(x_k)\|^2 \big] + \frac{K L \gamma^2 B^2}{2} . \]
We obtain the claimed result by dividing by $\gamma K$ and observing that $\mathbb{E}[F(x_K)] \ge \inf F$ and that the minimum is smaller than the average. □
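The dependence of the final accuracy on the constant step size can be observed numerically. In the following illustrative sketch (arbitrary least-squares problem and parameter values), smaller values of $\gamma$ bring the iterates closer to the solution, in line with Theorem 1:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]   # minimizer of the empirical risk

def sgd_constant(gamma, K=20_000):
    """Iteration (2) with the sampling oracle and a constant step size."""
    x = np.zeros(d)
    for _ in range(K):
        i = rng.integers(n)
        x -= gamma * A[i] * (A[i] @ x - b[i])
    return x

for gamma in (3e-2, 1e-2, 1e-3):
    print(gamma, np.linalg.norm(sgd_constant(gamma) - x_ls))
```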
3 Convergence rate in the convex setting
We now move away from the constant step size framework, and will study what happens when the step size $\gamma_k$ decreases along the iterations.
Theorem 2. Rate of (2) for convex functions
Assume that $F$ is convex. For an initial guess $x_0$, suppose that $F$ has a minimizer $x_\star$ satisfying
\[ \|x_0 - x_\star\| \le R \]
for some $R > 0$. Assume that (3), (4) are satisfied. Then, the SGD iterates with step-size $\gamma_k = \frac{R}{B \sqrt{k+1}}$ satisfy
\[ \mathbb{E}[F(\bar{x}_K)] - F(x_\star) \le \frac{R B\,(2 + \log K)}{2 \sqrt{K}} \]
[Note that $F(x_\star)$ is not random, but $F(\bar{x}_K)$ is, so that $\mathbb{E}[F(\bar{x}_K)] - F(x_\star)$ is meaningful.], where $\bar{x}_K$ is the averaged version of the iterates $(x_k)$, i.e.,
\[ \bar{x}_K = \frac{\sum_{k=0}^{K-1} \gamma_k x_k}{\sum_{k=0}^{K-1} \gamma_k} . \]
Proof. Let $k \ge 0$. We have the following decomposition:
\[ \|x_{k+1} - x_\star\|^2 = \|x_k - x_\star\|^2 - 2 \gamma_k \langle g(x_k, \xi_k),\, x_k - x_\star \rangle + \gamma_k^2 \|g(x_k, \xi_k)\|^2 . \]
Using the linearity of the expectation, we get
\[ \mathbb{E}\|x_{k+1} - x_\star\|^2 = \mathbb{E}\|x_k - x_\star\|^2 - 2 \gamma_k\, \mathbb{E}\langle g(x_k, \xi_k),\, x_k - x_\star \rangle + \gamma_k^2\, \mathbb{E}\|g(x_k, \xi_k)\|^2 . \]
We keep the first term as it is; the last term is bounded by $\gamma_k^2 B^2$ thanks to (4). To bound the middle term, we use the total law of conditional expectation:
\[ \mathbb{E}\langle g(x_k, \xi_k),\, x_k - x_\star \rangle = \mathbb{E}\big[ \langle \nabla F(x_k),\, x_k - x_\star \rangle \big] \ge \mathbb{E}\big[ F(x_k) - F(x_\star) \big], \]
where the last inequality comes from the differential characterization of a convex function. Combining these three terms gives
\[ \mathbb{E}\|x_{k+1} - x_\star\|^2 \le \mathbb{E}\|x_k - x_\star\|^2 - 2 \gamma_k\, \mathbb{E}\big[ F(x_k) - F(x_\star) \big] + \gamma_k^2 B^2 . \]
Reorganizing the terms, we have
\[ 2 \gamma_k\, \mathbb{E}\big[ F(x_k) - F(x_\star) \big] \le \mathbb{E}\|x_k - x_\star\|^2 - \mathbb{E}\|x_{k+1} - x_\star\|^2 + \gamma_k^2 B^2 . \]
We observe that summing over the iterations $k = 0, \dots, K-1$ leads to a natural telescoping sum on the first terms of the right-hand side:
\[ 2 \sum_{k=0}^{K-1} \gamma_k\, \mathbb{E}\big[ F(x_k) - F(x_\star) \big] \le \|x_0 - x_\star\|^2 - \mathbb{E}\|x_K - x_\star\|^2 + B^2 \sum_{k=0}^{K-1} \gamma_k^2 . \]
Since $\mathbb{E}\|x_K - x_\star\|^2$ is nonnegative, we have the cruder bound
\[ 2 \sum_{k=0}^{K-1} \gamma_k\, \mathbb{E}\big[ F(x_k) - F(x_\star) \big] \le R^2 + B^2 \sum_{k=0}^{K-1} \gamma_k^2 . \]
Dividing by $2 \sum_{k=0}^{K-1} \gamma_k$, we obtain
\[ \sum_{k=0}^{K-1} \frac{\gamma_k}{\sum_{j=0}^{K-1} \gamma_j}\, \mathbb{E}\big[ F(x_k) - F(x_\star) \big] \le \frac{R^2 + B^2 \sum_{k=0}^{K-1} \gamma_k^2}{2 \sum_{k=0}^{K-1} \gamma_k} . \tag{6} \]
The left-hand side of (6) is an upper-bound of $\mathbb{E}[F(\bar{x}_K)] - F(x_\star)$. Indeed,
\[ F(\bar{x}_K) = F\Bigg( \sum_{k=0}^{K-1} \frac{\gamma_k}{\sum_{j} \gamma_j}\, x_k \Bigg) \le \sum_{k=0}^{K-1} \frac{\gamma_k}{\sum_{j} \gamma_j}\, F(x_k), \]
where the inequality comes from Jensen's inequality. We now look at an upper-bound of the right-hand side. We need two things:
• The sum $\sum_k \gamma_k$ needs to diverge to $+\infty$, so that the first term goes to 0.
• The sum $\sum_k \gamma_k$ needs to diverge more “quickly” than the sum of squares $\sum_k \gamma_k^2$, so that the second term goes also to 0.
A typical example of such a sequence is $\gamma_k = \frac{c}{\sqrt{k+1}}$ for some $c > 0$. Indeed, we have
\[ \sum_{k=0}^{K-1} \gamma_k = c \sum_{k=1}^{K} \frac{1}{\sqrt{k}} \quad \text{and} \quad \sum_{k=0}^{K-1} \gamma_k^2 = c^2 \sum_{k=1}^{K} \frac{1}{k} . \]
Using the fact that $\sum_{k=1}^{K} \frac{1}{\sqrt{k}} \ge \sqrt{K}$, we have that
\[ \mathbb{E}[F(\bar{x}_K)] - F(x_\star) \le \frac{R^2 + c^2 B^2 \sum_{k=1}^{K} \frac{1}{k}}{2 c \sqrt{K}} . \]
We bound the harmonic series $\sum_{k=1}^{K} \frac{1}{k}$ by $1 + \log K$, hence
\[ \mathbb{E}[F(\bar{x}_K)] - F(x_\star) \le \frac{R^2 + c^2 B^2 (1 + \log K)}{2 c \sqrt{K}} . \]
We obtain the claimed bound by setting $c = R/B$. □
Several remarks need to be stated:
• As claimed before, SGD is not a descent method. Moreover, we proved here the convergence in expectation of the function values at the averaged iterates $\bar{x}_K$. Note that almost-sure convergence can be proved, but with more involved arguments, see for instance (Sebbouh, Gower and Defazio, 2021) and references therein.
• It is possible to replace the noisy gradient step by a projected noisy gradient step, i.e., using the algorithm defined by
\[ x_{k+1} = \mathrm{proj}_{\mathcal{C}}\big( x_k - \gamma_k\, g(x_k, \xi_k) \big), \]
where $\mathcal{C}$ is a closed convex constraint set. The obtained results are exactly the same.
• It is also possible to remove the assumption that $x_0$ lives in a neighborhood of an optimal solution, and to require instead that $F$ is $L$-smooth. Note that in this case, we do not obtain better results (contrary to the nonstochastic setting).
• The bound can be shown to be optimal (Agarwal, Bartlett, Ravikumar and Wainwright, 2012, Theorem 1), at least in the Lipschitz setting.
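Putting the ingredients of Theorem 2 together (decreasing step sizes and weighted averaging) gives a short algorithm. Here is a minimal Python sketch, on an arbitrary least-squares problem and with an arbitrary constant $c$:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)
F = lambda x: 0.5 * np.mean((A @ x - b) ** 2)

def averaged_sgd(c=0.05, K=50_000):
    """SGD with gamma_k = c / sqrt(k+1); returns the weighted average
    x_bar_K = (sum_k gamma_k x_k) / (sum_k gamma_k) of Theorem 2."""
    x, x_bar, weight_sum = np.zeros(d), np.zeros(d), 0.0
    for k in range(K):
        gamma = c / np.sqrt(k + 1)
        weight_sum += gamma
        x_bar += (gamma / weight_sum) * (x - x_bar)  # running weighted mean of x_k
        i = rng.integers(n)
        x -= gamma * A[i] * (A[i] @ x - b[i])        # iteration (2)
    return x_bar

x_ls = np.linalg.lstsq(A, b, rcond=None)[0]
print(F(averaged_sgd()) - F(x_ls))   # small suboptimality gap in practice
```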
4 Minibatch Stochastic Gradient Descent
The minibatch idea is to make a tradeoff between the fast rate of the gradient descent, whose iterations have a linear dependency on the number of samples $n$, and the cheap iterations of SGD, whose rate is slower. Given a batch size $m$, each iteration averages $m$ independent stochastic gradients:
\[ x_{k+1} = x_k - \frac{\gamma_k}{m} \sum_{j=1}^{m} g(x_k, \xi_{k,j}) , \]
where the $\xi_{k,j}$ are i.i.d. Observing that this averaged oracle still satisfies the unbiasedness assumption (3), while its variance is divided by $m$, the analysis of the previous section carries over with a smaller noise term, at the price of $m$ oracle calls per iteration.
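Here is a minimal sketch of minibatch SGD, where sampling with replacement and all parameter values are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 500, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.5 * rng.standard_normal(n)
x_ls = np.linalg.lstsq(A, b, rcond=None)[0]

def minibatch_sgd(m, gamma=0.05, K=5_000):
    """SGD where each step averages m sampled gradients (drawn with replacement)."""
    x = np.zeros(d)
    for _ in range(K):
        idx = rng.integers(n, size=m)              # minibatch of size m
        g = A[idx].T @ (A[idx] @ x - b[idx]) / m   # averaged stochastic gradient
        x -= gamma * g
    return x

for m in (1, 10, 100):
    print(m, np.linalg.norm(minibatch_sgd(m) - x_ls))
```

Increasing $m$ reduces the stochastic noise per iteration but makes each iteration $m$ times more expensive; the choice of $m$ balances the iteration cost against the noise term in the bounds above.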
References
Agarwal, Alekh and Bartlett, Peter L. and Ravikumar, Pradeep and Wainwright, Martin J. (2012). Information-Theoretic Lower Bounds on the Oracle Complexity of Stochastic Convex Optimization.
Bach, Francis (2021). Learning Theory from First Principles.
Bottou, Léon and Bousquet, Olivier (2007). The Tradeoffs of Large Scale Learning, Curran Associates, Inc..
Gower, Robert Mansel and Loizou, Nicolas and Qian, Xun and Sailanbayev, Alibek and Shulgin, Egor and Richtárik, Peter (2019). SGD: General Analysis and Improved Rates, PMLR.
Robbins, H. and Siegmund, D. (1971). A Convergence Theorem for Non Negative Almost Supermartingales and Some Applications, Academic Press.
Robbins, Herbert and Monro, Sutton (1951). A Stochastic Approximation Method, Institute of Mathematical Statistics.
Sebbouh, Othmane and Gower, Robert M. and Defazio, Aaron (2021). Almost Sure Convergence Rates for Stochastic Gradient Descent and Stochastic Heavy Ball, PMLR.