Adding some noise

February 9th, 2024

This content is available as a PDF here.


(image)

Figure 1: The seminal paper on stochastic approximation by (Robbins and Monro, 1951)

Stochastic gradient descent (SGD) is the workhorse of modern machine learning. Almost all recent achievements (since the 2010s) in computer vision and natural language processing are somehow connected to the use of SGD (or one of its many generalizations). Despite this modern aspect, SGD has its roots in the seminal paper of (Robbins and Monro, 1951), more than 70 years ago. The key insight of Robbins and Monro is that first-order methods are robust. What does this mean? Gradient descent does not need an exact computation of the gradient: it only needs the computation to be correct on average, up to some “stochastic” errors. We will formalize this idea in this chapter, again inspired by the lecture notes of (Bach, 2021). A key insight from the influential paper of (Bottou and Bousquet, 2007) is that in machine learning, high precision is not required, since it is meaningless to optimize below the statistical error. This paper considerably revived the interest in SGD, especially for empirical risk minimization.

1 Stochastic approximation

Let $(\Omega, \mathcal{A}, P)$ be a probability space. We consider the following stochastic approximation problem:

$$x^\star \in \operatorname*{argmin}_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{\xi \sim P}\,[F(x;\xi)]. \tag{1}$$

Observe that (1) encompasses two important problems of supervised machine learning: expected risk minimization and empirical risk minimization. Let us recall basic facts on risk minimization. Given observations¹ (couples of features and responses) $(c_i, r_i) \in \mathcal{C} \times \mathcal{R}$, for $i = 1, \ldots, N$, the goal is to predict a new response $r \in \mathcal{R}$ given a new feature $c \in \mathcal{C}$. Consider now the test distribution $P$ on $\Omega = \mathcal{C} \times \mathcal{R}$ and a loss function $\ell : \mathcal{R} \times \mathcal{R} \to \mathbb{R}$.

  • Expected risk minimization. The expected risk (or simply risk) of a predictor $\phi : \mathcal{C} \to \mathcal{R}$ is the quantity

$$\mathcal{R}(\phi) = \mathbb{E}_{(c,r) \sim P}\,[\ell(r, \phi(c))] = \int_{\mathcal{C} \times \mathcal{R}} \ell(r, \phi(c))\, \mathrm{d}P(c,r).$$

This minimization problem fits the framework of (1) with the predictor $\phi$ as optimization variable, $\xi = (c,r)$, $F(\phi; (c,r)) = \ell(r, \phi(c))$, and $P$ the joint probability on $\mathcal{C} \times \mathcal{R}$.

  • Empirical risk minimization. The practical counterpart of the expected risk is the empirical risk defined as

$$\hat{\mathcal{R}}(\phi) = \frac{1}{N} \sum_{i=1}^{N} \ell(r_i, \phi(c_i)).$$

Again, this minimization problem fits the framework of (1) with $\xi = (c,r)$ and $F(\phi; (c,r)) = \ell(r, \phi(c))$, but now $P$ is the empirical distribution of the observed data $(c_i, r_i)_{i=1,\ldots,N}$.

1 Variables c and r are typically denoted x and y in ML courses, but we want to keep those letters free for the optimization variable.

2 Stochastic gradient descent

We suppose that we have access to a stochastic estimate $g^{(t+1)}(x^{(t)};\xi^{(t)})$ of the gradient $\nabla f(x^{(t)})$, and consider the iteration

$$x^{(t+1)} = x^{(t)} - \eta^{(t)}\, g^{(t+1)}(x^{(t)};\xi^{(t)}). \tag{2}$$

The stochastic gradient algorithm is NOT a descent method, in the sense that the oracle $g^{(t+1)}(x^{(t)};\xi^{(t)})$ is not necessarily a descent direction as defined, for instance, in our lecture on gradient descent.

Two situations:

  • Stochastic approximation. Here we have access to an additive noisy version of the “true” gradient:

$$g^{(t)}(x;\xi) = \nabla f(x) + \xi.$$

  • Learning from samples. On the contrary, here $P$ is for instance the uniform distribution over the indices $\{1, \ldots, N\}$:

$$g^{(t)}(x;\xi) = \nabla f_\xi(x),$$

where $\xi \sim \mathcal{U}(\{1, \ldots, N\})$ and $f_i$ denotes the loss on the $i$-th sample, so that $f = \frac{1}{N}\sum_{i=1}^{N} f_i$ (a minimal code sketch of this oracle is given below).
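To make the learning-from-samples oracle concrete, here is a minimal Python sketch (the least-squares setup, the variable names and the step size are illustrative assumptions, not part of the lecture): each iteration draws one index uniformly at random and takes a step along the gradient of the corresponding per-sample loss.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic least-squares data (illustrative assumption): responses r_i generated
# from features c_i by a linear model plus noise.
N, d = 1000, 5
C = rng.normal(size=(N, d))
x_true = rng.normal(size=d)
r = C @ x_true + 0.1 * rng.normal(size=N)

def stochastic_gradient(x, xi):
    """Oracle g(x; xi): gradient of the squared loss on the single example xi."""
    return (C[xi] @ x - r[xi]) * C[xi]

x = np.zeros(d)            # initial guess x^(0)
eta = 0.01                 # small constant step size (its effect is studied below)
for t in range(20_000):
    xi = rng.integers(N)   # xi ~ U({1, ..., N})
    x = x - eta * stochastic_gradient(x, xi)

print("distance to x_true:", np.linalg.norm(x - x_true))
```

With this constant step size the iterates approach the empirical risk minimizer but keep fluctuating because of the gradient noise; the constant step-size analysis below quantifies this.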

The standard assumption when dealing with SGD is access to an unbiased estimator of the gradient.

  • Assumption 1. Unbiasedness of the stochastic gradient estimate

    $$\mathbb{E}_{\xi}\big[g^{(t+1)}(x^{(t)};\xi) \mid x^{(t)}\big] = \nabla f(x^{(t)}). \tag{3}$$

Note that it is possible to deal with biased oracles, but this goes beyond the scope of this course.

We are going to study the convergence of SGD in expectation, which is the main way the machine learning community analyzes stochastic convergence. Nevertheless, the stochastic approximation literature, starting from (Robbins and Siegmund, 1971), is often concerned with almost sure convergence based on martingale tools.

For convex functions, we will assume that the noisy gradients $g^{(t)}$ are almost surely bounded.

  • Assumption 2. Bounded gradient

    $$\exists B > 0,\ \forall t \in \mathbb{N},\quad \|g^{(t+1)}(x^{(t)};\xi)\|_2^2 \le B^2 \quad P\text{-almost surely}. \tag{4}$$

The simplest strategy to choose the step size, as for batch gradient descent, is to fix some $\eta^{(t)} = \eta > 0$. Unfortunately, this method does not converge towards a minimizer of (1)! It is nevertheless the most frequent way SGD is used in practice. Why? Because, despite the nonconvergence, it is possible to show that the iterates are “close enough” to the true minimizer in the following sense (even if $f$ is not convex).

  • Theorem 1. Convergence of SGD up to a ball

    Assume that $f$ is $L$-smooth and (3), (4) are satisfied. For an initial guess $x^{(0)}$, the SGD iterates $x^{(t)}$ with constant step size $\eta^{(t)} = \eta$ satisfy

    $$\frac{1}{T}\, \mathbb{E}\!\left[\sum_{t=0}^{T-1} \|\nabla f(x^{(t)})\|^2\right] \le \frac{f(x^{(0)}) - f^\star}{T\eta} + \frac{\eta L B^2}{2},$$

    where $f^\star$ is the optimal value of (1).

A consequence is that, as $T$ goes to $+\infty$, the expected squared norm of the gradient at a point chosen uniformly from the trajectory of SGD lies in a ball of radius $\frac{\eta L B^2}{2}$ centered at $0$.


The situation is radically different from the batch mode, where the asymptotic regime produces a true solution.
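The behaviour described by Theorem 1 can be reproduced with a few lines of NumPy. This is a purely illustrative sketch: the quadratic objective, the Gaussian noise and the step sizes are assumptions of this example (and Gaussian noise is not almost surely bounded as in (4)); it shows the average squared gradient norm stalling at a floor roughly proportional to $\eta$ instead of going to $0$.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative quadratic: f(x) = 0.5 ||x||^2, so grad f(x) = x, L = 1 and x* = 0.
d, sigma = 10, 1.0

def noisy_gradient(x):
    # Additive-noise oracle: g(x; xi) = grad f(x) + xi, with xi ~ N(0, sigma^2 I).
    return x + sigma * rng.normal(size=d)

for eta in (0.1, 0.01):
    x = np.ones(d)
    sq_grad_norms = []
    for t in range(5000):
        sq_grad_norms.append(float(x @ x))   # ||grad f(x^(t))||^2 = ||x^(t)||^2
        x = x - eta * noisy_gradient(x)
    # The trajectory stalls at a noise floor roughly proportional to eta.
    print(f"eta = {eta}: mean squared gradient norm over the last 1000 iterations:",
          round(float(np.mean(sq_grad_norms[-1000:])), 4))
```

Dividing the step size by 10 lowers the floor by roughly a factor of 10, as the $\frac{\eta L B^2}{2}$ term in Theorem 1 suggests.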

  • Proof. Proof of Theorem 1

    This is a slight adaptation of the descent lemma from the batch gradient descent proof. We start from

    $$f(x^{(t+1)}) \le f(x^{(t)}) - \eta \big\langle g^{(t+1)}(x^{(t)};\xi^{(t)}),\, \nabla f(x^{(t)}) \big\rangle + \frac{\eta^2 L}{2} \|g^{(t+1)}(x^{(t)};\xi^{(t)})\|^2.$$

    In light of (3) and (4), taking the expectation we thus have

    $$\mathbb{E}[f(x^{(t+1)})] \le \mathbb{E}[f(x^{(t)})] - \eta\, \mathbb{E}\big[\|\nabla f(x^{(t)})\|^2\big] + \frac{\eta^2 L B^2}{2}.$$

    Summing for $t = 0, \ldots, T-1$ and telescoping the differences $f(x^{(t+1)}) - f(x^{(t)})$, we have

    $$\mathbb{E}\big[f(x^{(T)}) - f(x^{(0)})\big] \le T\, \frac{\eta^2 L B^2}{2} - \eta \sum_{t=0}^{T-1} \mathbb{E}\big[\|\nabla f(x^{(t)})\|^2\big].$$

    We obtain the claimed result by dividing by $T\eta$ and observing that $f(x^{(T)}) \ge f^\star$.

3 Convergence rate in the convex setting

We now move away from the constant step-size framework and study what happens when $\eta^{(t)}$ varies across iterations. The following theorem gives the $O(1/\sqrt{t})$ rate (up to a logarithmic factor) of SGD in the convex case.

  • Theorem 2. Rate of (2) for convex functions

    Assume that $f$ is convex. For an initial guess $x^{(0)}$, suppose that $f$ has a minimizer $x^\star$ satisfying

    $$\|x^\star - x^{(0)}\| \le D,$$

    for some $D > 0$. Assume that (3) and (4) are satisfied. Then, the SGD iterates $x^{(t)}$ with step size $\eta^{(t)} = \frac{D}{B\sqrt{t+1}}$ satisfy [Note that $f$ is not random, but $\bar{x}^{(t)}$ is, so that $\mathbb{E}[f(\bar{x}^{(t)})]$ is meaningful.]

    $$\mathbb{E}\big[f(\bar{x}^{(t)}) - f(x^\star)\big] \le DB\, \frac{2 + \log(t)}{2\sqrt{t}},$$

    where $\bar{x}^{(t)}$ is the averaged version of the iterates $x^{(t)}$, i.e.,

    $$\bar{x}^{(t)} = \frac{1}{\sum_{k=1}^{t} \eta^{(k)}} \sum_{k=1}^{t} \eta^{(k)}\, x^{(k-1)}.$$

  • Proof. Let $t \ge 0$. We have the following decomposition:

    $$\|x^{(t+1)} - x^\star\|^2 = \|x^{(t)} - \eta^{(t)} g^{(t+1)}(x^{(t)};\xi^{(t)}) - x^\star\|^2 = \|x^{(t)} - x^\star\|^2 - 2\eta^{(t)} \big\langle g^{(t+1)}(x^{(t)};\xi^{(t)}),\, x^{(t)} - x^\star \big\rangle + (\eta^{(t)})^2 \|g^{(t+1)}(x^{(t)};\xi^{(t)})\|^2.$$ Using the linearity of the expectation, we get

    $$\mathbb{E}[\|x^{(t+1)} - x^\star\|^2] = \mathbb{E}[\|x^{(t)} - x^\star\|^2] - 2\eta^{(t)}\, \mathbb{E}\big[\big\langle g^{(t+1)}(x^{(t)};\xi^{(t)}),\, x^{(t)} - x^\star \big\rangle\big] + (\eta^{(t)})^2\, \mathbb{E}\big[\|g^{(t+1)}(x^{(t)};\xi^{(t)})\|^2\big].$$ We keep the first term as it is; the last term is bounded by

    $$(\eta^{(t)})^2\, \mathbb{E}\big[\|g^{(t+1)}(x^{(t)};\xi^{(t)})\|^2\big] \le (\eta^{(t)})^2 B^2 \qquad \text{(bounded gradient)}.$$ To bound the middle term, we use the law of total expectation:

    $$\begin{aligned} \mathbb{E}\big[\big\langle g^{(t+1)}(x^{(t)};\xi^{(t)}),\, x^{(t)} - x^\star \big\rangle\big] &= \mathbb{E}\Big[\mathbb{E}\big[\big\langle g^{(t+1)}(x^{(t)};\xi^{(t)}),\, x^{(t)} - x^\star \big\rangle \mid x^{(t)}\big]\Big] && \text{(law of total expectation)} \\ &= \mathbb{E}\Big[\big\langle \mathbb{E}[g^{(t+1)}(x^{(t)};\xi^{(t)}) \mid x^{(t)}],\, x^{(t)} - x^\star \big\rangle\Big] && \text{(linearity of cond. expectation)} \\ &= \mathbb{E}\big[\big\langle \nabla f(x^{(t)}),\, x^{(t)} - x^\star \big\rangle\big] && \text{(unbiasedness)} \\ &\ge \mathbb{E}\big[f(x^{(t)}) - f(x^\star)\big], \end{aligned}$$ where the last inequality comes from the differential characterization of a convex function. Combining these three terms gives

    $$\mathbb{E}[\|x^{(t+1)} - x^\star\|^2] \le \mathbb{E}[\|x^{(t)} - x^\star\|^2] - 2\eta^{(t)}\, \mathbb{E}\big[f(x^{(t)}) - f(x^\star)\big] + (\eta^{(t)})^2 B^2. \tag{5}$$

    Reorganizing the terms, we have

    $$\eta^{(t)}\, \mathbb{E}\big[f(x^{(t)}) - f(x^\star)\big] \le \frac{1}{2}\Big(\mathbb{E}[\|x^{(t)} - x^\star\|^2] - \mathbb{E}[\|x^{(t+1)} - x^\star\|^2]\Big) + \frac{1}{2}(\eta^{(t)})^2 B^2.$$

    We observe that summing over the iterations $t = 0, \ldots, T-1$ leads to a natural telescoping sum in the first term of the right-hand side:

    $$\sum_{t=0}^{T-1} \eta^{(t)}\, \mathbb{E}\big[f(x^{(t)}) - f(x^\star)\big] \le \frac{1}{2}\Big(\mathbb{E}[\|x^{(0)} - x^\star\|^2] - \mathbb{E}[\|x^{(T)} - x^\star\|^2]\Big) + \frac{1}{2} B^2 \sum_{t=0}^{T-1} (\eta^{(t)})^2.$$

    Since $\mathbb{E}[\|x^{(T)} - x^\star\|^2]$ is nonnegative, we have the cruder bound

    $$\sum_{t=0}^{T-1} \eta^{(t)}\, \mathbb{E}\big[f(x^{(t)}) - f(x^\star)\big] \le \frac{1}{2}\|x^{(0)} - x^\star\|^2 + \frac{1}{2} B^2 \sum_{t=0}^{T-1} (\eta^{(t)})^2.$$

    Dividing by $\sum_{t=0}^{T-1} \eta^{(t)}$, we obtain

    $$\frac{1}{\sum_{t=0}^{T-1}\eta^{(t)}} \sum_{t=0}^{T-1} \eta^{(t)}\, \mathbb{E}\big[f(x^{(t)}) - f(x^\star)\big] \le \frac{\|x^{(0)} - x^\star\|^2}{2\sum_{t=0}^{T-1}\eta^{(t)}} + \frac{B^2 \sum_{t=0}^{T-1}(\eta^{(t)})^2}{2\sum_{t=0}^{T-1}\eta^{(t)}}. \tag{6}$$

    The left-hand side of (6) is an upper bound on $\mathbb{E}[f(\bar{x}^{(T)}) - f(x^\star)]$. Indeed,

    $$\begin{aligned} \frac{1}{\sum_{t=0}^{T-1}\eta^{(t)}} \sum_{t=0}^{T-1} \eta^{(t)}\, \mathbb{E}\big[f(x^{(t)}) - f(x^\star)\big] &= \mathbb{E}\left[\frac{1}{\sum_{t=0}^{T-1}\eta^{(t)}} \sum_{t=0}^{T-1} \eta^{(t)} \big(f(x^{(t)}) - f(x^\star)\big)\right] \\ &= \mathbb{E}\left[\sum_{t=0}^{T-1} \frac{\eta^{(t)}}{\sum_{s=0}^{T-1}\eta^{(s)}}\, f(x^{(t)}) - f(x^\star)\right] \\ &\ge \mathbb{E}\left[f\!\left(\sum_{t=0}^{T-1} \frac{\eta^{(t)}}{\sum_{s=0}^{T-1}\eta^{(s)}}\, x^{(t)}\right) - f(x^\star)\right] = \mathbb{E}\big[f(\bar{x}^{(T)}) - f(x^\star)\big], \end{aligned}$$ where the inequality comes from Jensen’s inequality.

    We now look at an upper-bound of the right-hand side. We need two things:

    • The sum $\sum_{t=0}^{T-1}\eta^{(t)}$ needs to diverge to $+\infty$ so that the first term goes to $0$.

    • The sum $\sum_{t=0}^{T-1}\eta^{(t)}$ needs to diverge to $+\infty$ more “quickly” than the sum of squares $\sum_{t=0}^{T-1}(\eta^{(t)})^2$ so that the second term also goes to $0$.

    A typical example of such a sequence is $\eta^{(t)} = \alpha (t+1)^{-1/2}$ for some $\alpha > 0$. Indeed, we have

    $$\frac{\|x^{(0)} - x^\star\|^2}{2\sum_{t=0}^{T-1}\eta^{(t)}} + \frac{B^2 \sum_{t=0}^{T-1}(\eta^{(t)})^2}{2\sum_{t=0}^{T-1}\eta^{(t)}} \le \frac{D^2 + B^2\alpha^2 \sum_{t=1}^{T} t^{-1}}{2\alpha \sum_{t=1}^{T} t^{-1/2}}.$$

    Using the fact that

    $$\sum_{t=1}^{T} t^{-1/2} \ge \sum_{t=1}^{T} T^{-1/2} = \frac{T}{\sqrt{T}} = \sqrt{T},$$

    we have that

    $$\frac{D^2 + B^2\alpha^2 \sum_{t=1}^{T} t^{-1}}{2\alpha \sum_{t=1}^{T} t^{-1/2}} \le \frac{1}{2\alpha\sqrt{T}}\left(D^2 + B^2\alpha^2 \sum_{t=1}^{T} t^{-1}\right).$$

    We bound the harmonic series $\sum_{t=1}^{T} t^{-1}$ by $1 + \log T$, hence

    $$\frac{\|x^{(0)} - x^\star\|^2}{2\sum_{t=0}^{T-1}\eta^{(t)}} + \frac{B^2 \sum_{t=0}^{T-1}(\eta^{(t)})^2}{2\sum_{t=0}^{T-1}\eta^{(t)}} \le \frac{1}{2\alpha\sqrt{T}}\left(D^2 + B^2\alpha^2 (1 + \log T)\right).$$

    We obtain the claimed bound by setting $\alpha = D/B$.
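Here is a minimal sketch of the recipe of Theorem 2 (illustrative assumptions: the least-squares data from the earlier sketch, and a guessed value of `alpha` standing in for $D/B$, which is not computed here): decaying steps $\eta^{(t)} = \alpha(t+1)^{-1/2}$ together with the step-size-weighted running average $\bar{x}$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative convex problem: empirical least-squares risk (same setup as above).
N, d = 1000, 5
C = rng.normal(size=(N, d))
x_true = rng.normal(size=d)
r = C @ x_true + 0.1 * rng.normal(size=N)

def stochastic_gradient(x, xi):
    return (C[xi] @ x - r[xi]) * C[xi]

alpha = 0.1               # plays the role of D / B (a guessed value, not computed)
x = np.zeros(d)
x_bar = np.zeros(d)       # step-size-weighted average of the iterates
eta_sum = 0.0
for t in range(50_000):
    eta = alpha / np.sqrt(t + 1)          # eta^(t) = alpha (t+1)^(-1/2)
    xi = rng.integers(N)
    # Update the running weighted average with the current iterate BEFORE the step.
    eta_sum += eta
    x_bar += (eta / eta_sum) * (x - x_bar)
    x = x - eta * stochastic_gradient(x, xi)

print("last iterate error    :", np.linalg.norm(x - x_true))
print("averaged iterate error:", np.linalg.norm(x_bar - x_true))
```

In this kind of experiment the averaged iterate is typically noticeably more stable than the last iterate, which is the point of the averaging in Theorem 2.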

Several remarks are in order:

  • As claimed before, SGD is not a descent method. Moreover, we proved here convergence in expectation of the function values at the averaged iterates $\bar{x}^{(t)}$. Note that almost-sure convergence can be proved, but with more involved arguments; see for instance (Sebbouh et al., 2021) and references therein.

  • It is possible to replace the noisy gradient step by a projected noisy gradient step, i.e., using the algorithm defined by

$$x^{(t+1)} = \Pi_{B(x^{(0)},\, D)}\big(x^{(t)} - \eta^{(t)}\, g^{(t+1)}(x^{(t)};\xi^{(t)})\big).$$

The obtained results are exactly the same (a short sketch of the projection step is given after these remarks).

  • It is also possible to remove the assumption that $x^{(0)}$ lives in a neighborhood of an optimal solution $x^\star$ and to require instead that $F$ is $B$-smooth. Note that in this case, we do not obtain better results (contrary to the nonstochastic setting).

  • The bound can be shown to be optimal (Agarwal et al., 2012, Theorem 1), at least in the Lipschitz setting.
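Regarding the projected variant mentioned in the remarks, the Euclidean projection onto the ball $B(x^{(0)}, D)$ has a simple closed form. A small sketch (the surrounding variables and the oracle are assumed to be defined as in the previous sketches):

```python
import numpy as np

def project_onto_ball(x, center, radius):
    """Euclidean projection of x onto the closed ball B(center, radius)."""
    diff = x - center
    norm = np.linalg.norm(diff)
    if norm <= radius:
        return x
    return center + (radius / norm) * diff

# One projected SGD step (x, x0, D, eta and `stochastic_gradient` assumed defined):
#     x = project_onto_ball(x - eta * stochastic_gradient(x, xi), x0, D)
```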

4 Minibatch Stochastic Gradient Descent

The minibatch idea is to make a tradeoff between the fast rate of gradient descent, whose cost per iteration is however linear in $N$, and the slow rate of SGD, whose cost per iteration is independent of $N$: we reduce the variance of the noise at the price of a higher running time per iteration. Minibatching can also take advantage of parallel architectures such as GPUs or SIMD instructions on CPUs. A minibatch is built upon $M$ calls to the stochastic oracle, and the associated SGD step reads

$$x^{(t+1)} = x^{(t)} - \frac{\eta^{(t)}}{M} \sum_{m=1}^{M} g^{(t+1)}(x^{(t)};\xi_m^{(t)}), \quad \text{where } \xi_1^{(t)}, \ldots, \xi_M^{(t)} \text{ are drawn i.i.d. from } P.$$

Observe that $\tilde{g}^{(t+1)}(x;\boldsymbol{\xi}) = \frac{1}{M}\sum_{m=1}^{M} g^{(t+1)}(x;\xi_m)$ is an unbiased estimator of the gradient of $f$; indeed,

$$\begin{aligned} \mathbb{E}_{\boldsymbol{\xi}}\big[\tilde{g}^{(t+1)}(x^{(t)};\boldsymbol{\xi}) \mid x^{(t)}\big] &= \mathbb{E}_{\boldsymbol{\xi}}\left[\frac{1}{M}\sum_{m=1}^{M} g^{(t+1)}(x^{(t)};\xi_m) \,\Big|\, x^{(t)}\right] \\ &= \frac{1}{M}\sum_{m=1}^{M} \mathbb{E}_{\boldsymbol{\xi}}\big[g^{(t+1)}(x^{(t)};\xi_m) \mid x^{(t)}\big] && \text{(linearity)} \\ &= \frac{1}{M}\sum_{m=1}^{M} \nabla f(x^{(t)}) = \nabla f(x^{(t)}) && \text{(unbiasedness)}. \end{aligned}$$ The minibatch gradient is also bounded; indeed,

$$\|\tilde{g}^{(t+1)}(x^{(t)};\boldsymbol{\xi})\|_2^2 = \Big\|\frac{1}{M}\sum_{m=1}^{M} g^{(t+1)}(x^{(t)};\xi_m)\Big\|_2^2 \le \frac{1}{M}\sum_{m=1}^{M} \|g^{(t+1)}(x^{(t)};\xi_m)\|_2^2 \le B^2 \quad P\text{-almost surely},$$ by Jensen’s inequality applied to $\|\cdot\|_2^2$. We can hence apply Theorem 2. A lot of effort has been devoted in recent years to determining the optimal minibatch size, see e.g. (Gower et al., 2019), or to deriving scheduling policies with an increasing minibatch size.
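A minibatch version of the earlier least-squares sketch (again illustrative; the batch size $M$ and the step size are arbitrary choices): the $M$ per-sample gradients are averaged, which is exactly the oracle $\tilde{g}$ above and vectorizes well on modern hardware.

```python
import numpy as np

rng = np.random.default_rng(3)

# Same illustrative least-squares data as in the previous sketches.
N, d = 1000, 5
C = rng.normal(size=(N, d))
x_true = rng.normal(size=d)
r = C @ x_true + 0.1 * rng.normal(size=N)

M = 32                     # minibatch size (arbitrary illustrative choice)
eta = 0.05                 # constant step size (arbitrary illustrative choice)
x = np.zeros(d)
for t in range(5000):
    idx = rng.integers(N, size=M)        # (xi_1, ..., xi_M): i.i.d. uniform indices
    residuals = C[idx] @ x - r[idx]      # shape (M,)
    g_tilde = C[idx].T @ residuals / M   # averaged minibatch gradient, shape (d,)
    x = x - eta * g_tilde

print("distance to x_true:", np.linalg.norm(x - x_true))
```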

Agarwal, Alekh and Bartlett, Peter L. and Ravikumar, Pradeep and Wainwright, Martin J. (2012). Information-Theoretic Lower Bounds on the Oracle Complexity of Stochastic Convex Optimization.

Bach, Francis (2021). Learning Theory from First Principles.

Bottou, Léon and Bousquet, Olivier (2007). The Tradeoffs of Large Scale Learning, Curran Associates, Inc.

Gower, Robert Mansel and Loizou, Nicolas and Qian, Xun and Sailanbayev, Alibek and Shulgin, Egor and Richtárik, Peter (2019). SGD: General Analysis and Improved Rates, PMLR.

Robbins, H. and Siegmund, D. (1971). A Convergence Theorem for Non Negative Almost Supermartingales and Some Applications, Academic Press.

Robbins, Herbert and Monro, Sutton (1951). A Stochastic Approximation Method, Institute of Mathematical Statistics.

Sebbouh, Othmane and Gower, Robert M. and Defazio, Aaron (2021). Almost Sure Convergence Rates for Stochastic Gradient Descent and Stochastic Heavy Ball, PMLR.