Although many problems encountered in statistics and machine learning are not convex, convexity is important to study in order to understand the generic behaviour of an optimization procedure.
Convex set
A set \(C \subseteq \Xx\) is convex if every segment joining two points of \(C\) is contained in \(C\). This can be formalized as the following definition and is illustrated in TODO.
Definition 1 (Convex set) A set \(C \subseteq \Xx\) is convex if \[ \forall (x,y) \in C^{2}, \forall \lambda \in [0,1], \quad \lambda x + (1-\lambda) y \in C . \]
One can observe that it is possible to replace \(\lambda \in [0,1]\) by \(\lambda \in (0,1)\) without any difference (prove it!). In most of these lecture notes, the set \(\Xx\) will be the Euclidean space \(\mathbb{R}^{p}\), but note that this notion only requires \(\Xx\) to be a vector space over the real numbers \(\mathbb{R}\), possibly of infinite dimension.
Example 1 (Basic convex sets)
- The convex sets of \(\RR\) are exactly the intervals of \(\RR\).
- Any affine hyperplane \(H = \setcond{x \in \Xx}{\varphi(x) = \alpha}\), where \(\Xx\) is a real vector space, \(\varphi\) a linear functional and \(\alpha \in \RR\), is convex.
- Any half-space \(H_{-} = \setcond{x \in \Xx}{\varphi(x) \leq \alpha}\) is also convex.
- The unit simplex of \(\RR^{d}\) defined as \[ \Delta^{d-1} = \setcond{x \in \RR^{d}}{x_{i} \geq 0 \text{ and } \sum_{i} x_{i} = 1} \] is convex.
- The nonnegative orthant \(\RR_{\geq 0}^{d} = \setcond{x \in \RR^{d}}{x_{i} \geq 0}\) is convex.
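As a quick numerical sanity check of Definition 1 (an illustration only, not part of the formal development; the helper names below are ours), the following sketch draws random points of the unit simplex and verifies that their convex combinations stay in the simplex.

```python
import numpy as np

rng = np.random.default_rng(0)

def in_simplex(x, tol=1e-9):
    """Membership in the unit simplex: nonnegative entries summing to 1."""
    return np.all(x >= -tol) and abs(x.sum() - 1.0) < tol

def random_simplex_point(d):
    """Sample a point of the unit simplex (normalized exponential variables)."""
    z = rng.exponential(size=d)
    return z / z.sum()

d = 5
for _ in range(1000):
    x, y = random_simplex_point(d), random_simplex_point(d)
    lam = rng.uniform()
    # Definition 1: the convex combination must remain in the set
    assert in_simplex(lam * x + (1 - lam) * y)
print("all convex combinations stayed in the simplex")
```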
The following proposition reviews some important algebraic properties of convex sets.
Proposition 1 (Operations on convex sets)
- Arbitrary intersection. Let \((C_{i})_{i \in I}\) be a family of convex sets indexed by an arbitrary set \(I\). Then \(\bigcap_{i \in I} C_{i}\) is convex.
- Cartesian product. Let \(C_{1},\dots,C_{n}\) be subsets of \(\RR^{n_{1}}, \dots, \RR^{n_{n}}\) respectively. Then \(C_{1} \times \dots \times C_{n}\) is convex if, and only if, every \(C_{i}\) is convex.
- Sum. Let \(C_{1}\) and \(C_{2}\) be two convex sets. Then the (Minkowski) sum \(C_{1} + C_{2}\) is convex.
- Affine map. Let \(A: \RR^{n} \to \RR^{m}\) be an affine map and \(C \subseteq \RR^{n}\) a convex set. Then the image \(A(C)\) is convex.
Convex functions
Definition
Intuitively, a convex function is a function such that "the segment joining \(f(x)\) and \(f(y)\) lies above the graph of \(f\) between \(x\) and \(y\)". Formally, this definition is made precise using the epigraph of \(f\).
Definition 2 (Epigraph) Let \(f: \RR^{n} \to \RRb\) and \(f \neq +\infty\). The epigraph of \(f\) is the subset of \(\RR^{n} \times \RR\) defined by \[ \epigraph f := \setcond{(x,t) \in \RR^{n} \times \RR}{t \geq f(x)} . \]
Equipped with this notion, we can define convex functions using the definition of convex sets (Definition 1) applied to the epigraph (Definition 2).
Definition 3 (Convex function) A function \(f: \RR^{p} \to \RRb\) such that \(f \neq +\infty\) is a convex function if, and only if, its epigraph \(\epigraph f\) is convex.
This definition can be made explicit as follows: \[ \forall x,y \in \dom f,\, \forall \lambda \in [0,1], \qquad f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y) . \tag{1}\] Following the convention introduced in (Hiriart-Urruty and Lemarechal 1996), we will denote the set of all convex functions of \(\RR^{p}\) as \(\operatorname{Conv} \RR^{p}\), and the set of closed – that is, with closed epigraph – convex functions of \(\RR^{p}\) as \(\overline{\operatorname{Conv}} \RR^{p}\).
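Inequality (1) is easy to probe numerically. Below is a minimal sketch (our own illustration; the log-sum-exp function is an arbitrary choice of a smooth convex function) that checks it on random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    """log-sum-exp, a classical smooth convex function on R^n."""
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

n = 4
for _ in range(1000):
    x, y = rng.normal(size=n), rng.normal(size=n)
    lam = rng.uniform()
    lhs = f(lam * x + (1 - lam) * y)
    rhs = lam * f(x) + (1 - lam) * f(y)
    assert lhs <= rhs + 1e-12  # inequality (1)
```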
Local and global minima of a convex function
For an (un)constrained minimization problem over a convex set \(C \subseteq \dom f\) with a function \(f : C \to \mathbb{R}\), \[ \min_{x \in C} f(x) , \tag{2}\] we say that \(x^\star\) is
a local solution of (Equation 2), if there exists a neighborhood \(\mathcal{O}\) of \(x^\star\) such that for all \(x \in C \cap \mathcal{O}\), \(f(x) \geq f(x^\star)\) ;
a (global) solution if for all \(x \in C\), \(f(x) \geq f(x^\star)\).
A fundamental result is that, when \(f\) is convex, every local solution of (Equation 2) is a global solution, and the set of (global) solutions is a convex set.
Theorem 1 (Local minima of convex functions are global) Assume that \(f\) is a convex function and \(C\) is a closed convex set. Then,
Any local solution of (Equation 2) is a global solution;
The set of global solutions of (Equation 2) is convex.
Proof. Part 1. By contradiction. Consider a local solution \(x^\star\) and assume that \(x^\star\) is not a global solution. In particular, there exists \(z \in C\) such that \(f(z) < f(x^\star)\). By convexity, for all \(t \in (0,1]\), \[ f((1 - t) x^\star + t z) \leq (1-t) f(x^\star) + t f(z) < f(x^\star) . \] Given any neighborhood \(\mathcal{O}\) of \(x^\star\), there exists \(t > 0\) small enough so that \((1-t) x^\star + t z \in C \cap \mathcal{O}\), and this point has a strictly smaller objective value than \(x^\star\), contradicting local optimality.
Part 2. Take two global solutions and study the segment joining them: let \(x^\star\) and \(z\) be two solutions. We have \(f(x^\star) = f(z)\), and in turn, for all \(t \in [0,1]\), \[ f(x^\star) \leq f((1 - t) x^\star + t z) \leq (1-t) f(x^\star) + t f(z) = f(x^\star), \] where the first inequality comes from the global optimality of \(x^\star\), and the second from convexity together with \(f(z) = f(x^\star)\). Hence every point of the segment is also a global solution, and the solution set is convex.
Differentiable properties of a convex function
A differentiable function is convex if, and only if, it lies above all its tangents, see Figure 1.
Proposition 2 Let \(f\) be a differentiable function on an open set \(\Omega \subseteq \RR^{p}\), and \(C \subseteq \Omega\) a convex subset of \(\Omega\). Then \(f\) is convex on \(C\) if, and only if, \[ \forall x, \bar x \in C,\quad f(x) \geq f(\bar x) + \dotp{\nabla f(\bar x)}{x-\bar x} . \tag{3}\]
Proof. Assume \(f\) is convex on \(C\). Let \(x, \bar x \in C\) and \(\lambda \in (0,1)\). By Definition 3, we have \[ f(\lambda x + (1-\lambda)\bar x) - f(\bar x) \leq \lambda (f(x) - f(\bar x)) . \] Noting that \(\lambda x + (1-\lambda)\bar x = \bar x + \lambda(x - \bar x)\), dividing by \(\lambda\) and letting \(\lambda\) go to 0, we obtain Equation 3.
Reciprocally, let \(x,y \in C\), \(\lambda \in (0,1)\) and define \(z = \lambda x + (1-\lambda) y \in C\). Applying Equation 3 twice, we have \[\begin{align*} f(x) &\geq f(z) + \dotp{\nabla f(z)}{x-z} \\ f(y) &\geq f(z) + \dotp{\nabla f(z)}{y-z} . \end{align*}\] Taking the convex combination of these two lines with weights \(\lambda\) and \(1-\lambda\), we get \[ \lambda f(x) + (1-\lambda) f(y) \geq f(z) + \dotp{\nabla f(z)}{\lambda x + (1-\lambda)y - z} = f(z) , \] since \(\lambda x + (1-\lambda) y - z = 0\), which is exactly Equation 1.
A similar criterion can be derived for strictly convex functions. Remark that Theorem 1 is a direct consequence of Proposition 2 for the class of differentiable convex functions: a local minimum \(x^{\star}\) (in the unconstrained case, or when \(x^{\star}\) lies in the interior of \(C\)) satisfies \(\nabla f(x^{\star}) = 0\) and thus (Equation 3) gives \(f(x) \geq f(x^{\star})\) for all \(x \in C\).
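The first-order characterization of Proposition 2 can also be checked numerically. The sketch below (illustrative only; the least-squares objective and helper names are our own choice) verifies Equation 3 for a function whose gradient is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(10, 4))
b = rng.normal(size=10)

def f(x):
    """Least-squares objective f(x) = 0.5 * ||A x - b||^2 (convex, differentiable)."""
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

def grad_f(x):
    return A.T @ (A @ x - b)

for _ in range(1000):
    x, x_bar = rng.normal(size=4), rng.normal(size=4)
    # Equation 3: f lies above its tangent at x_bar
    assert f(x) >= f(x_bar) + grad_f(x_bar) @ (x - x_bar) - 1e-9
```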
Strongly convex functions
Definition
A strongly convex function satisfies a stronger version of Jensen's inequality.
Definition 4 (Strongly convex function) A function \(f: \RR^{n} \to \RRb\) is strongly convex of modulus \(\mu > 0\) if \(\dom f\) is convex and \[ f(\lambda x + (1-\lambda) y) \leq \lambda f(x) + (1-\lambda) f(y) - \frac12 \mu \lambda(1-\lambda) \|x-y\|^{2} , \] for all \(x,y \in \dom f\) and \(\lambda \in [0,1]\).
Another way to define strong convexity is to say that \(f\) is \(\mu\)-strongly convex if the function \[ x \mapsto f(x) - \frac{\mu}{2} \norm{x}^{2} \] is convex.
A strongly convex function defined on the whole space \(\RR^{d}\) has a unique minimizer.
Differentiability, strong convexity and quadratic lower bound
A differentiable function \(f\) is strongly convex if, and only if, it lies strictly above its tangents, and one can "fit" a quadratic between the tangent and the function itself. This is a key property to derive lower bounds on \(f\).
Proposition 3 Let \(f\) be a differentiable and \(\mu\)-strongly convex function. Then, \[ \forall x, \bar x \in \dom f,\quad f(x) \geq f(\bar x) + \langle {\nabla f(\bar x)},{x-\bar x}\rangle + \frac{\mu}{2}\|x - \bar x\|^{2} . \]
Note that if \(f\) is \(C^{2}\), then \(f\) is \(\mu\)-strongly convex if, and only if, its Hessian satisfies \(\nabla^{2} f(x) \geq \mu \Id\) for all \(x\), i.e., the smallest eigenvalue of the Hessian is at least \(\mu\) everywhere.
An easy consequence is a lower bound on the objective values in terms of the distance to the minimizer: \[ f(x) \geq f(x^{\star}) + \frac{\mu}{2}\|x - x^{\star} \|^{2} , \tag{4}\] where \(x^{\star}\) is the minimizer of \(f\); it is obtained by applying Proposition 3 at \(\bar x = x^{\star}\), where \(\nabla f(x^{\star}) = 0\).
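The sketch below (an illustration under our own choice of a ridge-type quadratic objective) computes \(\mu\) as the smallest Hessian eigenvalue, solves for the minimizer in closed form, and checks the quadratic lower bound of Equation 4.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
alpha = 0.1  # ridge parameter, makes the problem strongly convex

H = A.T @ A + alpha * np.eye(5)        # constant Hessian of f
mu = np.linalg.eigvalsh(H).min()       # strong convexity modulus
x_star = np.linalg.solve(H, A.T @ b)   # unique minimizer (normal equations)

def f(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2 + 0.5 * alpha * x @ x

for _ in range(1000):
    x = rng.normal(size=5)
    # Equation 4: suboptimality dominates a quadratic in the distance to x*
    assert f(x) >= f(x_star) + 0.5 * mu * np.linalg.norm(x - x_star) ** 2 - 1e-9
```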
Lipschitz continuity of the gradient
Definition
Definition 5 (\(L\)-smoothness) A differentiable function \(f: \mathbb{R}^{n} \to \mathbb{R}\) is said to be \(L\)-smooth if \(\nabla f\) is \(L\)-Lipschitz, i.e., \[ \forall x, y \in \dom f, \quad \| \nabla f(x) - \nabla f(y) \|_{*} \leq L \| x - y \| . \] Here \(\| \cdot \|_{*}\) is the dual norm of \(\|\cdot\|\).
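For a quadratic least-squares objective, the Lipschitz constant of the gradient with respect to the Euclidean norm is the largest eigenvalue of \(A^{\top}A\). The sketch below (illustrative only; with the \(\ell^{2}\) norm, \(\|\cdot\|_{*} = \|\cdot\|_{2}\)) checks Definition 5 on random pairs.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(30, 6))
b = rng.normal(size=30)

def grad_f(x):
    """Gradient of f(x) = 0.5 * ||A x - b||^2."""
    return A.T @ (A @ x - b)

L = np.linalg.eigvalsh(A.T @ A).max()  # smoothness constant for the l2 norm

for _ in range(1000):
    x, y = rng.normal(size=6), rng.normal(size=6)
    # Definition 5: the gradient is L-Lipschitz
    assert (np.linalg.norm(grad_f(x) - grad_f(y))
            <= L * np.linalg.norm(x - y) + 1e-9)
```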
Quadratic upper bound
\(L\)-smoothness is important in optimization because it ensures a quadratic upper bound of the function.
Proposition 4 Let \(f\) be an \(L\)-smooth function. Then, for all \(x, y \in \mathrm{dom}f\), \[ \langle \nabla f(x) - \nabla f(y), x-y \rangle \leq L \|x-y\|^{2} . \tag{5}\] Moreover, if \(\mathrm{dom}f\) is convex, then Equation 5 is equivalent to \[ f(y) \leq f(x) + \langle \nabla f(x), y-x \rangle + \frac{L}{2} \| x-y \|^{2}, \quad \forall x, y \in \mathrm{dom}f . \tag{6}\]
Proof. The fact that \(L\)-smoothness implies (Equation 5) is a direct application of the (generalized) Cauchy-Schwarz inequality.
For the equivalence of (Equation 5) and (Equation 6), let us prove the two directions. Given \(x, y \in \mathrm{dom}f\), define the real function \(g(t) = f(x + t(y-x))\).
Assume Equation 5 holds. Then \[ g'(t) - g'(0) = \langle \nabla f(x + t(y-x)) - \nabla f(x), y - x \rangle \leq t L \| x - y \|^{2} . \] Now, \[\begin{align*} f(y) = g(1) = g(0) + \int_{0}^{1} g'(t) \, \mathrm{d}t &\leq g(0) + g'(0) + \frac{L}{2} \| x - y \|^{2} \\ &= f(x) + \langle \nabla f(x), y-x \rangle + \frac{L}{2} \| x-y \|^{2} . \end{align*}\]
Reciprocally, if (Equation 6) holds, it is sufficient to apply it twice (to \((x,y)\) and to \((y,x)\)) and sum the two inequalities.
This result has a strong implication on the minima of \(f\).
Proposition 5 Let \(f\) be an \(L\)-smooth function with full domain, and assume \(x^{\star}\) is a minimizer of \(f\). Then, \[ \frac{1}{2L} \| \nabla f(z) \|_{*}^{2} \leq f(z) - f(x^{\star}) \leq \frac{L}{2} \| z - x^{\star} \|^{2} \quad \forall z \in \mathbb{R}^{n} . \]
Proof. The two sides are proved separately.
rhs: apply Equation 6 to \((x^{\star}, z)\); since \(\nabla f(x^{\star}) = 0\), the linear term vanishes and \(f(z) - f(x^{\star}) \leq \frac{L}{2} \| z - x^{\star} \|^{2}\).
lhs: minimize the quadratic upper bound (Equation 6) around \(x = z\): \[\begin{align*} f(x^{\star}) = \min_{y \in \mathbb{R}^{n}} f(y) &\leq \inf_{y} \left( f(z) + \langle \nabla f(z), y - z \rangle + \frac{L}{2} \| y - z \|^{2} \right) \\ &= \inf_{\|v\|=1} \inf_{t} \left( f(z) + t \langle \nabla f(z), v \rangle + \frac{Lt^{2}}{2} \right) , \end{align*}\] where the last equality is obtained by the change of variables \(v = \frac{w}{\|w\|}\), \(t = \|w\|\) with \(w = y-z\). Completing the square in \(t\) leads to: \[ f(z) + t \langle \nabla f(z), v \rangle + \frac{Lt^{2}}{2} = \frac{L}{2} \left( t + \frac{1}{L} \langle \nabla f(z), v \rangle \right)^{2} - \frac{1}{2L} \langle \nabla f(z), v \rangle^{2} + f(z) . \] Hence, \[ \min_{y \in \mathbb{R}^{n}} f(y) \leq \inf_{\|v\|=1} \left( f(z) - \frac{1}{2L} \langle \nabla f(z), v \rangle^{2} \right) . \] Minimizing the right-hand side amounts to maximizing \(\langle \nabla f(z), v \rangle^{2}\) over unit vectors \(v\); by definition of the dual norm, its maximal value is \(\| \nabla f(z) \|_{*}^{2}\), hence the result.
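Proposition 5 sandwiches the suboptimality between a gradient norm and a distance. The sketch below (illustrative; least-squares objective and Euclidean norm, so the dual norm is again \(\|\cdot\|_{2}\)) checks both sides numerically.

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(30, 6))
b = rng.normal(size=30)

def f(x):
    return 0.5 * np.linalg.norm(A @ x - b) ** 2

def grad_f(x):
    return A.T @ (A @ x - b)

L = np.linalg.eigvalsh(A.T @ A).max()
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)  # minimizer of f

for _ in range(1000):
    z = rng.normal(size=6)
    gap = f(z) - f(x_star)
    # lhs of Proposition 5
    assert np.linalg.norm(grad_f(z)) ** 2 / (2 * L) <= gap + 1e-9
    # rhs of Proposition 5
    assert gap <= 0.5 * L * np.linalg.norm(z - x_star) ** 2 + 1e-9
```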
Co-coercivity of the gradient of a \(L\)-smooth function
Convex \(L\)-smooth functions are co-coercive. This result is sometimes coined the Baillon–Haddad theorem (Baillon and Haddad 1977), but one shall note that the original contribution is much more general than its application to the gradient.
Proposition 6 Let \(f\) be a full-domain convex \(L\)-smooth function. Then, \(\nabla f\) is \(1/L\)-co-coercive, i.e., \[ \forall x, y \in \mathbb{R}^{n},\quad \langle \nabla f(x) - \nabla f(y), x-y \rangle \geq \frac{1}{L} \| \nabla f(x) - \nabla f(y) \|_{*}^{2} . \]
Proof. Let \(f_{x}(z) = f(z) - \langle \nabla f(x), z \rangle\) and \(f_{y}(z) = f(z) - \langle \nabla f(y), z \rangle\). The two are convex.
Since \(f\) is \(L\)-smooth, \(f_{x}\) and \(f_{y}\) are also \(L\)-smooth.
Canceling the gradient of \(f_{x}\) shows that \(z = x\) minimizes \(f_{x}\) (by convexity). Hence, using Proposition 5 (a consequence of the quadratic upper bound), we have \[\begin{align*} f(y) - f(x) - \langle \nabla f(x), y-x \rangle &= f_{x}(y) - f_{x}(x)\\ &\geq \frac{1}{2L} \| \nabla f_{x}(y) \|_{*}^{2} \\ &= \frac{1}{2L} \| \nabla f(y) - \nabla f(x) \|_{*}^{2} . \end{align*}\]
Doing exactly the same thing for \(f_{y}\) leads to a similar bound, and summing the two gives the result.
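Co-coercivity is also easy to observe numerically. The sketch below (illustrative; a convex quadratic with the Euclidean norm) checks Proposition 6 on random pairs of points.

```python
import numpy as np

rng = np.random.default_rng(6)
A = rng.normal(size=(30, 6))

def grad_f(x):
    """Gradient of the convex L-smooth function f(x) = 0.5 * ||A x||^2."""
    return A.T @ (A @ x)

L = np.linalg.eigvalsh(A.T @ A).max()

for _ in range(1000):
    x, y = rng.normal(size=6), rng.normal(size=6)
    gx, gy = grad_f(x), grad_f(y)
    # Proposition 6: 1/L co-coercivity of the gradient
    assert (gx - gy) @ (x - y) >= np.linalg.norm(gx - gy) ** 2 / L - 1e-9
```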
It turns out that smoothness, co-coercivity and the quadratic upper bound are equivalent properties for full-domain convex functions.
Proposition 7 Let \(f\) be a differentiable convex function with \(\dom f = \mathbb{R}^{n}\). Then the three following properties are equivalent:
\(f\) is \(L\)-smooth.
The quadratic upper bound Equation 6 holds.
\(\nabla f\) is \(1/L\)-co-coercive.
Proof. 1. \(\implies\) 2. is Proposition 4, and 2. \(\implies\) 3. follows from the proof of Proposition 6, which only uses the quadratic upper bound. The only implication left to prove is 3. \(\implies\) 1., which follows by applying the (generalized) Cauchy–Schwarz inequality to the co-coercivity inequality.
For the special case of the \(\ell^{2}\) norm, we have:
Proposition 8 Let \(f\) be a differentiable convex function with \(\dom f = \mathbb{R}^{n}\), and \(L\)-smooth with respect to the Euclidean norm. Then,
\(L \mathrm{Id} - \nabla f\) is monotone, i.e. \[ \forall x,y \in \mathbb{R}^{n}, \quad \langle \nabla f(x) - \nabla f(y), x - y \rangle \leq L \| x - y \|_{2}^{2} . \]
\(\frac{L}{2} \| x \|_{2}^{2} - f(x)\) is convex.
If \(f\) is twice differentiable, then, \[ \lambda_{\mathrm{max}}(\nabla^{2} f(x)) \leq L, \quad \forall x \in \mathbb{R}^{n} . \]
When \(f\) is both \(L\)-smooth and \(\mu\)-strongly convex, we have the following "strengthened" co-coercivity.
Proposition 9 Let \(f\) be an \(L\)-smooth and \(\mu\)-strongly convex function. For any \(x, y \in \RR^{n}\), we have \[ \dotp{\nabla f(x) - \nabla f(y)}{x-y} \geq \frac{\mu L }{\mu + L} \norm{x-y}^{2} + \frac{1}{\mu + L} \norm{\nabla f(x) - \nabla f(y)}^{2} . \]
Proof. Consider the convex function \(\phi(x) = f(x) - \frac{\mu}{2} \norm{x}^{2}\). Its gradient reads \(\nabla \phi(x) = \nabla f(x) - \mu x\). Using the \(L\)-smoothness of \(f\) (Equation 5), we have that \[\begin{align*} \dotp{\nabla \phi(x) - \nabla \phi(y)}{x-y} &= \dotp{\nabla f(x) - \nabla f(y)}{x-y} - \mu\dotp{x-y}{x-y} \\ &\leq (L-\mu) \norm{x-y}^{2} . \end{align*}\] Hence, by Propositions 4 and 7, \(\phi\) is \((L-\mu)\)-smooth.
If \(L=\mu\), then nothing is left to prove since \(f\) is \(\mu\)-strongly convex and \(L\)-smooth. Otherwise, the \(1/(L-\mu)\)-co-coercivity of \(\nabla \phi\) (Proposition 6) gives us \[ \dotp{\nabla \phi(x) - \nabla \phi(y)}{x-y} \geq \frac{1}{L-\mu} \norm{\nabla \phi(x) - \nabla \phi(y)}^{2} . \] Plugging in \(\nabla \phi(x) = \nabla f(x) - \mu x\) and rearranging allows us to conclude.
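Finally, a small sketch (illustrative; a ridge-type quadratic with the Euclidean norm, \(\mu\) and \(L\) taken as the extreme Hessian eigenvalues) checking the strengthened co-coercivity of Proposition 9.

```python
import numpy as np

rng = np.random.default_rng(7)
A = rng.normal(size=(30, 6))
alpha = 0.5
H = A.T @ A + alpha * np.eye(6)  # constant Hessian of the quadratic objective
mu, L = np.linalg.eigvalsh(H).min(), np.linalg.eigvalsh(H).max()

def grad_f(x):
    """Gradient of f(x) = 0.5 * x^T H x (mu-strongly convex and L-smooth)."""
    return H @ x

for _ in range(1000):
    x, y = rng.normal(size=6), rng.normal(size=6)
    g = grad_f(x) - grad_f(y)
    lhs = g @ (x - y)
    rhs = (mu * L / (mu + L)) * np.linalg.norm(x - y) ** 2 \
          + np.linalg.norm(g) ** 2 / (mu + L)
    assert lhs >= rhs - 1e-9  # Proposition 9
```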