
SeminarFromTheoryToAlgorithms

Website of the Seminar on the Second Part of the Book "Understanding Machine Learning", taught in Summer Term 2019 by Pascal Welke and Michael Kamp

Stochastic Gradient Descent

§1. Classical Gradient Descent

In Classical Gradient Descent we want to minimize $f \in \mathrm{C}^1(S, \mathbb{R})$ where $S \subseteq \mathbb{R}^n$ is open. The idea is to iteratively descend in the direction of the negative gradient of $f$ with a small step size.

Definition (Gradient Descent):

Input: $\eta > 0$, $T \in \mathbb{N}$, $f:S \rightarrow \mathbb{R}$

Initialize $w^1 = 0$. For $t = 1, \dots, T$ update

$$w^{t+1} = w^t - \eta \nabla f(w^t).$$

Output: $\overline{w} = \frac{1}{T} \sum_{t=1}^T w^t$.

Other options: Return the last iterate $w^T$ or the best-performing iterate $w^{t_0}$ such that $f(w^{t_0}) \leq f(w^t)$ for all $t \leq T$.
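
To make the procedure concrete, here is a minimal sketch in Python (our own illustration, not pseudocode from the book), run on the quadratic $f(w) = \lVert w - c \rVert^2$ with gradient $2(w - c)$:

```python
import numpy as np

def gradient_descent(grad, eta, T, n):
    """Run T steps of w^{t+1} = w^t - eta * grad(w^t) from w^1 = 0
    and return the averaged iterate (1/T) * sum_t w^t."""
    w = np.zeros(n)
    avg = np.zeros(n)
    for _ in range(T):
        avg += w / T
        w = w - eta * grad(w)
    return avg

# Example: f(w) = ||w - c||^2 with gradient 2 * (w - c).
c = np.array([1.0, -2.0])
w_bar = gradient_descent(lambda w: 2 * (w - c), eta=0.1, T=1000, n=2)
print(w_bar)  # close to the minimizer c
```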

§2. Subgradients

In this chapter we generalize Gradient Descent to non-differentiable functions.

Recall: If $f:S\rightarrow \mathbb{R}$ is convex, then for any point $w \in S$ at which $f$ is differentiable we have

$$f(u) \geq f(w) + \langle u - w, \nabla f(w) \rangle \quad \text{for all } u \in S,$$

where $u \mapsto f(w) + \langle u - w, \nabla f(w) \rangle$ is the tangent hyperplane of $f$ at $w$.

Lemma: Let $S\subseteq \mathbb{R}^n$ be convex and open. Then a function $f:S \rightarrow \mathbb{R}$ is convex if and only if for every $w \in S$ there exists a $v \in \mathbb{R}^n$ such that

$$f(u) \geq f(w) + \langle u - w, v \rangle \quad \text{for all } u \in S.$$
Definition: Let $f:S \rightarrow \mathbb{R}$ be a function. A vector $v \in \mathbb{R}^n$ satisfying

$$f(u) \geq f(w) + \langle u - w, v \rangle \quad \text{for all } u \in S$$

at a given point $w \in S$ is called a subgradient of $f$ at $w$. The set of all subgradients at $w$ is denoted by $\partial f(w)$.

Facts:

- If $f$ is convex, then $\partial f(w)$ is non-empty for every $w \in S$.
- If $f$ is differentiable at $w$, then $\partial f(w) = \{\nabla f(w)\}$.
- $f$ is $\rho$-Lipschitz if and only if $\lVert v \rVert \leq \rho$ for all $w \in S$ and all $v \in \partial f(w)$.

Example: For $f(x) = \lvert x \rvert$ we have $\partial f(x) = \{\mathrm{sign}(x)\}$ for $x \neq 0$ and $\partial f(0) = [-1, 1]$, since every line through the origin with slope in $[-1, 1]$ lies below the graph of $f$.
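
As a quick numerical sanity check (a sketch we add here; the grids and tolerance are arbitrary choices), one can verify that every $v \in [-1, 1]$ satisfies the subgradient inequality for $f(x) = \lvert x \rvert$ at $w = 0$:

```python
import numpy as np

# f(x) = |x| is convex but not differentiable at 0.
f = lambda x: np.abs(x)

w = 0.0
us = np.linspace(-2.0, 2.0, 401)   # test points u
vs = np.linspace(-1.0, 1.0, 21)    # candidate subgradients at w = 0

# Subgradient inequality: f(u) >= f(w) + v * (u - w) for all u.
for v in vs:
    assert np.all(f(us) >= f(w) + v * (us - w) - 1e-12), v

print("every v in [-1, 1] is a subgradient of |x| at 0")
```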

§3. Lemma 14.1

For later convergence theorems we need the following lemma.

Lemma 14.1: Let $v_1, \dots, v_T \in \mathbb{R}^n$. Then any algorithm with initialization $w^1=0$ and update rule $w^{t+1} = w^t - \eta v_t$ for $\eta >0$ satisfies for all $w^* \in \mathbb{R}^n$

$$\sum_{t=1}^T \langle w^t - w^*, v_t \rangle \leq \frac{\lVert w^* \rVert^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^T \lVert v_t \rVert^2.$$

In particular, for every $B, \rho > 0$, if $\lVert v_1 \rVert, \dots, \lVert v_T \rVert \leq \rho$ and $w^* \in \overline{\mathbb{B}}_B(0)$, then with $\eta = \frac{B}{\rho} \frac{1}{\sqrt{T}}$ we have

$$\frac{1}{T} \sum_{t=1}^T \langle w^t - w^*, v_t \rangle \leq \frac{B \rho}{\sqrt{T}}.$$

Proof: Recall the polarization identity

$$\langle a, b \rangle = \frac{1}{2} \left( \lVert a \rVert^2 + \lVert b \rVert^2 - \lVert a - b \rVert^2 \right).$$

Hence, with $a = w^t - w^*$, $b = \eta v_t$ and $w^t - \eta v_t = w^{t+1}$,

$$\langle w^t - w^*, v_t \rangle = \frac{1}{\eta} \langle w^t - w^*, \eta v_t \rangle = \frac{1}{2\eta} \left( \lVert w^t - w^* \rVert^2 - \lVert w^{t+1} - w^* \rVert^2 \right) + \frac{\eta}{2} \lVert v_t \rVert^2.$$

Summing over $t$ we get

$$\sum_{t=1}^T \langle w^t - w^*, v_t \rangle = \frac{1}{2\eta} \left( \lVert w^1 - w^* \rVert^2 - \lVert w^{T+1} - w^* \rVert^2 \right) + \frac{\eta}{2} \sum_{t=1}^T \lVert v_t \rVert^2 \leq \frac{\lVert w^* \rVert^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^T \lVert v_t \rVert^2,$$

since the second sum is telescopic and $w^1=0$ by assumption. $\square$
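
Since the lemma is purely algebraic, it can be verified numerically. The following sketch (our addition; the random directions $v_t$ and the point $w^*$ are arbitrary) checks the first inequality:

```python
import numpy as np

rng = np.random.default_rng(0)
n, T, eta = 5, 100, 0.1

vs = rng.normal(size=(T, n))   # arbitrary update directions v_1, ..., v_T
w_star = rng.normal(size=n)    # arbitrary comparison point w*

# Run the update rule w^{t+1} = w^t - eta * v_t starting from w^1 = 0.
w = np.zeros(n)
lhs = 0.0
for v in vs:
    lhs += (w - w_star) @ v
    w = w - eta * v

rhs = w_star @ w_star / (2 * eta) + eta / 2 * np.sum(vs ** 2)
print(lhs <= rhs)              # True: the bound of Lemma 14.1 holds
```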

§4. Stochastic Gradient Descent

Definition (Stochastic Gradient Descent):

Input: $\eta > 0$, $T \in \mathbb{N}$, $f:S \rightarrow \mathbb{R}$

Initialize $w^1 = 0$. For $t = 1, \dots, T$ draw a random vector $v_t$ with $\mathbb{E}[v_t \mid w^t] \in \partial f(w^t)$ and update

$$w^{t+1} = w^t - \eta v_t.$$

Output: $\overline{w} = \frac{1}{T} \sum_{t=1}^T w^t$.

Theorem 14.8: Let $B, \rho >0$ and $f:S \rightarrow \mathbb{R}$ convex. Let $w^* \in \mathrm{argmin}_{\lVert w \rVert \leq B} f(w)$ for $S \subseteq \mathbb{R}^n$. Assume that SGD runs with $T$ iterations and step size $\eta = \frac{B}{\rho}\frac{1}{\sqrt{T}}$, and assume that $\lVert v_1 \rVert, \dots, \lVert v_T \rVert \leq \rho$ almost surely. Then

$$\mathbb{E}\left[ f(\overline{w}) \right] - f(w^*) \leq \frac{B \rho}{\sqrt{T}}.$$

Therefore, for given $\epsilon > 0$, to achieve $\mathbb{E}\left[f(\overline{w})\right] - f(w^*) \leq \epsilon$ it suffices to run $T \geq (B\rho / \epsilon)^2$ iterations of SGD.
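
For instance, with $B = \rho = 1$, an accuracy of $\epsilon = 0.01$ already requires $T \geq (1 \cdot 1 / 0.01)^2 = 10^4$ iterations.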

Proof: We use the notation $v_{1:T} = v_1, \dots, v_T$. Since $f$ is convex we can apply Jensen's inequality to obtain

$$f(\overline{w}) = f\left( \frac{1}{T} \sum_{t=1}^T w^t \right) \leq \frac{1}{T} \sum_{t=1}^T f(w^t)$$

for the iterates $w^1, \dots, w^T$ of SGD. The expectation preserves this inequality and we obtain

$$\underset{v_{1:T}}{\mathbb{E}} \left[ f(\overline{w}) - f(w^*) \right] \leq \underset{v_{1:T}}{\mathbb{E}} \left[ \frac{1}{T} \sum_{t=1}^T \left( f(w^t) - f(w^*) \right) \right].$$

Using Lemma 14.1 we have

$$\underset{v_{1:T}}{\mathbb{E}} \left[ \frac{1}{T} \sum_{t=1}^T \langle w^t - w^*, v_t \rangle \right] \leq \frac{B \rho}{\sqrt{T}}.$$

It is therefore enough to show that

$$\underset{v_{1:T}}{\mathbb{E}} \left[ \frac{1}{T} \sum_{t=1}^T \left( f(w^t) - f(w^*) \right) \right] \leq \underset{v_{1:T}}{\mathbb{E}} \left[ \frac{1}{T} \sum_{t=1}^T \langle w^t - w^*, v_t \rangle \right].$$

Using linearity of expectation we get

$$\underset{v_{1:T}}{\mathbb{E}} \left[ \frac{1}{T} \sum_{t=1}^T \langle w^t - w^*, v_t \rangle \right] = \frac{1}{T} \sum_{t=1}^T \underset{v_{1:t}}{\mathbb{E}} \left[ \langle w^t - w^*, v_t \rangle \right],$$

where the inner expectation may be restricted to $v_{1:t}$ since $\langle w^t - w^*, v_t \rangle$ does not depend on $v_{t+1}, \dots, v_T$. Next recall the law of total expectation: Let $\alpha$, $\beta$ be random variables and $g$ some function; then $\mathbb{E}_{\alpha}\left[ g(\alpha) \right] = \mathbb{E}_{\beta}\left[ \mathbb{E}_{\alpha}\left[ g(\alpha) \mid \beta \right] \right]$. Put $\alpha = v_{1:t}$ and $\beta = v_{1:t-1}$; then

$$\underset{v_{1:t}}{\mathbb{E}} \left[ \langle w^t - w^*, v_t \rangle \right] = \underset{v_{1:t-1}}{\mathbb{E}} \left[ \underset{v_t}{\mathbb{E}} \left[ \langle w^t - w^*, v_t \rangle \mid v_{1:t-1} \right] \right].$$

But now $w^t$ is completely determined by $v_{1:t-1}$ since $w^t = -\eta(v_1 + \dots + v_{t-1})$. It follows that $\underset{v_{t}}{\mathbb{E}} [v_t \mid v_{1:t-1}] = \underset{v_{t}}{\mathbb{E}} [v_t \mid w^t] \in \partial f(w^t)$. Therefore, by the subgradient inequality,

$$\underset{v_t}{\mathbb{E}} \left[ \langle w^t - w^*, v_t \rangle \mid v_{1:t-1} \right] = \left\langle w^t - w^*, \underset{v_t}{\mathbb{E}} \left[ v_t \mid w^t \right] \right\rangle \geq f(w^t) - f(w^*),$$

which also implies

$$\underset{v_{1:t}}{\mathbb{E}} \left[ \langle w^t - w^*, v_t \rangle \right] \geq \underset{v_{1:T}}{\mathbb{E}} \left[ f(w^t) - f(w^*) \right].$$

Now summing over $t \leq T$ and dividing by $T$ gives the desired inequality. To achieve

$$\mathbb{E}\left[ f(\overline{w}) \right] - f(w^*) \leq \frac{B \rho}{\sqrt{T}} \leq \epsilon$$

we need $T \geq (B \rho / \epsilon)^2$ iterations. $\square$
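
To see the guarantee in action, here is a sketch of SGD (our own choice of objective and noise model) on $f(w) = \lVert w \rVert$, whose minimizer is $w^* = 0$, with bounded mean-zero noise added to a subgradient and the step size $\eta = \frac{B}{\rho}\frac{1}{\sqrt{T}}$ from the theorem:

```python
import numpy as np

rng = np.random.default_rng(1)
n, T, B, rho = 2, 10_000, 1.0, 2.0
eta = B / rho / np.sqrt(T)                   # step size from Theorem 14.8

def subgrad(w):
    """A subgradient of f(w) = ||w||: w/||w|| for w != 0, and 0 at w = 0."""
    nrm = np.linalg.norm(w)
    return w / nrm if nrm > 0 else np.zeros_like(w)

w = np.zeros(n)
avg = np.zeros(n)
for _ in range(T):
    avg += w / T
    noise = rng.uniform(-0.5, 0.5, size=n)   # mean-zero, bounded noise
    v = subgrad(w) + noise                   # E[v | w] is a subgradient of f at w
    w = w - eta * v

# f(w*) = 0 here, so the suboptimality of the averaged iterate is ||avg||.
print(np.linalg.norm(avg), B * rho / np.sqrt(T))  # first is typically well below second
```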

§5. Stochastic Gradient Descent for Risk Minimization

In Learning Theory we want to minimize the risk

$$\mathcal{L}_{\mathcal{D}}(w) = \underset{z \sim \mathcal{D}}{\mathbb{E}} \left[ l(w, z) \right].$$

SGD allows us to directly minimize $\mathcal{L}_{\mathcal{D}}$. For simplicity we assume that $l(-, z)$ is differentiable for all $z \in Z$. We construct the random direction $v_t$ as follows: Sample $z \sim \mathcal{D}$ and put

$$v_t = \nabla l(w^t, z),$$

where the gradient is taken w.r.t. $w$. Interchanging integration and gradient we get

$$\mathbb{E} \left[ v_t \mid w^t \right] = \underset{z \sim \mathcal{D}}{\mathbb{E}} \left[ \nabla l(w^t, z) \right] = \nabla \underset{z \sim \mathcal{D}}{\mathbb{E}} \left[ l(w^t, z) \right] = \nabla \mathcal{L}_{\mathcal{D}}(w^t).$$

The same argument can be applied to the subgradient case. Let $v_t \in \partial l(w^t, z)$ for a sample $z \sim \mathcal{D}$. Then by definition for all $u$

$$l(u, z) - l(w^t, z) \geq \langle u - w^t, v_t \rangle.$$

By applying the expectation on both sides of the inequality we get

$$\mathcal{L}_{\mathcal{D}}(u) - \mathcal{L}_{\mathcal{D}}(w^t) \geq \left\langle u - w^t, \mathbb{E} \left[ v_t \mid w^t \right] \right\rangle,$$
and therefore $\mathbb{E}[v_t \mid w^t] \in \partial \mathcal{L}_{\mathcal{D}}(w^t)$. This yields the following special form of SGD.

Definition (Stochastic Gradient Descent for Risk Minimization):

Input: $\eta > 0$, $T \in \mathbb{N}$

Initialize $w^1 = 0$. For $t = 1, \dots, T$ sample $z \sim \mathcal{D}$, pick $v_t \in \partial l(w^t, z)$ and update

$$w^{t+1} = w^t - \eta v_t.$$

Output: $\overline{w} = \frac{1}{T} \sum_{t=1}^T w^t$.

Using Theorem 14.8 we get the following corollary.

Corollary: Let $B, \rho > 0$ and $\mathcal{L}_{\mathcal{D}}$ convex such that

$$\lVert v_t \rVert \leq \rho \quad \text{for all } t \leq T$$

almost surely. Let $w^* \in \mathrm{argmin}_{\lVert w \rVert \leq B} \mathcal{L}_{\mathcal{D}}(w)$. Then to achieve

$$\mathbb{E} \left[ \mathcal{L}_{\mathcal{D}}(\overline{w}) \right] - \mathcal{L}_{\mathcal{D}}(w^*) \leq \epsilon$$

for given $\epsilon > 0$ it suffices to run SGD with step size $\eta = \frac{B}{\rho}\frac{1}{\sqrt{T}}$ and $T \geq (B \rho / \epsilon)^2$ iterations.
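
As an illustration of the risk-minimization variant (a sketch with a synthetic distribution of our choosing; note that with Gaussian features the gradients are not almost surely bounded, so the corollary's assumption only holds approximately here), SGD on the squared loss $l(w, (x, y)) = (\langle w, x \rangle - y)^2$, sampling a fresh $z \sim \mathcal{D}$ in every step:

```python
import numpy as np

rng = np.random.default_rng(2)
n, T = 3, 50_000
w_true = np.array([0.5, -1.0, 2.0])   # ground truth of the synthetic distribution

def sample_z():
    """Draw z = (x, y) ~ D: Gaussian features, noisy linear labels."""
    x = rng.normal(size=n)
    y = w_true @ x + 0.1 * rng.normal()
    return x, y

eta = 0.01                            # fixed small step size for this illustration
w = np.zeros(n)
avg = np.zeros(n)
for _ in range(T):
    avg += w / T
    x, y = sample_z()
    v = 2 * (w @ x - y) * x           # gradient of l(w, z) = (<w, x> - y)^2 w.r.t. w
    w = w - eta * v

print(avg)                            # approaches the risk minimizer w_true
```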

Questions

Is the learning rate of SGD in Theorem 14.8 decaying?

Is $w^*$ truly the minimum? If not, and we are given some arbitrary $w^*$, would the algorithm converge to it?

Where is the convexity requirement critical?

How do we sample from the distribution $\mathcal{D}$?

If one knows the distribution, can one calculate the gradient directly?

Are there cases where it is better to use SGD over GD and vice versa?