Stochastic Gradient Descent
§1. Classical Gradient Descent
In Classical Gradient Descent we want to minimize $f \in \mathrm{C}^1(S, \mathbb{R})$, where $S \subseteq \mathbb{R}^n$ is open. The idea is to iteratively take small steps in the direction of the negative gradient of $f$.
Definition (Gradient Descent):
Input: $\eta > 0$, $T \in \mathbb{N}$, $f:S \rightarrow \mathbb{R}$
- Initialize $w^1 \in S$
- For $t = 1…T$
- $w^{t+1} = w^t - \eta \nabla f(w^t)$
- Return $\overline{w} = \frac{1}{T}\sum_{t\leq T} w^t$.
Other options: Return $w^T$ or best performing $w^{t_0}$ such that $f(w^{t_0}) \leq f(w^t)$ for all $t \leq T$.
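The procedure above can be sketched in a few lines. The objective $f(w) = (w-3)^2$ with gradient $f'(w) = 2(w-3)$, the starting point, and the parameter choices are illustrative assumptions, not part of the notes.

```python
# Gradient Descent sketch: update w^{t+1} = w^t - eta * grad(w^t),
# return the averaged iterate w_bar = (1/T) sum_t w^t.
# Example objective (an assumption): f(w) = (w - 3)^2, so grad f(w) = 2(w - 3).

def gradient_descent(grad, w1, eta, T):
    """Run T iterations of gradient descent and return the average iterate."""
    w = w1
    iterates = [w]
    for _ in range(T - 1):
        w = w - eta * grad(w)  # w^{t+1} = w^t - eta * grad f(w^t)
        iterates.append(w)
    return sum(iterates) / len(iterates)

w_bar = gradient_descent(lambda w: 2 * (w - 3), w1=0.0, eta=0.1, T=500)
print(abs(w_bar - 3) < 0.1)  # True: the averaged iterate approaches w* = 3
```

Returning $w^T$ or the best iterate would replace the final averaging step; for the convergence results below, the average is the convenient choice.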
§2. Subgradients
In this chapter we generalize Gradient Descent to non-differentiable functions.
Recall: If $f:S\rightarrow \mathbb{R}$ is convex and differentiable at a point $w \in S$, then for all $u \in S$
$$f(u) \geq f(w) + \langle u - w, \nabla f(w) \rangle,$$
where $u \mapsto f(w) + \langle u - w, \nabla f(w) \rangle$ is the tangent hyperplane of $f$ at $w$. In other words, a convex function lies above its tangent hyperplanes.
Lemma: Let $S\subseteq \mathbb{R}^n$ be convex and open. Then a function $f:S \rightarrow \mathbb{R}$ is convex if and only if for every $w \in S$ there exists $v \in \mathbb{R}^n$ such that
$$f(u) \geq f(w) + \langle u - w, v \rangle \quad \text{for all } u \in S.$$
Definition: Let $f:S \rightarrow \mathbb{R}$ be a function. A vector $v \in \mathbb{R}^n$ satisfying
$$f(u) \geq f(w) + \langle u - w, v \rangle \quad \text{for all } u \in S$$
at a given point $w \in S$ is called a subgradient of $f$ at $w$. The set of all subgradients at $w$ is denoted by $\partial f(w)$.
Facts:
- If $f$ is convex then $\partial f(w) \neq \emptyset$ for all $w \in S$.
- If $f$ is convex and differentiable at $w$ then $\partial f(w) = \{\nabla f(w)\}$.
Example: For $f(x) = |x|$ we have $\partial f(0) = [-1, 1]$, since $|u| \geq vu$ holds for all $u \in \mathbb{R}$ exactly when $|v| \leq 1$. Away from $0$, $f$ is differentiable and $\partial f(x) = \{\mathrm{sign}(x)\}$.
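The subgradient inequality can be checked numerically. A minimal sketch for $f(x) = |x|$ at $w = 0$; the particular choice $v = 0.3 \in [-1,1]$ and the grid of test points are illustrative assumptions.

```python
# Check the subgradient inequality f(u) >= f(w) + <u - w, v>
# for f(x) = |x| at w = 0, where any v in [-1, 1] is a subgradient.

f = abs
w, v = 0.0, 0.3  # v = 0.3 lies in the subdifferential [-1, 1] at 0
ok = all(f(u) >= f(w) + (u - w) * v for u in [x / 10 for x in range(-50, 51)])
print(ok)  # True: the inequality holds on the whole test grid
```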
§3. Lemma 14.1
For later convergence theorems we need the following lemma.
Lemma 14.1: Let $v_1, …, v_T \in \mathbb{R}^n$. Then any algorithm with initialization $w^1=0$ and update rule $w^{t+1} = w^t - \eta v_t$ for $\eta >0$ satisfies for all $w^* \in \mathbb{R}^n$
$$\sum_{t=1}^T \langle w^t - w^*, v_t \rangle \leq \frac{\lVert w^* \rVert^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^T \lVert v_t \rVert^2.$$
In particular, for every $B, \rho > 0$, if $\lVert v_1 \rVert, …, \lVert v_T \rVert \leq \rho$ and $w^* \in \overline{\mathbb{B}}_B(0)$, then with $\eta = \frac{B}{\rho} \frac{1}{\sqrt{T}}$ we have
$$\frac{1}{T}\sum_{t=1}^T \langle w^t - w^*, v_t \rangle \leq \frac{B\rho}{\sqrt{T}}.$$
Proof: Recall the polarization identity
$$\langle a, b \rangle = \frac{1}{2}\left( \lVert a \rVert^2 + \lVert b \rVert^2 - \lVert a - b \rVert^2 \right).$$
Hence, applying it with $a = w^t - w^*$ and $b = \eta v_t$ and using $w^t - \eta v_t = w^{t+1}$,
$$\langle w^t - w^*, v_t \rangle = \frac{1}{2\eta}\left( \lVert w^t - w^* \rVert^2 - \lVert w^{t+1} - w^* \rVert^2 \right) + \frac{\eta}{2} \lVert v_t \rVert^2.$$
Summing over $t$ we get
$$\sum_{t=1}^T \langle w^t - w^*, v_t \rangle = \frac{1}{2\eta}\left( \lVert w^1 - w^* \rVert^2 - \lVert w^{T+1} - w^* \rVert^2 \right) + \frac{\eta}{2}\sum_{t=1}^T \lVert v_t \rVert^2 \leq \frac{\lVert w^* \rVert^2}{2\eta} + \frac{\eta}{2}\sum_{t=1}^T \lVert v_t \rVert^2,$$
since the second sum is telescopic and $w^1=0$ by assumption. The second claim follows by bounding $\lVert w^* \rVert \leq B$ and $\lVert v_t \rVert \leq \rho$, inserting $\eta = \frac{B}{\rho}\frac{1}{\sqrt{T}}$, and dividing by $T$. $\square$
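Since the lemma holds deterministically for any choice of directions, it can be sanity-checked numerically. A scalar sketch; the step size, horizon, $w^*$, and the random directions are all arbitrary illustrative choices.

```python
# Sanity check of Lemma 14.1: with w^1 = 0 and w^{t+1} = w^t - eta * v_t,
# sum_t <w^t - w*, v_t>  <=  ||w*||^2 / (2 eta) + (eta / 2) sum_t ||v_t||^2.
import random

random.seed(0)
eta, T = 0.05, 100
w, w_star = 0.0, 1.7           # scalar case for simplicity; w* is arbitrary
lhs, sq_norms = 0.0, 0.0
for _ in range(T):
    v = random.uniform(-1, 1)  # arbitrary bounded direction v_t
    lhs += (w - w_star) * v    # accumulate <w^t - w*, v_t>
    sq_norms += v * v          # accumulate ||v_t||^2
    w = w - eta * v            # update w^{t+1} = w^t - eta * v_t
rhs = w_star ** 2 / (2 * eta) + eta / 2 * sq_norms
print(lhs <= rhs)  # True: the bound holds for any sequence of directions
```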
§4. Stochastic Gradient Descent
Definition (Stochastic Gradient Descent):
Input: $\eta > 0$, $T \in \mathbb{N}, f:S \rightarrow \mathbb{R}$
- Initialize $w^1=0$
- For $t=1…T$
- Draw $v_t$ according to some probability distribution such that $\mathbb{E}[v_t \mid w^t] \in \partial f(w^t)$
- $w^{t+1} = w^t - \eta v_t$
- Return $\overline{w} = \frac{1}{T} \sum_{t=1}^T w^t.$
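A minimal sketch of the algorithm above. The objective $f(w) = (w-3)^2$ and the noise model (exact gradient plus zero-mean Gaussian noise, so that $\mathbb{E}[v_t \mid w^t] = \nabla f(w^t)$) are illustrative assumptions.

```python
# SGD sketch: v_t is a noisy direction with E[v_t | w^t] = f'(w^t).
# Assumed objective: f(w) = (w - 3)^2 with gradient 2(w - 3).
import random

random.seed(1)
eta, T = 0.05, 2000
w, iterates = 0.0, []
for _ in range(T):
    iterates.append(w)
    v = 2 * (w - 3) + random.gauss(0, 1)  # unbiased gradient estimate
    w = w - eta * v                        # w^{t+1} = w^t - eta * v_t
w_bar = sum(iterates) / len(iterates)      # averaged iterate
print(abs(w_bar - 3) < 0.5)  # True: w_bar is close to the minimizer w* = 3
```

Despite the noise in each step, averaging the iterates smooths out the fluctuations, which is exactly what the theorem below quantifies.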
Theorem: Let $B, \rho >0$ and let $f:S \rightarrow \mathbb{R}$ be convex for $S \subseteq \mathbb{R}^n$. Let $w^* \in \mathrm{argmin}_{\lVert w \rVert \leq B} f(w)$. Assume that SGD runs for $T$ iterations with step size $\eta = \frac{B}{\rho}\frac{1}{\sqrt{T}}$ and that $\lVert v_1 \rVert, …, \lVert v_T \rVert \leq \rho$ almost surely. Then
$$\mathbb{E}[f(\overline{w})] - f(w^*) \leq \frac{B\rho}{\sqrt{T}}.$$
Therefore, for given $\epsilon > 0$, to achieve $\mathbb{E}[f(\overline{w})] - f(w^*) \leq \epsilon$ one needs to run $T \geq (B\rho / \epsilon)^2$ iterations of SGD.
Proof: We use the notation $v_{1:T} = v_1, …, v_T$. Since $f$ is convex, we can apply Jensen's inequality to obtain
$$f(\overline{w}) = f\Big(\frac{1}{T}\sum_{t=1}^T w^t\Big) \leq \frac{1}{T}\sum_{t=1}^T f(w^t)$$
for the iterates $w^1, …, w^T$ of SGD. The expectation preserves this inequality and we obtain
$$\mathbb{E}_{v_{1:T}}[f(\overline{w})] - f(w^*) \leq \mathbb{E}_{v_{1:T}}\Big[\frac{1}{T}\sum_{t=1}^T \big(f(w^t) - f(w^*)\big)\Big].$$
Using Lemma 14.1 we have
$$\mathbb{E}_{v_{1:T}}\Big[\frac{1}{T}\sum_{t=1}^T \langle w^t - w^*, v_t \rangle\Big] \leq \frac{B\rho}{\sqrt{T}}.$$
It is therefore enough to show that
$$\mathbb{E}_{v_{1:T}}\Big[\frac{1}{T}\sum_{t=1}^T \big(f(w^t) - f(w^*)\big)\Big] \leq \mathbb{E}_{v_{1:T}}\Big[\frac{1}{T}\sum_{t=1}^T \langle w^t - w^*, v_t \rangle\Big].$$
Using linearity of expectation we get
$$\mathbb{E}_{v_{1:T}}\Big[\frac{1}{T}\sum_{t=1}^T \langle w^t - w^*, v_t \rangle\Big] = \frac{1}{T}\sum_{t=1}^T \mathbb{E}_{v_{1:T}}\big[\langle w^t - w^*, v_t \rangle\big].$$
Next recall the law of total expectation: Let $\alpha$, $\beta$ be random variables and $g$ some function; then $\mathbb{E}_\alpha[g(\alpha)] = \mathbb{E}_\beta \mathbb{E}_\alpha[g(\alpha) \mid \beta]$. Put $\alpha = v_{1:t}$ and $\beta = v_{1:t-1}$; then
$$\mathbb{E}_{v_{1:T}}\big[\langle w^t - w^*, v_t \rangle\big] = \mathbb{E}_{v_{1:t-1}} \mathbb{E}_{v_t}\big[\langle w^t - w^*, v_t \rangle \mid v_{1:t-1}\big].$$
But now $w^t$ is completely determined by $v_{1:t-1}$, since $w^t = -\eta(v_1 + … + v_{t-1})$. It follows that $\mathbb{E}_{v_t}[v_t \mid v_{1:t-1}] = \mathbb{E}_{v_t}[v_t \mid w^t] \in \partial f(w^t)$. Therefore, by the subgradient inequality,
$$\mathbb{E}_{v_t}\big[\langle w^t - w^*, v_t \rangle \mid v_{1:t-1}\big] = \langle w^t - w^*, \mathbb{E}_{v_t}[v_t \mid w^t] \rangle \geq f(w^t) - f(w^*),$$
which also implies
$$\mathbb{E}_{v_{1:T}}\big[\langle w^t - w^*, v_t \rangle\big] \geq \mathbb{E}_{v_{1:T}}\big[f(w^t) - f(w^*)\big].$$
Now summing over $t \leq T$ and dividing by $T$ gives the desired inequality. To achieve
$$\frac{B\rho}{\sqrt{T}} \leq \epsilon$$
we need $T \geq (B \rho / \epsilon)^2$ iterations. $\square$
§5. Stochastic Gradient Descent for Risk Minimization
In Learning Theory we want to minimize the risk
$$\mathcal{L}_{\mathcal{D}}(w) = \mathbb{E}_{z \sim \mathcal{D}}[l(w, z)].$$
SGD allows us to directly minimize $\mathcal{L}_{\mathcal{D}}$. For simplicity we assume that $l(-, z)$ is differentiable for all $z \in Z$. We construct the random direction $v_t$ as follows: Sample $z \sim \mathcal{D}$ and put
$$v_t = \nabla l(w^t, z),$$
where the gradient is taken w.r.t. $w$. Interchanging integration and gradient we get
$$\mathbb{E}[v_t \mid w^t] = \mathbb{E}_{z \sim \mathcal{D}}[\nabla l(w^t, z)] = \nabla \mathbb{E}_{z \sim \mathcal{D}}[l(w^t, z)] = \nabla \mathcal{L}_{\mathcal{D}}(w^t).$$
The same argument can be applied to the subgradient case. Let $v_t \in \partial l(w^t, z)$ for a sample $z \sim \mathcal{D}$. Then by definition, for all $u$,
$$l(u, z) - l(w^t, z) \geq \langle u - w^t, v_t \rangle.$$
Applying the expectation over $z \sim \mathcal{D}$ to both sides of the inequality, we get
$$\mathcal{L}_{\mathcal{D}}(u) - \mathcal{L}_{\mathcal{D}}(w^t) \geq \langle u - w^t, \mathbb{E}[v_t \mid w^t] \rangle,$$
and therefore $\mathbb{E}[v_t \mid w^t] \in \partial \mathcal{L}_{\mathcal{D}}(w^t)$. This yields the following special form of SGD.
Definition (Stochastic Gradient Descent for Risk Minimization):
Input: $\eta > 0$, $T \in \mathbb{N}$
- Initialize $w^1=0$
- For $t=1…T$
- Sample $z \sim \mathcal{D}$
- Pick $v_t \in \partial l(w^t, z)$
- $w^{t+1} = w^t - \eta v_t$
- Return $\overline{w} = \frac{1}{T} \sum_{t=1}^T w^t.$
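A minimal sketch of this special form of SGD. As an illustrative assumption we take the absolute loss $l(w, z) = |w - z|$ with subgradient $\mathrm{sign}(w - z)$, so $\mathcal{L}_{\mathcal{D}}(w) = \mathbb{E}|w - z|$ is minimized at the median of $\mathcal{D}$; the distribution (uniform over a small finite set) is also an assumption.

```python
# SGD for risk minimization: sample z ~ D, pick v_t in dl(w^t, z).
# Assumed loss: l(w, z) = |w - z|, whose subgradient is sign(w - z);
# the risk E|w - z| is minimized at the median of D.
import random

random.seed(2)
data = [0.0, 1.0, 2.0, 2.5, 10.0]  # D = uniform over these points; median 2.0
eta, T = 0.05, 5000
w, iterates = 0.0, []
for _ in range(T):
    iterates.append(w)
    z = random.choice(data)                        # sample z ~ D
    v = 1.0 if w > z else -1.0 if w < z else 0.0   # v_t in dl(w^t, z)
    w = w - eta * v                                # w^{t+1} = w^t - eta * v_t
w_bar = sum(iterates) / len(iterates)
print(abs(w_bar - 2.0) < 0.5)  # True: the averaged iterate approaches the median
```

Note that the outlier $10.0$ barely affects the result, since the subgradients of the absolute loss are bounded by $\rho = 1$ regardless of the data.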
Using Theorem 14.8 we get the following corollary.
Corollary: Let $B, \rho > 0$ and let $\mathcal{L}_{\mathcal{D}}$ be convex such that
$$\lVert v_1 \rVert, …, \lVert v_T \rVert \leq \rho$$
almost surely. Let $w^* \in \mathrm{argmin}_{\lVert w \rVert \leq B}\mathcal{L}_{\mathcal{D}}(w)$. Then to achieve
$$\mathbb{E}[\mathcal{L}_{\mathcal{D}}(\overline{w})] - \mathcal{L}_{\mathcal{D}}(w^*) \leq \epsilon$$
for given $\epsilon > 0$ we need to run SGD with step size $\eta = \frac{B}{\rho}\frac{1}{\sqrt{T}}$ and $T \geq (B \rho / \epsilon)^2$ iterations.