Regularization and Stability
§ 0 Overview
First we will define Regularized Loss Minimization and see how the stability of a learning algorithm and overfitting are connected. Then we will prove some general stability bounds for Tikhonov regularization. To get useful bounds, we have to add further assumptions, such as a Lipschitz or a $\beta$-smooth loss function. However, only Lipschitz loss functions are considered here. We will prove that learning problems with a convex-Lipschitz-bounded loss function and Tikhonov regularization are APAC learnable. We will also see (without proof) a similar result for Ridge Regression, whose loss function is not Lipschitz.
§ 1 RLM Rule
Definition 1: Regularized Loss Minimization (RLM)
Regularized Loss Minimization is a learning rule of the form $\underset{w}{\text{argmin}} \left(L_S(w) + R(w)\right)$, with a regularization function $R: \mathbb{R}^d \to \mathbb{R}$. The case $R(w) = \lambda \lVert w \rVert^2_2$ with $\lambda>0$ is called Tikhonov regularization.
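As a small illustration (a toy example of my own, not from the notes), the RLM rule with Tikhonov regularization can be solved by plain gradient descent; for a one-dimensional squared loss the minimizer is also available in closed form, which lets us check the result:

```python
# RLM rule, sketched for squared loss in one dimension:
#   A(S) = argmin_w  L_S(w) + lam * w^2
# Toy data and hyperparameters are arbitrary choices for illustration.
lam = 0.1
S = [(1.0, 2.0), (2.0, 3.0), (3.0, 5.0)]  # sample of (x, y) pairs

def rlm(steps=2000, lr=0.05):
    """Minimize L_S(w) + lam*w^2 by gradient descent."""
    w = 0.0
    m = len(S)
    for _ in range(steps):
        # gradient of (1/m) * sum 1/2*(w*x - y)^2  +  lam*w^2
        grad = sum((w * x - y) * x for x, y in S) / m + 2 * lam * w
        w -= lr * grad
    return w

w_gd = rlm()
# Closed form in 1-d: w = sum(x*y) / (sum(x^2) + 2*lam*m)
w_exact = sum(x * y for x, y in S) / (sum(x * x for x, _ in S) + 2 * lam * len(S))
print(abs(w_gd - w_exact) < 1e-6)  # True
```

With $\lambda = 0$ the rule reduces to plain ERM; the regularizer shrinks the solution toward $0$.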
§ 2 Stable Rules and Overfitting
Notations
Symbol | Meaning |
---|---|
$U(m)$ | uniform distribution over $[m]=\{1,\ldots,m\}$ |
$A$ | a learning algorithm |
$S = (z_1, \ldots,z_m)$ | training set |
$A(S)$ | output of $A$ after processing $S$ |
$z'$ | another training example |
$S^{(i)}$ | $(z_1,\ldots,z_{i-1},z',z_{i+1}, \ldots, z_m)$ |
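In code, the $S^{(i)}$ operation is just a one-element replacement (a hypothetical helper, 0-indexed for simplicity):

```python
# Hypothetical helper mirroring the S^(i) notation: return a copy of S
# with the i-th example (0-indexed here) replaced by z_prime.
def replace_one(S, i, z_prime):
    return S[:i] + [z_prime] + S[i + 1:]

S = ["z1", "z2", "z3"]
print(replace_one(S, 1, "z'"))  # ['z1', "z'", 'z3']
print(S)                        # unchanged: ['z1', 'z2', 'z3']
```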
Definition 2: Stability
Let $\epsilon: \mathbb{N} \to \mathbb{R}$ be monotonically decreasing. $A$ is on-average-replace-one-stable with rate $\epsilon(m)$ if for every distribution $\mathcal{D}$

$$\underset{(S,z')\sim\mathcal{D}^{m+1},\, i\sim U(m)}{\mathbb{E}}\left[l(A(S^{(i)}), z_i) - l(A(S),z_i)\right] \leq \epsilon(m)$$

holds.
Remark
For a reasonable learner, the term $l(A(S^{(i)}), z_i) - l(A(S),z_i)$ will typically be greater than zero, because $z_i$ was used during the learning of $A(S)$, but not while learning $A(S^{(i)})$.
Theorem 1
Let $\mathcal{D}$ be a distribution. Let $S = (z_1, \ldots,z_m)$, $z'$ be examples, independent and identically distributed according to $\mathcal{D}$. Then for any learning algorithm $A$:

$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S)) - L_S(A(S))\right] = \underset{(S,z')\sim\mathcal{D}^{m+1},\, i\sim U(m)}{\mathbb{E}}\left[l(A(S^{(i)}), z_i) - l(A(S),z_i)\right]$$
Proof
For all $i\in [m]$ we have:

$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] = \underset{(S,z')\sim\mathcal{D}^{m+1}}{\mathbb{E}}\left[l(A(S), z')\right] = \underset{(S,z')\sim\mathcal{D}^{m+1}}{\mathbb{E}}\left[l(A(S^{(i)}), z_i)\right]$$

For the first equation we used the definition of the true error $L_{\mathcal{D}}$, and for the second equation we swapped the names of the i.i.d. variables $z'$ and $z_i$. Using the definition of the empirical error $L_S$ as a weighted sum of terms of the form $l(-,z_i)$, which can be written as an expectation as well, yields:

$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_S(A(S))\right] = \underset{S\sim\mathcal{D}^{m},\, i\sim U(m)}{\mathbb{E}}\left[l(A(S), z_i)\right]$$
Combining both equations finishes the proof. $\square$
Remark
As $\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S)) - L_S(A(S))\right]$ is a measurement of overfitting, Theorem 1 tells us, simply put, that “stable rules do not overfit”.
§ 3 Strong Convexity
Definition 3
A function $f$ is $\lambda$-strongly convex if for all $w, u$ and all $\alpha\in(0,1)$ we have

$$f(\alpha w + (1-\alpha)u) \leq \alpha f(w) + (1-\alpha) f(u) - \frac{\lambda}{2}\alpha(1-\alpha)\lVert w-u\rVert^2$$
Lemma 1
- $f(w)=\lambda\lVert w\rVert^2$ is $2\lambda$-strongly convex.
- If $f$ is $\lambda$-strongly convex and $g$ is convex, then $f+g$ is $\lambda$-strongly convex.
- If $f$ is $\lambda$-strongly convex and $u$ is a minimizer of $f$, then for any $w$:

$$f(w) - f(u) \geq \frac{\lambda}{2}\lVert w-u\rVert^2$$
Proof
1 and 2 are easy to check, so only a proof of 3 is provided here: First we divide the definition of strong convexity by $\alpha$ and rearrange to get the following:

$$\frac{f(u + \alpha(w-u)) - f(u)}{\alpha} \leq f(w) - f(u) - \frac{\lambda}{2}(1-\alpha)\lVert w-u\rVert^2$$

Now let $g(\alpha)=f(u+\alpha(w-u))$ and take the limit $\alpha \to 0$. Using that $u$ is a minimizer (so $g'(0) \geq 0$), we obtain

$$0 \leq g'(0) \leq f(w) - f(u) - \frac{\lambda}{2}\lVert w-u\rVert^2$$
$\square$
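For completeness of part 1, a direct expansion of both sides (a one-line check, not in the original notes) gives the exact identity

$$\lVert \alpha w + (1-\alpha)u\rVert^2 = \alpha\lVert w\rVert^2 + (1-\alpha)\lVert u\rVert^2 - \alpha(1-\alpha)\lVert w-u\rVert^2$$

Multiplying both sides by $\lambda$ shows that $f(w)=\lambda\lVert w\rVert^2$ satisfies the definition of $2\lambda$-strong convexity with equality.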
§ 4 Tikhonov Regularization as a Stabilizer
From now on, we will assume that our loss function is convex. Our goal will be to bound $\lVert A(S^{(i)})-A(S)\rVert$ for Tikhonov regularization.
We define $f_S(w) = L_S(w) + \lambda\lVert w \rVert^2$ and $A(S)=\underset{w}{\text{argmin }} f_S(w)$.
By Lemma 1.1 and 1.2, $f_S$ is $2\lambda$-strongly convex. Since $A(S)$ is a minimizer of $f_S$, Lemma 1.3 gives for any $v$

$$f_S(v) - f_S(A(S)) \geq \lambda\lVert v - A(S)\rVert^2 \tag{1}$$
Also, for any $u, v$ and $i$, we have

$$f_S(v) - f_S(u) = f_{S^{(i)}}(v) - f_{S^{(i)}}(u) + \frac{l(v,z_i)-l(u,z_i)}{m} + \frac{l(u,z')-l(v,z')}{m}$$
For $v=A(S^{(i)})$, $u=A(S)$: $v$ is a minimizer of $f_{S^{(i)}}$, so $f_{S^{(i)}}(v) - f_{S^{(i)}}(u) \leq 0$ and we obtain

$$f_S(v) - f_S(u) \leq \frac{l(v,z_i)-l(u,z_i)}{m} + \frac{l(u,z')-l(v,z')}{m} \tag{2}$$
By (1), it follows that

$$\lambda\lVert v - u\rVert^2 \leq \frac{l(v,z_i)-l(u,z_i)}{m} + \frac{l(u,z')-l(v,z')}{m} \tag{3}$$
This is a general bound for Tikhonov regularization. To bound this further, we will now assume our loss function to be Lipschitz.
Theorem 2
Assume a convex, $\rho$-Lipschitz loss function. Then the RLM rule with $\lambda\lVert w \rVert ^2$ regularization is on-average-replace-one-stable with rate $\frac{2\rho^2}{\lambda m}$. By Theorem 1, this also implies that

$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S)) - L_S(A(S))\right] \leq \frac{2\rho^2}{\lambda m}$$
Proof
Let $l(-,z_i)$ be $\rho$-Lipschitz. Then, by definition,

$$l(v,z_i) - l(u,z_i) \leq \rho\lVert v-u\rVert \quad\text{and}\quad l(u,z') - l(v,z') \leq \rho\lVert v-u\rVert$$

Plugging this into (3) gives us

$$\lambda\lVert v-u\rVert^2 \leq \frac{2\rho\lVert v-u\rVert}{m}, \quad\text{hence}\quad \lVert v-u\rVert \leq \frac{2\rho}{\lambda m}$$

We now insert this into the Lipschitz condition and finally get

$$l(A(S^{(i)}),z_i) - l(A(S),z_i) \leq \rho\lVert A(S^{(i)}) - A(S)\rVert \leq \frac{2\rho^2}{\lambda m}$$

As this holds for any $S, z', i$, taking expectations concludes the proof. $\square$
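The intermediate step $\lVert A(S^{(i)}) - A(S)\rVert \leq \frac{2\rho}{\lambda m}$ can also be checked numerically. The sketch below uses made-up data and a grid search in place of the exact argmin, with the absolute loss $l(w,(x,y)) = \lvert wx - y\rvert$, which is $\rho$-Lipschitz in $w$ with $\rho = 1$ when $\lvert x\rvert \leq 1$:

```python
# Numeric sanity check (toy data): for the absolute loss l(w,(x,y)) = |w*x - y|
# with |x| <= 1 (so rho = 1), Tikhonov-regularized RLM should satisfy
# |A(S^(i)) - A(S)| <= 2*rho/(lam*m).  Grid search approximates the argmin.
lam = 1.0
S = [(1.0, 0.5), (0.8, -0.2), (0.5, 1.0), (1.0, 0.9)]

def A(sample):
    grid = [i / 10000.0 for i in range(-20000, 20001)]  # w in [-2, 2]
    def f(w):
        emp = sum(abs(w * x - y) for x, y in sample) / len(sample)
        return emp + lam * w * w
    return min(grid, key=f)

w = A(S)
z_prime = (0.9, -1.0)
m = len(S)
for i in range(m):
    w_i = A(S[:i] + [z_prime] + S[i + 1:])
    assert abs(w_i - w) <= 2 * 1.0 / (lam * m) + 1e-3  # bound plus grid tolerance
print("stability bound holds")
```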
§ 5 Controlling the Fitting-Stability Tradeoff
The theorem above shows that the stability term decreases as $\lambda$ increases. As the empirical risk increases with $\lambda$, we face a tradeoff between fitting and overfitting. In this section, we will choose a value of $\lambda$ to derive a new bound on the true risk.
Theorem 3
Assume a convex, $\rho$-Lipschitz loss function. Then the RLM rule with Tikhonov regularization satisfies, for every $w^{* }$:

$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] \leq L_{\mathcal{D}}(w^{* }) + \lambda\lVert w^{* }\rVert^2 + \frac{2\rho^2}{\lambda m}$$
Remark
This bound is also called an oracle inequality. We may think of $w^{* }$ as a hypothesis with low risk; $A(S)$ will then be only slightly worse than $w^{* }$.
Proof
We have $L_{S}(A(S)) \leq L_{S}(A(S)) + \lambda\lVert A(S)\rVert^2 \leq L_{S}(w^{* }) + \lambda\lVert w^{* }\rVert^2$, as $A(S)$ is a minimizer of $f_S(w) = L_S(w) + \lambda\lVert w\rVert^2$. Taking expectations and using $\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{S}(w^{* }) \right] =L_{\mathcal{D}}(w^* )$ yields

$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{S}(A(S))\right] \leq L_{\mathcal{D}}(w^{* }) + \lambda\lVert w^{* }\rVert^2 \tag{4}$$

Writing $L_{\mathcal{D}}(A(S)) = L_{S}(A(S)) + \left(L_{\mathcal{D}}(A(S)) - L_{S}(A(S))\right)$ and using (4), we get

$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] \leq L_{\mathcal{D}}(w^{* }) + \lambda\lVert w^{* }\rVert^2 + \underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S)) - L_{S}(A(S))\right]$$
Applying Theorem 2 finishes the proof. $\square$
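The specific $\lambda$ used in Corollary 1 can be motivated by minimizing the right-hand side of Theorem 3 over $\lambda$: by the AM-GM inequality,

$$\lambda\lVert w^{* }\rVert^2 + \frac{2\rho^2}{\lambda m} \geq 2\sqrt{\frac{2\rho^2\lVert w^{* }\rVert^2}{m}} = \rho\lVert w^{* }\rVert\sqrt{\frac{8}{m}}$$

with equality for $\lambda = \sqrt{\frac{2\rho^2}{\lVert w^{* }\rVert^2 m}}$; bounding $\lVert w^{* }\rVert \leq B$ then yields the rate $\rho B \sqrt{8/m}$.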
Corollary 1
For a convex-Lipschitz-bounded learning problem $(\mathcal{H},Z,l)$ with parameters $\rho, B$ and $\lambda := \sqrt{\frac{2\rho^2}{B^2m}}$ and Tikhonov regularization, we get

$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] \leq \underset{w \in \mathcal{H}}{\min}\, L_{\mathcal{D}}(w) + \rho B\sqrt{\frac{8}{m}}$$

In particular, this implies that for every $\epsilon > 0$ and every distribution $\mathcal{D}$, we get

$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] \leq \underset{w \in \mathcal{H}}{\min}\, L_{\mathcal{D}}(w) + \epsilon$$

if $m \geq \frac{8\rho^2 B^2}{\epsilon^2}$.
(Note that bounded means $\lVert w \rVert \leq B$ for all $w \in \mathcal{H}$.)
Proof
The corollary follows directly by setting $w^{* }$ to $\underset{w \in \mathcal{H}}{\text{argmin }} L_{\mathcal{D}}(w)$, inserting $\lambda$ in Theorem 3, and using $\lVert w^{* } \rVert \leq B$. $\square$
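Read as a recipe, Corollary 1 prescribes both a sample size and the matching regularization strength. A small computation with illustrative parameters $\rho = B = 1$ (my choice, not from the text):

```python
import math

rho, B = 1.0, 1.0  # illustrative Lipschitz and norm bounds

def sample_complexity(eps):
    """Smallest integer m with m >= 8*rho^2*B^2/eps^2 (Corollary 1)."""
    return math.ceil(8 * rho**2 * B**2 / eps**2)

def tikhonov_lambda(m):
    """lambda := sqrt(2*rho^2/(B^2*m)) from Corollary 1."""
    return math.sqrt(2 * rho**2 / (B**2 * m))

print(sample_complexity(0.1))            # 800
print(round(tikhonov_lambda(800), 6))    # 0.05
```

Halving $\epsilon$ quadruples the required sample size, the usual $1/\epsilon^2$ behavior of agnostic rates.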
§ 6 APAC Learnability
Convex-Lipschitz-bounded problems are APAC learnable, as Lemma 2 will show:
Lemma 2
If an algorithm $A$ guarantees that for every $\epsilon > 0$, every $m \geq m_{\mathcal{H}}(\epsilon)$ and every distribution $\mathcal{D}$

$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] \leq \underset{h \in \mathcal{H}}{\min}\, L_{\mathcal{D}}(h) + \epsilon$$

holds,
then the problem is APAC learnable by $A$.
Proof
Let $\delta \in (0,1)$ and $m\geq m_{\mathcal{H}}(\epsilon \delta)$. Define $X=L_{\mathcal{D}}(A(S)) - \underset{h \in \mathcal{H}}{\min}\, L_{\mathcal{D}}(h)$. Then $X\geq0$ and by our assumption $\mathbb{E}[X] \leq \epsilon \delta$. Using Markov's inequality, we obtain

$$\mathbb{P}\left[X > \epsilon\right] \leq \frac{\mathbb{E}[X]}{\epsilon} \leq \frac{\epsilon\delta}{\epsilon} = \delta$$

Hence, with probability at least $1-\delta$, we have $L_{\mathcal{D}}(A(S)) \leq \underset{h \in \mathcal{H}}{\min}\, L_{\mathcal{D}}(h) + \epsilon$.
$\square$
§ 7 Ridge Regression
Definition 4
Ridge Regression is the following learning rule with squared loss:

$$A(S) = \underset{w\in\mathbb{R}^d}{\text{argmin}}\left(\lambda\lVert w\rVert_2^2 + \frac{1}{m}\sum_{i=1}^{m}\frac{1}{2}\left(\left<w,x_i\right> - y_i\right)^2\right)$$
Remark
This is just the RLM rule with Tikhonov regularization and loss function $l(w,z') = {\frac{1}{2}(\left<w,x'\right> -y')^2}$. But this loss function is not Lipschitz, so we cannot apply the theorems for convex-Lipschitz-bounded functions! However, its gradient

$$\nabla_w\, l(w,z') = \left(\left<w,x'\right> - y'\right)x'$$

is $\beta$-Lipschitz (for some value of $\beta$). Functions with this property, i.e. with $\beta$-Lipschitz gradients, are called $\beta$-smooth.
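Because the squared loss is differentiable, Ridge Regression has a closed form: setting the gradient of the objective in Definition 4 to zero gives the linear system $(2\lambda m I + X^\top X)\,w = X^\top y$ (the factor $2\lambda m$ comes from the $\frac{1}{2}$ and $\frac{1}{m}$ in the objective). A minimal pure-Python sketch for $d = 2$ with made-up data:

```python
# Ridge Regression in closed form for d = 2 (pure Python, toy data).
# Zero gradient of  lam*||w||^2 + (1/m) * sum 1/2*(<w,x_i> - y_i)^2
# gives the linear system  (2*lam*m*I + X^T X) w = X^T y.
lam = 0.1
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [1.0, 2.0, 3.0]
m = len(X)

# Build A = 2*lam*m*I + X^T X and b = X^T y.
a11 = 2 * lam * m + sum(x[0] * x[0] for x in X)
a12 = sum(x[0] * x[1] for x in X)
a22 = 2 * lam * m + sum(x[1] * x[1] for x in X)
b1 = sum(x[0] * yi for x, yi in zip(X, y))
b2 = sum(x[1] * yi for x, yi in zip(X, y))

# Solve the symmetric 2x2 system by Cramer's rule.
det = a11 * a22 - a12 * a12
w = ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a12 * b1) / det)
print(round(w[0], 6), round(w[1], 6))  # 0.9375 1.5625
```

For general $d$ one would solve the same system with a linear-algebra routine instead of Cramer's rule.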
Remark
For $\beta$-smooth functions, very similar results to the previously stated theorems and corollaries for Lipschitz functions hold. Especially we get:
Theorem 4 (without proof)
For a distribution $\mathcal{D}$ over $\chi \times [0,1]$ with $\chi = \{ x\in \mathbb{R}^d : \lVert x \rVert \leq 1\}$, $\mathcal{H}=\{w\in \mathbb{R}^d : \lVert w \rVert \leq B\}$, $\epsilon\in(0,1)$, $m \geq \frac{150 B^2}{\epsilon^2}$ and $\lambda = \frac{\epsilon}{3B^2}$, the Ridge Regression algorithm satisfies

$$\underset{S\sim\mathcal{D}^{m}}{\mathbb{E}}\left[L_{\mathcal{D}}(A(S))\right] \leq \underset{w \in \mathcal{H}}{\min}\, L_{\mathcal{D}}(w) + \epsilon$$
This implies that Ridge Regression is an APAC learner for this learning problem.
§ 8 Questions
Q1: What does Theorem 3 tell us?
A1: In expectation, the true risk of $A(S)$ is at most the risk of any fixed $w^{* }$ plus the regularization term $\lambda \lVert w^{* } \rVert^2$ plus a stability term $\frac{2\rho^2}{\lambda m}$ that vanishes as $m \rightarrow \infty$.
Q2: Why is $\lambda \lVert . \rVert ^2$ $2 \lambda$-strongly convex?
A2: This can be proven using the polarization identities.
Q3: In Corollary 1, why is $w$ bounded in a ball centered at $0$?
A3: We can change the regularizer and get the same theorem with $w$ bounded in a ball with a different center.
Q4: Are there practical applications to convergence theorems on Tikhonov regularization?
A4: Yes, Ridge Regression.