
SeminarFromTheoryToAlgorithms

Website of the Seminar on the Second Part of the Book "Understanding Machine Learning", taught in Summer Term 2019 by Pascal Welke and Michael Kamp

Model Selection and Validation

§ 1 Introduction

Definition 1

Model Selection is the task of choosing the best algorithm and parameters for a given learning problem.

Example 1

When fitting a polynomial of degree $d$ to a function $f:\mathbb{R}\to\mathbb{R}$, the choice of $d$ has a huge impact on the result.
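
As a small illustration, here is a minimal Python sketch of this example; the target function ($\sin$), the noise level, and the sample size are my own assumptions, not part of the seminar material.

```python
# Minimal sketch of Example 1: fit polynomials of different degrees d to
# noisy samples of an (assumed) target function f = sin.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=30)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)  # noisy observations of f

for d in (1, 3, 15):
    coeffs = np.polyfit(x, y, deg=d)        # least-squares polynomial fit
    mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree {d:2d}: training MSE = {mse:.4f}")
# Small d tends to underfit (high training error); very large d drives the
# training error towards 0 but typically overfits the noise.
```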

§ 2 Validation

Idea: Use additional samples, drawn according to $\mathcal{D}$, as a validation set to estimate the true risk of our predictors. In practice we usually do not generate additional samples, but take an existing set of samples and divide it into a training set $S$ and a validation set $V$ that is not used for training; $V$ is also called a hold-out set.
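
A minimal sketch of such a hold-out split (the function name, the split ratio, and the use of NumPy are my own choices, not part of the text):

```python
# Split a sample into a training set S and a hold-out validation set V.
import numpy as np

def holdout_split(X, y, val_fraction=0.2, seed=0):
    """Randomly partition (X, y) into (X_train, y_train, X_val, y_val)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_fraction)
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# The empirical loss on V, L_V(h), then serves as an estimate of the true
# risk L_D(h), since V was not used to train h.
```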

§ 2.1 Validation for Model Selection

Theorem 1 (without proof)

Let $h$ be a predictor and $V = (x_1,y_1), \ldots, (x_{m_v}, y_{m_v})$ a validation set, sampled according to $\mathcal{D}$ (and not used during the training of $h$). Assume the loss function takes values in $[0,1]$. Then for every $\delta \in (0,1)$, with probability of at least $1-\delta$ over the choice of $V$, we have

$$\vert L_{\mathcal{D}}(h) - L_{V}(h) \vert \leq \sqrt{\frac{\ln(2/\delta)}{2 m_v}}.$$

Remark

This is a tight bound! The Fundamental Theorem of Learning gives a similar bound, which is not quite as good. Theorem 1 bounds the true risk well because $V$ is a “fresh” set that was not used during training, so $h$ and $V$ are independent.
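
To get a feeling for the bound, here is a hedged numeric example (the values of $\delta$ and $m_v$ are made up):

```python
# Evaluate the bound of Theorem 1 for an example validation-set size.
import math

delta, m_v = 0.05, 1000
bound = math.sqrt(math.log(2 / delta) / (2 * m_v))
print(f"|L_D(h) - L_V(h)| <= {bound:.3f} with probability >= {1 - delta}")
# With 1000 validation examples and delta = 0.05, the gap is at most ~0.043.
```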

Theorem 2 (without proof)

Let $\mathcal{H}=\{h_1, \ldots, h_r\}$ be a finite set of predictors and $V = (x_1,y_1), \ldots, (x_{m_v}, y_{m_v})$ a validation set, sampled according to $\mathcal{D}$ (and not used during the training of any of the predictors). Assume the loss function takes values in $[0,1]$. Then for every $\delta \in (0,1)$, with probability of at least $1-\delta$ over the choice of $V$, we have

$$\forall h \in \mathcal{H}: \quad \vert L_{\mathcal{D}}(h) - L_{V}(h) \vert \leq \sqrt{\frac{\ln(2\vert\mathcal{H}\vert/\delta)}{2 m_v}}.$$

Remark

This is very similar to learning a finite hypothesis class, but here the predictors $h_i$ are the outputs of different learning algorithms (or of the same algorithm with different parameters).

If $\vert\mathcal{H}\vert$ is not too large, this theorem implies that validation sets can be used to approximate the true error. Otherwise, we risk overfitting to the validation set.
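
A sketch of how Theorem 2 is used in practice: pick, among a finite set of already trained predictors, the one with the smallest validation error (the 0-1 loss and the helper names are assumptions):

```python
# Select among finitely many trained predictors using a validation set V.
import numpy as np

def validation_error(h, X_val, y_val):
    """Empirical 0-1 loss L_V(h) of predictor h on the validation set."""
    return np.mean(h(X_val) != y_val)

def select_best(predictors, X_val, y_val):
    """Return the predictor with minimal validation error and that error."""
    errors = [validation_error(h, X_val, y_val) for h in predictors]
    best = int(np.argmin(errors))
    return predictors[best], errors[best]

# By Theorem 2, the selected predictor's validation error is, with high
# probability, within sqrt(ln(2|H|/delta) / (2 m_v)) of its true risk,
# which is only useful if |H| (the number of candidates) is not too large.
```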

§ 2.2 Model-Selection curve

The model selection curve is a plot of the training and validation error against the complexity of the model (e.g., for Example 1, we would plot the training and validation error against the degree $d$ of the polynomial).

This tool makes it possible to see at which level of complexity overfitting or underfitting occurs. Usually, we choose the complexity for which the validation error is minimal.
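
A minimal sketch of a model-selection curve for Example 1 (data generation, split sizes, and the degree range are assumptions; plotting is replaced by printing to keep the sketch self-contained):

```python
# Model-selection curve: training and validation error vs. polynomial degree.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=60)
y = np.sin(x) + rng.normal(scale=0.2, size=x.shape)
x_tr, y_tr, x_val, y_val = x[:40], y[:40], x[40:], y[40:]

degrees = list(range(1, 16))
train_err, val_err = [], []
for d in degrees:
    coeffs = np.polyfit(x_tr, y_tr, deg=d)
    train_err.append(np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2))
    val_err.append(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
    print(f"d = {d:2d}: train {train_err[-1]:.3f}, validation {val_err[-1]:.3f}")

best_d = degrees[int(np.argmin(val_err))]   # degree with minimal validation error
print("chosen degree:", best_d)
```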

§ 3 What to do if learning fails

When learning fails, we can try one of the following things:

- get a larger training sample,
- change the hypothesis class (enlarge it, reduce it, change it completely, or change its parameters),
- change the feature representation of the data,
- change the optimization algorithm used to solve the learning problem.

But how can we decide which option to choose? To answer this question, we first need to understand the different causes of bad performance by decomposing the error terms.

§ 3.1 Error decomposition

Let $h_S$ be the hypothesis learned on the training set $S$ and define $h^* = \underset{h\in \mathcal{H}}{\text{argmin}}\,L_{\mathcal{D}}(h)$. We can decompose the true error as

$$L_{\mathcal{D}}(h_S) = L_{\mathcal{D}}(h^* ) + \left( L_{\mathcal{D}}(h_S) - L_{\mathcal{D}}(h^* ) \right),$$

where $L_{\mathcal{D}}(h^* )$ is called the approximation error and $L_{\mathcal{D}}(h_S) - L_{\mathcal{D}}(h^* )$ is called the estimation error. The approximation error is high if our hypothesis class is not capable of fitting the data, in which case underfitting necessarily occurs. However, we cannot directly measure these two errors.

Another decomposition is more practical:

$$L_{\mathcal{D}}(h_S) = \left( L_{\mathcal{D}}(h_S) - L_{V}(h_S) \right) + \left( L_{V}(h_S) - L_{S}(h_S) \right) + L_{S}(h_S).$$

Here, $L_{\mathcal{D}}(h_S) - L_{V}(h_S )$ is tightly bounded by Theorem 1 and $L_{V}(h_S ) - L_{S}(h_S )$ is a measure of overfitting. For $L_{S}(h_S )$, we have another decomposition:

$$L_{S}(h_S) = \left( L_{S}(h_S) - L_{S}(h^* ) \right) + \left( L_{S}(h^* ) - L_{\mathcal{D}}(h^* ) \right) + L_{\mathcal{D}}(h^* ).$$

If $h_S$ is an ERM hypothesis, then $L_{S}(h_S ) - L_{S}(h^* ) \leq 0$ by the definition of $h_S$. In the term $L_{S}(h^* ) - L_{\mathcal{D}}(h^* )$, the hypothesis $h^{* }$ is independent of $S$, so $S$ plays the role of a fresh validation set for $h^{* }$ and, as in Theorem 1, this term is tightly bounded as well. Combining the three terms, we get with probability at least $1-\delta$

$$L_{S}(h_S) \leq L_{\mathcal{D}}(h^* ) + \sqrt{\frac{\ln(2/\delta)}{2 m}},$$

where $m = \vert S \vert$. Now we know:

If $L_{S}(h_S)$ (the training error) is large, then $L_{\mathcal{D}}(h^* )$ (the approximation error) is large as well. In this case, we would change or enlarge the hypothesis class, or change the feature representation.

Remark: The reverse does not hold. If the approximation error is large, the training error might be small.

§ 3.2 Learning curves

We will now consider the case of a small training error.

Assume that $L_S(h_S)=0$. There are two different (extreme) scenarios described below. As we will see, the learning curve helps to distinguish between those cases.

|  | $m < VC_{\dim}$ of the learner, high approximation error | $m > 2\,VC_{\dim}$ of the learner, approximation error is $0$ |
| --- | --- | --- |
| Learning curve: training error | Constantly $0$ | Constantly $0$ |
| Learning curve: validation error | High validation error, as the model is only “learning by heart” | The validation error will be high at first, but then starts to converge to $0$ |
| What to do | Change the hypothesis class or the feature representation | Obtain more data or reduce the complexity of the hypothesis class |

For $m \to \infty$, both the validation and the training error converge to the approximation error.
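
A sketch of how such a learning curve can be computed for Example 1 (the fixed degree, the sample sizes, and the data generation are assumptions):

```python
# Learning curve: training and validation error as a function of m.
import numpy as np

rng = np.random.default_rng(2)
x_val = rng.uniform(-3, 3, size=200)
y_val = np.sin(x_val) + rng.normal(scale=0.2, size=x_val.shape)

degree = 5                                   # fixed model complexity
for m in (10, 20, 40, 80, 160):
    x_tr = rng.uniform(-3, 3, size=m)
    y_tr = np.sin(x_tr) + rng.normal(scale=0.2, size=x_tr.shape)
    coeffs = np.polyfit(x_tr, y_tr, deg=degree)
    tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    va = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"m = {m:3d}: training error {tr:.3f}, validation error {va:.3f}")
# If the validation error keeps decreasing with m, more data is likely to
# help; if it plateaus far above the training error, change the hypothesis
# class or the feature representation instead.
```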

§ 3.3 Summary

The following flow chart summarizes this chapter:

*(Flow chart summarizing how to diagnose the cause of bad performance and which remedy to choose.)*