SeminarFromTheoryToAlgorithms

Website of the Seminar on the Second Part of the Book "Understanding Machine Learning", taught in Summer Term 2019 by Pascal Welke and Michael Kamp

What is Learning?

An insightful motivation in the first chapter is the following distinction between “good” learning and “bad” learning:

What is the difference that makes us consider the rats as “good” learners, and the pigeons as “bad”, superstitious learners? One of the main reasons is our (prior) knowledge of what is going on behind the scenes in the two situations: We know that the food is poisoned, hence avoiding it is actually smart; on the other hand, the pigeons’ actions do not influence the occurrence of food, hence their actions are “useless”.

Our Setting

Let’s see how we can formalize this setting.

Formal Model

Assumptions:

Let $\mathcal{X}$ be the domain set or universe, e.g. the set of all cookies. The domain set is often represented as $\mathcal{X} \subseteq \mathbb{R}^n$. Cookies, for example, could be described by two-dimensional real vectors where the first dimension measures a particular cookie’s crunchiness, while the second dimension measures its chocolate content.

Let $\mathcal{Y}$ be the set of target labels / possible outcomes, e.g. whether the cookies are tasty, or not. For the remainder of this section, we shall assume that $\mathcal{Y}$ only contains two values: $\mathcal{Y} = \{ 0,1 \}$.

There exists a probability distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$ that is unknown, but fixed. From this distribution, we will draw (labeled) training examples as well as (unlabeled) test examples.

There exists a measure of success that tells us how well our predictions do. That is, the error of a predictor $h$ that computes a value $h(x) \in \mathcal{Y}$ for any $x \in \mathcal{X}$ is defined as

$ L_{\mathcal{D}} (h) := \mathbb{P}_{(x,y) \sim \mathcal{D}} \left( h(x) \neq y \right) := \mathcal{D} \left( \{ (x,y) : h(x) \neq y \} \right) $
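To make the model concrete, here is a minimal Python sketch that estimates $L_{\mathcal{D}}(h)$ by Monte Carlo sampling. The “cookie” distribution, the noise rate, and the predictor are all made-up illustrations; note that a learner could never run this computation, since it has no access to $\mathcal{D}$.

```python
import random

def sample_example():
    """Draw one labeled example (x, y) from an illustrative distribution D."""
    crunchiness = random.uniform(0.0, 10.0)
    chocolate = random.uniform(0.0, 10.0)
    # Assumed ground truth: cookies with enough chocolate are tasty,
    # flipped with 10% label noise so that no predictor reaches error 0.
    y = 1 if chocolate > 5.0 else 0
    if random.random() < 0.1:
        y = 1 - y
    return (crunchiness, chocolate), y

def h(x):
    """A fixed predictor h: X -> {0, 1}, here a simple chocolate threshold."""
    return 1 if x[1] > 4.0 else 0

# L_D(h) = P_{(x,y) ~ D}(h(x) != y), estimated from n fresh samples.
n = 100_000
errors = sum(h(x) != y for x, y in (sample_example() for _ in range(n)))
print(f"estimated L_D(h): {errors / n:.3f}")
```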

Input:

A training sequence $S = ((x_1, y_1), \ldots, (x_m, y_m))$ with $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$ for all $i \in [m]$. Each $(x_i, y_i)$ is drawn independently and identically distributed (i.i.d.) from $\mathcal{D}$. We denote this by $S \sim \mathcal{D}^m$.

$S$ is all the learner sees. The learner has no access to $\mathcal{D}$, except via $S$.

Output:

A prediction rule $A(S)$ that can, given $x \in \mathcal{X}$, produce a value $A(S)(x) \in \mathcal{Y}$ such that $ L_{\mathcal{D}}(A(S)) $ is minimized.

$A(S)$ is not necessarily a function, but we write $A(S): \mathcal{X} \to \mathcal{Y}$ anyway.

Note further that our goal is to minimize the true error over the whole distribution $\mathcal{D}$ given only a finite sample. This task is clearly impossible to solve exactly!

Empirical Risk Minimization (ERM):

A first, rather natural choice of learning paradigm is empirical risk minimization:

Given a training sequence $S \sim \mathcal{D}^m$, return a prediction rule $A(S)$ that minimizes the empirical error (or empirical risk)

$ L_{S} (A(S)) := \frac{ | \{ i \in [m] : A(S)(x_i) \neq y_i\} | }{m} $.
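In code, the empirical error is a simple counting exercise. The following sketch evaluates it on a hand-made toy sample of (crunchiness, chocolate) cookies; both the sample and the predictor are illustrative:

```python
# Toy training sequence S: ((crunchiness, chocolate), tasty?) pairs.
S = [((3.0, 7.5), 1), ((8.0, 2.0), 0), ((5.0, 6.0), 1), ((1.0, 4.0), 0)]

def h(x):
    """An illustrative predictor: tasty iff chocolate content > 5."""
    return 1 if x[1] > 5.0 else 0

def empirical_error(h, S):
    """L_S(h) = |{i in [m] : h(x_i) != y_i}| / m."""
    return sum(h(x) != y for x, y in S) / len(S)

print(empirical_error(h, S))  # 0.0 on this tiny sample
```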

Stated like this, however, the ERM paradigm is very prone to overfitting. One valid empirical risk minimizer, for instance, memorizes every training example and its target value and returns a constant on unseen examples:

$ A(S) := \begin{cases} y_i & \text{ if } x = x_i \text{ for some } i \in [m] \\ 1 & \text{ otherwise} \end{cases} $
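This memorizing predictor is easy to implement; the sketch below (on an illustrative two-example sample) shows that it reproduces the training labels perfectly, i.e. $L_S(A(S)) = 0$, while answering the constant $1$ everywhere else:

```python
def memorize(S):
    """Return A(S): x -> y_i if x = x_i for some i in [m], else 1."""
    table = {x: y for x, y in S}
    def predictor(x):
        return table.get(x, 1)
    return predictor

S = [((3.0, 7.5), 1), ((8.0, 2.0), 0)]
A_S = memorize(S)
print(A_S((3.0, 7.5)))  # 1, memorized from S
print(A_S((8.0, 2.0)))  # 0, memorized from S
print(A_S((0.0, 0.0)))  # 1, the constant answer on unseen examples
```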

Intuitively, this is clearly not learning: there is no generalization from the training examples to the unseen test examples going on. How can we fix this?

The answer is to include prior knowledge in the learning process. Formally, we will require our learning algorithm (here, ERM) to output a prediction rule from a fixed set of prediction rules selected in advance. We call this set of candidate prediction rules the hypothesis class and denote it by $\mathcal{H}$.

Then, ERM becomes the following:

Given a training sequence $S \sim \mathcal{D}^m$ and some finite representation of a hypothesis class $\mathcal{H}$, return a prediction rule $A(S) \in \operatorname{argmin}_{h \in \mathcal{H}} L_{S} (h) \subseteq \mathcal{H}$.
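As a toy instance of this restricted ERM, the following sketch exhaustively searches a small class of chocolate-threshold classifiers for an empirical risk minimizer. The class, the threshold grid, and the sample are all illustrative choices:

```python
S = [((3.0, 7.5), 1), ((8.0, 2.0), 0), ((5.0, 6.0), 1),
     ((1.0, 4.0), 0), ((2.0, 5.5), 1), ((9.0, 3.0), 0)]

def make_threshold(t):
    """h_t(x) = 1 iff the chocolate content x[1] exceeds t."""
    return lambda x: 1 if x[1] > t else 0

# A finite hypothesis class H: one threshold per half unit of chocolate.
H = {t: make_threshold(t) for t in [0.5 * k for k in range(21)]}

def empirical_error(h, S):
    return sum(h(x) != y for x, y in S) / len(S)

# ERM: return any hypothesis in argmin_{h in H} L_S(h).
best_t = min(H, key=lambda t: empirical_error(H[t], S))
print(best_t, empirical_error(H[best_t], S))  # 4.0 0.0
```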

Intuitively, a more restricted hypothesis class $\mathcal{H}$ is a better safeguard against overfitting, but it imposes a stronger inductive bias and thus risks underfitting.

A hypothesis class $\mathcal{H}$ is said to satisfy the realizability assumption (with respect to $\mathcal{D}$) if there exists $h \in \mathcal{H}$ with $L_{\mathcal{D}} (h) = 0$. This is a desirable property, but it will rarely hold in practical learning scenarios.

Probably Approximately Correct (PAC) and Agnostic Probably Approximately Correct (APAC) Learning

The introduction of a hypothesis class $\mathcal{H}$ allows us to ask “Which hypothesis classes are learnable?”. That is, can we come up with a learning algorithm that minimizes $L_{\mathcal{D}} (A(S))$ for any distribution $\mathcal{D}$ and any $S \sim \mathcal{D}^m$? Again, in this exact formulation, we can be sure to fail, but there is hope if we allow a certain error bound (i.e., the performance of our learning rule may be worse than that of the best hypothesis in $\mathcal{H}$ by at most $\epsilon$) and a confidence parameter (i.e., we allow the algorithm to fail miserably in a certain fraction $\delta$ of cases).

Definition (APAC): A hypothesis class $\mathcal{H}$ of functions from $\mathcal{X}$ to $\mathcal{Y}$ is agnostic PAC (APAC) learnable if there exist a function $m_{\mathcal{H}}: (0,1)^2 \to \mathbb{N}$ and a learning algorithm $A$ such that for every $\epsilon, \delta \in (0,1)$ and every distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, when running $A$ on a training sequence $S \sim \mathcal{D}^m$ with $m \geq m_{\mathcal{H}}(\epsilon, \delta)$, it holds with probability at least $1 - \delta$ over the choice of $S$ that

$ L_{\mathcal{D}}(A(S)) \leq \min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) + \epsilon\ . $

We call $m_{\mathcal{H}}$ the sample complexity of $\mathcal{H}$. (PAC learnability is the special case where this guarantee is only required for distributions satisfying the realizability assumption; there, $\min_{h \in \mathcal{H}} L_{\mathcal{D}}(h) = 0$.)

Corollary: Every finite hypothesis class $\mathcal{H}$ is, under the realizability assumption, PAC learnable with sample complexity $m_{\mathcal{H}}(\epsilon, \delta) \leq \left\lceil \frac{\log(|\mathcal{H}| / \delta)}{\epsilon} \right\rceil$ using ERM.

Corollary: Every finite hypothesis class $\mathcal{H}$ is APAC learnable with sample complexity $m_{\mathcal{H}}(\epsilon, \delta) \leq \left\lceil \frac{2 \log(2|\mathcal{H}| / \delta)}{\epsilon^2} \right\rceil$ using ERM.

Note that we pay a price in terms of sample complexity if we drop the realizability assumption: the dependence on the error bound worsens from $1/\epsilon$ to $1/\epsilon^2$.
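To get a feeling for the two bounds, the following sketch plugs in some illustrative numbers ($|\mathcal{H}| = 1000$, $\epsilon = 0.1$, $\delta = 0.05$); the agnostic bound demands roughly twenty times as many examples here:

```python
import math

H_size, eps, delta = 1000, 0.1, 0.05  # illustrative choices

# Realizable case: m <= ceil(log(|H| / delta) / eps)
m_realizable = math.ceil(math.log(H_size / delta) / eps)

# Agnostic case: m <= ceil(2 * log(2 * |H| / delta) / eps^2)
m_agnostic = math.ceil(2 * math.log(2 * H_size / delta) / eps ** 2)

print(m_realizable)  # 100
print(m_agnostic)    # 2120
```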

No Free Lunch

Theorem (No Free Lunch): Let $A$ be any learning algorithm for binary classification with respect to the $0$-$1$ loss over a domain $\mathcal{X}$. Let $m \leq \frac{|\mathcal{X}|}{2}$ if $\mathcal{X}$ is finite, and let $m$ be any finite number if $\mathcal{X}$ is infinite. Then there exists a distribution $\mathcal{D}$ over $\mathcal{X} \times \{0,1\}$ such that

1. there exists a function $f: \mathcal{X} \to \{0,1\}$ with $L_{\mathcal{D}}(f) = 0$, but
2. with probability at least $1/7$ over the choice of $S \sim \mathcal{D}^m$, we have $L_{\mathcal{D}}(A(S)) \geq 1/8$.

Corollary: The class $\mathcal{H} = \{f | f: \mathcal{X} \to \{0,1\} \}$ is not APAC learnable for infinite $\mathcal{X}$.

The trick in the proof of the no free lunch theorem is to choose, for the given learner $A$, a distribution $\mathcal{D}$ on which it fails. This is possible because, without making any assumptions, there is actually nothing we can know about unseen data! Usually, however, we are saved by reasonable assumptions that we can make about the data / the world. Formally, these assumptions translate into restricting our hypothesis class $\mathcal{H}$ to a reasonable expressive power, so that $\operatorname{argmin}_{h \in \mathcal{H}} L_{\mathcal{D}}(h)$ becomes something we can reach. To make “expressive power” precise, we introduce shattering.

Definition (Shattering): A hypothesis class $\mathcal{H}$ shatters a finite set $C \subseteq \mathcal{X}$ if the restriction of $\mathcal{H}$ to $C$ is the set of all functions from $C$ to $\{0,1\}$.

Corollary: Let $\mathcal{H} \subseteq \{ f | f: \mathcal{X}\to\{0,1\} \}$ and suppose there is $C\subseteq\mathcal{X}$ with $|C|=2m$ that is shattered by $\mathcal{H}$. Then No Free Lunch! (i.e., the claim stated in the no free lunch theorem holds)

Definition (VC-Dimension): The VC-Dimension of a hypothesis class $\mathcal{H} \subseteq \{ f | f: \mathcal{X}\to\{0,1\} \}$ is the cardinality of the largest set $C\subseteq \mathcal{X}$ that is shattered by $\mathcal{H}$ (and infinite if $\mathcal{H}$ shatters arbitrarily large sets).
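For finite classes over finite domains, shattering and the VC-Dimension can be checked by brute force. The sketch below does this for an illustrative class of one-sided threshold functions on $\mathcal{X} = \{1, \ldots, 5\}$:

```python
from itertools import combinations

X = [1, 2, 3, 4, 5]
# Illustrative class: h_t(x) = 1 iff x >= t, for thresholds t = 0, ..., 6.
H = [lambda x, t=t: 1 if x >= t else 0 for t in range(0, 7)]

def shatters(H, C):
    """H shatters C iff restricting H to C yields all 2^|C| labelings."""
    labelings = {tuple(h(x) for x in C) for h in H}
    return len(labelings) == 2 ** len(C)

def vc_dimension(H, X):
    """Size of the largest subset of X shattered by H."""
    d = 0
    for k in range(1, len(X) + 1):
        if any(shatters(H, C) for C in combinations(X, k)):
            d = k
    return d

print(shatters(H, (3,)))    # True: both labels are realized on a point
print(shatters(H, (2, 4)))  # False: no h labels 2 with 1 and 4 with 0
print(vc_dimension(H, X))   # 1, as expected for one-sided thresholds
```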

Theorem: If $\mathcal{H}$ has infinite VC-Dimension, then $\mathcal{H}$ is not APAC learnable.

Theorem (Fundamental Theorem of APAC Learning): If $\mathcal{H} \subseteq \{ f | f: \mathcal{X}\to\{0,1\} \}$ has finite VC-Dimension $d$, then $\mathcal{H}$ is APAC learnable for the $0$-$1$ loss with

$ C_1 \frac{d+\log(1/\delta)}{\epsilon^2} \leq m_{\mathcal{H}}(\epsilon, \delta) \leq C_2 \frac{d + \log(1/\delta)}{\epsilon^2}\ .$
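The constants $C_1$ and $C_2$ are not pinned down by the theorem, but the scaling is already informative. The following sketch evaluates the $(d + \log(1/\delta))/\epsilon^2$ term for made-up parameter values, ignoring the constants:

```python
import math

def m_shape(d, eps, delta):
    """The (d + log(1/delta)) / eps^2 term, up to the unspecified constants."""
    return (d + math.log(1 / delta)) / eps ** 2

print(m_shape(d=3, eps=0.10, delta=0.05))  # ~600
print(m_shape(d=3, eps=0.05, delta=0.05))  # ~2400: halving eps quadruples m
```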