
SeminarLearningTheory

Website for the Seminar on Learning Theory, taught WS18/19 by Michael Kamp and Pascal Welke

Minimum Description Length (MDL) Principle and Occam’s Razor

In this chapter, we discuss the minimum description length (MDL) principle as a special case of structural risk minimization (SRM) and its philosophical analogue, Occam’s Razor.

Minimum Description Length Principle

If we assume that $\mathcal{H}$ is countable, we can decompose it into singleton hypothesis classes:

$\mathcal{H} = \bigcup_{n \in \mathbb{N}} \{h_n\}.$

In this setting, each hypothesis $h_n$ is the only element of its class, so:

$n(h_n) = n$

By the known sample complexity bound for finite hypothesis classes we get:

$m^{UC}_{\{h_n\}}(\epsilon,\delta) \leq \left\lceil\frac{log\left(\frac{2|\{h_n\}|}{\delta}\right)}{2\epsilon^2}\right\rceil = \left\lceil\frac{log\left(\frac{2}{\delta}\right)}{2\epsilon^2}\right\rceil$

and $\epsilon_n$ becomes:

$\epsilon_n(m,\delta) = \sqrt{\frac{log(\frac{2}{\delta})}{2m}}.$

To simplify notation, we can index the weight of a hypothesis not by $n$ as previously ($w(n(h))$), but directly by $h$ as $w(h)$. Altogether, the SRM rule becomes:

$\boxed{\min_{h \in \mathcal{H}} L_S(h) + \sqrt{\frac{-log(w(h)) + log(\frac{2}{\delta})}{2m}}}$
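
To make the rule concrete, here is a minimal Python sketch of this weighted SRM selection; the names (`srm_select`, `empirical_risk`, `weight`) are illustrative, and the empirical risks are assumed to be precomputed on the sample $S$.

```python
import math

def srm_select(hypotheses, empirical_risk, weight, m, delta):
    """Return the hypothesis minimizing L_S(h) + sqrt((-log w(h) + log(2/delta)) / (2m)).

    hypotheses:     list of candidate hypotheses
    empirical_risk: function mapping h to its empirical risk L_S(h)
    weight:         function mapping h to its weight w(h), with sum_h w(h) <= 1
    m:              sample size
    delta:          confidence parameter in (0, 1)
    """
    def objective(h):
        penalty = math.sqrt((-math.log(weight(h)) + math.log(2 / delta)) / (2 * m))
        return empirical_risk(h) + penalty

    return min(hypotheses, key=objective)
```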

(Prefix-Free) Description Languages

We want to use a special weighting scheme based on a description language: a function $d: \mathcal{H} \rightarrow \{0,1\}^*$ that assigns every hypothesis $h$ a binary string $d(h)$; we write $|h|$ for the length of $d(h)$.

We will need the description language to be prefix-free:

$\forall h,h' \in \mathcal{H}, h \neq h': d(h) \text{ is not a prefix of } d(h').$

Kraft Inequality: If $\mathcal{S}\subseteq \{0,1\}^*$ is prefix-free, then

$\sum_{\sigma \in \mathcal{S}}\frac{1}{2^{|\sigma|}} \leq 1$

Proof: Draw the bits of a binary sequence uniformly at random, one after another, and stop as soon as the prefix drawn so far equals a string in $\mathcal{S}$. Then the probability of stopping exactly at a given string is

$\forall \sigma \in \mathcal{S}: P(\sigma) = \frac{1}{2^{|\sigma|}}.$

Since $\mathcal{S}$ is prefix-free, the process can stop at no more than one string of $\mathcal{S}$, so these events are pairwise disjoint. Overall,

$\sum_{\sigma \in \mathcal{S}}\frac{1}{2^{|\sigma|}} = P(\text{the process stops at some } \sigma \in \mathcal{S}) \leq 1.$
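
As a sanity check, the following Python sketch verifies prefix-freeness and evaluates the Kraft sum on two small, hand-picked codeword sets; the helper names are illustrative.

```python
from itertools import combinations

def is_prefix_free(codewords):
    """Check that no codeword is a prefix of another (distinct) codeword."""
    return not any(a.startswith(b) or b.startswith(a)
                   for a, b in combinations(set(codewords), 2))

def kraft_sum(codewords):
    """Compute the sum over all codewords sigma of 2^(-|sigma|)."""
    return sum(2.0 ** -len(sigma) for sigma in set(codewords))

# A prefix-free set satisfies the Kraft inequality ...
S = ["0", "10", "110", "111"]
assert is_prefix_free(S)
assert kraft_sum(S) <= 1          # here the sum is exactly 1.0

# ... while dropping prefix-freeness can violate it.
T = ["0", "00", "01", "1"]
assert not is_prefix_free(T)
assert kraft_sum(T) > 1           # here the sum is 1.5
```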

Due to the Kraft inequality, we can use the description lengths as hypothesis weights, $w(h) = \frac{1}{2^{|h|}}$: applied to the prefix-free set $\{d(h) : h \in \mathcal{H}\}$, the inequality guarantees $\sum_{h \in \mathcal{H}} w(h) \leq 1$.

If we plug these weights into the SRM rule for singleton hypothesis classes, using that $-log(w(h)) = |h|$ (for the base-$2$ logarithm), we get the Minimum Description Length principle:

$\boxed{\min_{h \in \mathcal{H}}L_S(h) + \sqrt{\frac{|h| + log(\frac{2}{\delta})}{2m}}}$

It can be seen as a tradeoff between the empirical risk of a hypothesis $L_S(h)$ and the complexity of describing $h$.
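
A minimal Python sketch of the resulting selection rule, assuming the description length $|h|$ (in bits) is available for each candidate; the names `mdl_select` and `description_length` are illustrative.

```python
import math

def mdl_select(hypotheses, empirical_risk, description_length, m, delta):
    """Return the hypothesis minimizing L_S(h) + sqrt((|h| + log(2/delta)) / (2m)),
    where |h| is the length in bits of the prefix-free description d(h)."""
    def objective(h):
        penalty = math.sqrt((description_length(h) + math.log(2 / delta)) / (2 * m))
        return empirical_risk(h) + penalty

    return min(hypotheses, key=objective)
```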

Occam’s Razor

A philosophical view of MDL is the so-called Occam’s Razor:

“A short explanation tends to be more valid than a long one”

Consistency

Consistency is a notion even weaker than nonuniform learning. It allows the sample complexity to depend also on the data-generating distribution:

Algorithm $A$ is consistent w.r.t. $\mathcal{H}$ and a set of distributions $\mathcal{P}$ if there exists a function

$m_{\mathcal{H}}^{CON}:(0,1)^2\times\mathcal{H}\times\mathcal{P} \rightarrow \mathbb{N}$

such that for all $\epsilon, \delta \in (0,1)$, all $h \in \mathcal{H}$, and all $\mathcal{D} \in \mathcal{P}$:

if $S$ is drawn i.i.d. from $\mathcal{D}$ with $m \geq m_{\mathcal{H}}^{CON}(\epsilon, \delta, h, \mathcal{D})$ samples, then with probability at least $1-\delta$ over the choice of $S$:

$L_\mathcal{D}(A(S)) \leq L_\mathcal{D}(h) + \epsilon$

A Consistent Algorithm

Given an instance $x$, the algorithm Memorize returns the majority label among those instances in the sample $S$ that are equal to $x$. If no instance in $S$ equals $x$, it predicts the majority label of $S$. It can be shown that this algorithm is consistent.
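
A minimal Python sketch of Memorize, assuming hashable instances and an explicit training sample; the class and method names are illustrative.

```python
from collections import Counter

class Memorize:
    """Predict the majority label among training instances equal to x;
    if x was never seen in the sample, fall back to the overall majority label."""

    def fit(self, X, y):
        # Group the observed labels by instance and remember the global majority label.
        self.labels_by_instance = {}
        for x, label in zip(X, y):
            self.labels_by_instance.setdefault(x, []).append(label)
        self.default_label = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, x):
        labels = self.labels_by_instance.get(x)
        if labels is None:
            return self.default_label
        return Counter(labels).most_common(1)[0][0]

# Example: memorizes the sample and falls back to the majority label "b" on unseen instances.
clf = Memorize().fit(["x1", "x2", "x2", "x3"], ["a", "b", "b", "b"])
assert clf.predict("x1") == "a"
assert clf.predict("unseen") == "b"
```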

This example makes it clear that consistency is too weak to capture “learning” (or, more precisely, the ability to generalize).

Comparison of Notions for Learning

The following table compares the three different notions of learning in terms of bounds and prior knowledge.

| | Bounds on the true error based on the empirical risk | Bounds on the quality of the output compared to the best hypothesis in the class, based on the sample size | Encode prior knowledge |
| --- | --- | --- | --- |
| (agn.) PAC | $\checkmark$ | $\checkmark$ (in advance) | $\checkmark$ (by specifying an $\mathcal{H}$) |
| Nonuniform | $\checkmark$ (e.g. Thm 7.4 in the book) | $\checkmark$ (depends on the best $h \in \mathcal{H}$) | $\checkmark$ (weights) |
| Consistent | $\times$ | $\times$ | $\times$ |