
SeminarFromTheoryToAlgorithms

Website of the Seminar on the Second Part of the Book "Understanding Machine Learning", taught in Summer Term 2019 by Pascal Welke and Michael Kamp

Kernel Methods

§1. Embeddings into Feature Space

Not all labeled sets can be separated with a hyperplane. Consider the set

$$X = \{-10, -9, \ldots, 9, 10\} \subset \mathbb{R}$$

together with the labels $l(x) = +1$ if $\lvert x \rvert > 2$ and $l(x) = -1$ otherwise for $x \in X$. Now $X$ cannot be separated with a hyperplane. However, if we consider the map $\psi:\mathbb{R} \rightarrow \mathbb{R}^2, x \mapsto (x,x^2)$, then the image $\psi(X)$ with unchanged labels can be separated using a hyperplane, since $\psi$ maps $X$ onto a parabola. This hyperplane is given by

$$\{ z \in \mathbb{R}^2 : (z, w) - b = 0 \},$$

where $w = (0,1)^T$ and $b=5$. The space $\mathbb{R}^2$ into which $\psi$ maps is called feature space and the map $\psi$ is called feature map. The general procedure for feature space embeddings is as follows.

  1. For a given domain $X$ choose a map $\psi:X \rightarrow \mathcal{F}$ into some Hilbert space $\mathcal{F}$, called feature space.
  2. For $m$ labeled examples $S = \{(x_1, y_1), \ldots, (x_m,y_m)\}$ consider the embedded examples $\hat{S} = \{(\psi(x_1), y_1), \ldots, (\psi(x_m), y_m)\}$.
  3. Train a predictor $h$ on $\hat{S}$.
  4. The classified label of a new point $x$ is given by $h(\psi(x))$.
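
The following is a minimal Python sketch of these four steps for the parabola example above; NumPy, the helper names `psi` and `h`, and the concrete integer set are illustrative choices rather than part of the text.

```python
import numpy as np

# Steps 1-4 for the parabola example: embed X with psi(x) = (x, x^2)
# and classify with the hyperplane given by w = (0, 1)^T and b = 5.
X = np.arange(-10, 11)                 # the labeled set from the example (assumed integers)
y = np.where(np.abs(X) > 2, 1, -1)     # labels l(x) = +1 iff |x| > 2

def psi(x):
    """Feature map psi: R -> R^2, x -> (x, x^2)."""
    return np.stack([x, x**2], axis=-1)

w, b = np.array([0.0, 1.0]), 5.0       # separating hyperplane in feature space

def h(x):
    """Predict the label of x as the side of the hyperplane that psi(x) falls on."""
    return np.sign(psi(x) @ w - b).astype(int)

assert np.all(h(X) == y)               # the embedded set is linearly separated
```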

Observe that for a given probability distribution $\mathcal{D}$ on $X \times Y$ we can define a probability distribution $\mathcal{D}^{\psi}$ on $\mathcal{F} \times Y$ by letting $\mathcal{D}^{\psi}(A) = \mathcal{D}(\psi^{-1}(A))$ for subsets $A \subseteq \mathcal{F} \times Y$ (with $\psi$ applied to the first component). Hence for every predictor $h$ on the feature space

$$L_{\mathcal{D}^{\psi}}(h) = L_{\mathcal{D}}(h \circ \psi),$$

so learning $h$ on the embedded data is the same as learning $h \circ \psi$ on the original domain.

In order to understand the expressive power of feature space embeddings we consider the case of degree-$k$ multivariate polynomials $p: \mathbb{R}^n \rightarrow \mathbb{R}$. They can be expressed as

$$p(x) = \sum_{J \in [n]^r,\, r \leq k} w_J \prod_{i=1}^{r} x_{J_i}$$

with coefficients $w_J \in \mathbb{R}$. We can write $p$ as a linear function $p(x) = (\psi(x),w)$ in some feature space $\mathbb{R}^d$. In order to do this we consider the feature map

$$\psi: \mathbb{R}^n \rightarrow \mathbb{R}^d, \quad \psi(x) = \Big( \prod_{i=1}^{r} x_{J_i} \Big)_{J \in [n]^r,\, r \leq k}$$

and put $w = (w_J)_{J \in [n]^r,\, r \leq k}$, where $d$ is the number of such multi-indices $J$.
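
A short sketch of this feature map for small $n$ and $k$, enumerating the index set $[n]^r$ for $r \leq k$ with `itertools` and checking $p(x) = (\psi(x), w)$ on random data; the concrete values of $n$, $k$ and the random coefficients are placeholders.

```python
import itertools
import numpy as np

n, k = 3, 2
# All multi-indices J in [n]^r for r <= k (the empty index gives the constant term).
index_set = [J for r in range(k + 1) for J in itertools.product(range(n), repeat=r)]

def psi(x):
    """psi(x) = (prod_i x_{J_i})_{J in [n]^r, r <= k}; the empty product is 1."""
    return np.array([np.prod([x[i] for i in J]) for J in index_set])

rng = np.random.default_rng(0)
w = rng.normal(size=len(index_set))    # coefficients w_J of the polynomial p
x = rng.normal(size=n)

p_x = sum(w_J * np.prod([x[i] for i in J]) for w_J, J in zip(w, index_set))
assert np.isclose(p_x, psi(x) @ w)     # p(x) = (psi(x), w) in the feature space R^d
```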

§2. The Kernel Trick

A kernel on a domain $X$ is a map $K: X \times X \rightarrow \mathbb{R}$ for which there exists a feature map $\psi : X \rightarrow \mathcal{F}$ into some Hilbert space $\mathcal{F}$ such that for all $x,y \in X$

$$K(x,y) = (\psi(x), \psi(y))_{\mathcal{F}},$$

where $(-,-)_{\mathcal{F}}$ is the scalar product on $\mathcal{F}$.
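
As an example (not taken from the text above), the polynomial kernel $K(x,y) = (1 + (x,y))^k$ on $\mathbb{R}^n$ is a kernel: it is the inner product of the feature map listing all products of $k$ coordinates of $x' = (1, x_1, \ldots, x_n)$. The small sketch below checks this numerically; $n$, $k$ and the random points are placeholders.

```python
import itertools
import numpy as np

n, k = 3, 2

def psi(x):
    """Feature map whose coordinates are all products of k entries of x' = (1, x)."""
    xp = np.concatenate([[1.0], x])
    return np.array([np.prod([xp[i] for i in J])
                     for J in itertools.product(range(n + 1), repeat=k)])

def K(x, y):
    """Polynomial kernel of degree k."""
    return (1.0 + x @ y) ** k

rng = np.random.default_rng(1)
x, y = rng.normal(size=n), rng.normal(size=n)
assert np.isclose(K(x, y), psi(x) @ psi(y))   # K(x, y) = (psi(x), psi(y))
```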

Theorem (Representer Theorem): Let $\psi :X \rightarrow \mathcal{F}$ be a feature map into a Hilbert space $\mathcal{F}$. Consider the optimization problem for $x_1, \ldots, x_m \in X$

$$\min_{w \in \mathcal{F}} \; f\big( (w, \psi(x_1)), \ldots, (w, \psi(x_m)) \big) + R(\lVert w \rVert),$$

where $f:\mathbb{R}^m \rightarrow \mathbb{R}$ is an arbitrary function and $R: [0,\infty) \rightarrow \mathbb{R}$ is monotonically non-decreasing. Then there exists a vector $\alpha \in \mathbb{R}^m$ such that $w = \sum_{i=1}^m \alpha_i \psi(x_i)$ is an optimal solution.

Proof: Let $w^*$ be an optimal solution to the optimization problem. Since $w^* \in \mathcal{F}$ and $\mathcal{F}$ is a Hilbert space, we can decompose $w^*$ orthogonally with respect to $\operatorname{span}\{\psi(x_1), \ldots, \psi(x_m)\}$ and write

$$w^* = \sum_{i=1}^m \alpha_i \psi(x_i) + u$$

with $\alpha_1, \ldots, \alpha_m \in \mathbb{R}$ and $(u, \psi(x_i)) = 0$ for all $i \leq m$. Let $w = w^* - u$. Since $(w, u) = 0$, the Pythagorean identity gives $\lVert w^* \rVert^2 = \lVert w \rVert^2 + \lVert u \rVert^2$ and hence $\lVert w \rVert \leq \lVert w^* \rVert$. By monotonicity of $R$ it follows that $R(\lVert w \rVert) \leq R(\lVert w^* \rVert)$. Further we have

$$(w, \psi(x_i)) = (w^* - u, \psi(x_i)) = (w^*, \psi(x_i))$$

for all $i \leq m$. It follows that for our minimization problem

$$f\big( (w, \psi(x_1)), \ldots, (w, \psi(x_m)) \big) + R(\lVert w \rVert) \;\leq\; f\big( (w^*, \psi(x_1)), \ldots, (w^*, \psi(x_m)) \big) + R(\lVert w^* \rVert),$$

which means that $w$ is an optimal solution. $\square$
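
For a concrete instance of the theorem, take $f$ to be the squared loss and $R(a) = \lambda a^2$, i.e. regularized least squares; its optimum is available in closed form, so one can check numerically that it is a linear combination of the $\psi(x_i)$. The feature map, the data and $\lambda$ in the sketch below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)
m, lam = 20, 0.5
x = rng.normal(size=m)
y = np.sin(x)

def psi(x):
    """Explicit feature map into R^4 (constant, linear, quadratic, cubic terms)."""
    return np.stack([np.ones_like(x), x, x**2, x**3], axis=-1)

Phi = psi(x)                                  # rows are psi(x_i)

# Optimal w in feature space for sum_i ((w, psi(x_i)) - y_i)^2 + lam * ||w||^2.
w_star = np.linalg.solve(Phi.T @ Phi + lam * np.eye(4), Phi.T @ y)

# Representer theorem: the same optimum is a combination of the psi(x_i),
# with alpha = (G + lam * I)^{-1} y and G the Gram matrix of the data.
G = Phi @ Phi.T
alpha = np.linalg.solve(G + lam * np.eye(m), y)
assert np.allclose(w_star, Phi.T @ alpha)     # w^* = sum_i alpha_i psi(x_i)
```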

Assume that $\psi$ is the feature map of a kernel $K$. We can rewrite the optimization problem in terms of kernel evaluations only. This is called the kernel trick. The Representer Theorem tells us that an optimal solution $w^*$ lies in the span

$$\operatorname{span}\{\psi(x_1), \ldots, \psi(x_m)\}.$$

This means that we can write $w = \sum_{j=1}^m \alpha_j \psi(x_j)$ and optimize over all $\alpha_j \in \mathbb{R}$ instead. We then have for all $i \leq m$

$$(w, \psi(x_i)) = \sum_{j=1}^m \alpha_j (\psi(x_j), \psi(x_i)) = \sum_{j=1}^m \alpha_j K(x_j, x_i).$$

In a similar way we find

$$\lVert w \rVert^2 = (w, w) = \sum_{i,j=1}^m \alpha_i \alpha_j K(x_i, x_j).$$

Altogether the optimization problem in question can be rewritten as

$$\min_{\alpha \in \mathbb{R}^m} \; f\Big( \sum_{j=1}^m \alpha_j K(x_j, x_1), \ldots, \sum_{j=1}^m \alpha_j K(x_j, x_m) \Big) + R\Big( \sqrt{\sum_{i,j=1}^m \alpha_i \alpha_j K(x_i, x_j)} \Big).$$
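
In code, this rewriting amounts to replacing $w$ by the coefficient vector $\alpha$ and every inner product by an entry of the Gram matrix $G_{ij} = K(x_i, x_j)$; the Gaussian kernel and the random data below are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(10, 2))                  # some data points x_1, ..., x_m

def K(x, y):
    """Gaussian kernel, used here only as an illustrative example of a kernel."""
    return np.exp(-np.sum((x - y) ** 2))

G = np.array([[K(xi, xj) for xj in X] for xi in X])   # Gram matrix G_ij = K(x_i, x_j)
alpha = rng.normal(size=len(X))               # coefficients of w = sum_j alpha_j psi(x_j)

inner_products = G @ alpha                    # (w, psi(x_i)) for i = 1, ..., m
norm_w_squared = alpha @ G @ alpha            # ||w||^2 = sum_ij alpha_i alpha_j K(x_i, x_j)
```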

§3. Implementing Soft-SVM with Kernels

Let us consider the Soft-SVM optimization problem in the feature space $\mathcal{F}$

$$\min_{w \in \mathcal{F}} \; \frac{\lambda}{2} \lVert w \rVert^2 + \frac{1}{m} \sum_{i=1}^m \max\{0,\, 1 - y_i (w, \psi(x_i))\},$$

where $\lambda > 0$ is the regularization parameter and $y_i \in \{\pm 1\}$ is the label associated with observation $x_i$. Observe that for each iteration $w^t$ of SGD we have $w^t \in \operatorname{span}\{\psi(x_1), \ldots, \psi(x_m)\}$, say $w^t = \sum_{j=1}^m \alpha_j^t \psi(x_j)$. Hence we can maintain the corresponding coefficients $\alpha_i^t$, $i \leq m$, instead. We write $K$ for the kernel induced by $\psi$. The Kernel Soft-SVM algorithm takes the following form.

Definition (SGD for Solving Soft-SVM with Kernels):

Input: $T \in \mathbb{N}$, $\lambda > 0$

  1. Initialize $\beta^1 = 0 \in \mathbb{R}^m$.
  2. For $t = 1, \ldots, T$: set $\alpha^t = \frac{1}{\lambda t} \beta^t$, choose $i \in \{1, \ldots, m\}$ uniformly at random, set $\beta_j^{t+1} = \beta_j^t$ for all $j \neq i$, and set $\beta_i^{t+1} = \beta_i^t + y_i$ if $y_i \sum_{j=1}^m \alpha_j^t K(x_j, x_i) < 1$ and $\beta_i^{t+1} = \beta_i^t$ otherwise.
  3. Output $\overline{w} = \sum_{j=1}^m \overline{\alpha}_j \psi(x_j)$ with $\overline{\alpha} = \frac{1}{T} \sum_{t=1}^T \alpha^t$.
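
A Python sketch of this procedure; the function name, the precomputed Gram matrix and the seed handling are implementation choices. A new point $x$ is then classified as $\operatorname{sign}\big(\sum_j \overline{\alpha}_j K(x_j, x)\big)$.

```python
import numpy as np

def kernel_soft_svm_sgd(K, X, y, lam, T, seed=0):
    """Return the averaged coefficients alpha_bar with w_bar = sum_j alpha_bar_j psi(x_j)."""
    rng = np.random.default_rng(seed)
    m = len(X)
    G = np.array([[K(xi, xj) for xj in X] for xi in X])   # Gram matrix G_ij = K(x_i, x_j)
    beta = np.zeros(m)
    alpha_sum = np.zeros(m)
    for t in range(1, T + 1):
        alpha = beta / (lam * t)              # alpha^t = beta^t / (lambda * t)
        alpha_sum += alpha
        i = rng.integers(m)                   # choose an example uniformly at random
        if y[i] * (G[i] @ alpha) < 1:         # y_i (w^t, psi(x_i)) = y_i sum_j alpha_j^t K(x_j, x_i)
            beta[i] += y[i]                   # the only coordinate of beta that changes
    return alpha_sum / T                      # alpha_bar = (1/T) sum_t alpha^t
```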

The following lemma tells us that this algorithm is equivalent to running SGD in the feature space.

Lemma: Let $\hat{w}$ be the output of SGD in the feature space. Let $\overline{w}=\sum_{j=1}^m \overline{\alpha}_j \psi(x_j)$ be the output of the above algorithm. Then $\hat{w}=\overline{w}$.

Proof: It is enough to show that for all $t\leq T$

$$\frac{1}{\lambda t} \theta^t = \sum_{j=1}^m \alpha_j^t \psi(x_j),$$

where $\theta^t$ is the result of running SGD in the feature space; averaging over $t$ then yields $\hat{w} = \overline{w}$. By definition $\alpha^t = \frac{1}{\lambda t}\beta^t$ and $w^t = \frac{1}{\lambda t} \theta^t$. It follows that it is enough to show

$$\theta^t = \sum_{j=1}^m \beta_j^t \psi(x_j).$$

To show that $\theta^t = \sum_j \beta_j^t \psi(x_j)$ we use induction on $t$. For $t=1$ the claim is obviously true, since $\theta^1 = 0$ and $\beta^1 = 0$. Assume the claim is true for $t \geq 1$. Then for the sampled index $i$

$$(w^t, \psi(x_i)) = \frac{1}{\lambda t} (\theta^t, \psi(x_i)) = \frac{1}{\lambda t} \sum_{j=1}^m \beta_j^t (\psi(x_j), \psi(x_i)) = \sum_{j=1}^m \alpha_j^t K(x_j, x_i),$$

which means that the conditions checked in both algorithms are the same. For the update rule of $\theta^t$ we have

$$\theta^{t+1} = \theta^t + \mathbb{1}\big[y_i (w^t, \psi(x_i)) < 1\big]\, y_i \psi(x_i) = \sum_{j=1}^m \beta_j^t \psi(x_j) + \mathbb{1}\big[y_i (w^t, \psi(x_i)) < 1\big]\, y_i \psi(x_i) = \sum_{j=1}^m \beta_j^{t+1} \psi(x_j),$$

which means that the claim is true for $t+1$. This concludes the proof. $\square$
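
The lemma can also be checked numerically for a kernel with an explicit feature map, e.g. $\psi(x) = (x, x^2)$ from §1, by feeding both variants of SGD the same sequence of sampled indices; the data, $\lambda$ and $T$ below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=12)
y = np.where(np.abs(X) > 1, 1, -1).astype(float)
lam, T = 0.1, 200
idx = rng.integers(len(X), size=T)            # shared sequence of sampled examples

psi = lambda x: np.stack([x, x**2], axis=-1)
Phi = psi(X)                                  # rows are psi(x_i)
G = Phi @ Phi.T                               # K(x_i, x_j) = (psi(x_i), psi(x_j))

# SGD on the Soft-SVM objective in the feature space.
theta, w_sum = np.zeros(2), np.zeros(2)
for t in range(1, T + 1):
    w = theta / (lam * t)
    w_sum += w
    i = idx[t - 1]
    if y[i] * (Phi[i] @ w) < 1:
        theta += y[i] * Phi[i]
w_hat = w_sum / T                             # output of SGD in the feature space

# Kernelized SGD with the same index sequence.
beta, alpha_sum = np.zeros(len(X)), np.zeros(len(X))
for t in range(1, T + 1):
    alpha = beta / (lam * t)
    alpha_sum += alpha
    i = idx[t - 1]
    if y[i] * (G[i] @ alpha) < 1:
        beta[i] += y[i]
alpha_bar = alpha_sum / T                     # output of the kernelized algorithm

assert np.allclose(w_hat, Phi.T @ alpha_bar)  # w_hat = sum_j alpha_bar_j psi(x_j)
```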