Kernel Methods
§1. Embeddings into Feature Space
Not all labeled sets can be separated with a hyperplane. Consider the set
$$X = \{-3,-2,-1,0,1,2,3\} \subset \mathbb{R}$$
together with the labels $l(x) = +1$ if $\lvert x \rvert > 2$ and $l(x) = -1$ otherwise for $x \in X$. Now $X$ cannot be separated with a hyperplane. However, if we consider the map $\psi:\mathbb{R} \rightarrow \mathbb{R}^2, x \mapsto (x,x^2)$, then the image $\psi(X)$, with labels unchanged, can be separated using a hyperplane, since $\psi$ places $X$ onto a parabola. This hyperplane is given by
$$H = \{ z \in \mathbb{R}^2 : (w, z) = b \},$$
where $w = (0,1)^T$ and $b=5$. The target space $\mathbb{R}^2$ of $\psi$ is called the feature space and the map $\psi$ is called the feature map. The general procedure for feature space embeddings is as follows (a small sketch in code follows the list).
- For a given domain $X$ choose a map $\psi:X \rightarrow \mathcal{F}$ into some Hilbert space $\mathcal{F}$, called the feature space.
- For $m$ labeled examples $S = \{(x_1, y_1), \dots, (x_m,y_m)\}$ consider the embedded examples $\hat{S} = \{(\psi(x_1), y_1), \dots, (\psi(x_m), y_m)\}$.
- Train a predictor $h$ on $\hat{S}$.
- The predicted label of a new point $x$ is given by $h(\psi(x))$.
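As a minimal sketch of this procedure (assuming the toy set $X = \{-3,\dots,3\}$ reconstructed above, written in Python with NumPy), we can embed the points with $\psi(x) = (x,x^2)$ and check that the stated hyperplane separates the embedded sample:

```python
import numpy as np

# Toy set from the parabola example: labels +1 if |x| > 2 and -1 otherwise.
X = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.where(np.abs(X) > 2, 1.0, -1.0)

# Feature map psi(x) = (x, x^2): embeds the points onto a parabola in R^2.
def psi(x):
    return np.stack([x, x ** 2], axis=-1)

# The hyperplane (w, z) = b with w = (0, 1)^T and b = 5 separates psi(X):
# sign((w, psi(x)) - b) recovers the labels.
w, b = np.array([0.0, 1.0]), 5.0
predictions = np.sign(psi(X) @ w - b)
assert np.array_equal(predictions, y)
```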
Observe that for a given probability distribution $\mathcal{D}$ on $X \times Y$ we can define a probability distribution $\mathcal{D}^{\psi}$ on $\mathcal{F} \times Y$ by letting $\mathcal{D}^{\psi}(A) = \mathcal{D}(\psi^{-1}(A))$ for subsets $A \subseteq \mathcal{F} \times Y$ (where $\psi$ is applied to the first coordinate only). Hence for every predictor $h$
$$L_{\mathcal{D}^{\psi}}(h) = L_{\mathcal{D}}(h \circ \psi),$$
so minimizing the loss in the feature space is the same as minimizing the loss of $h \circ \psi$ on the original domain.
In order to understand the expressive power of feature space embeddings we consider the case of degree-$k$ multivariate polynomials $p: \mathbb{R}^n \rightarrow \mathbb{R}$. They can be expressed as
$$p(x) = \sum_{J \in [n]^r,\, r \leq k} w_J \prod_{i=1}^{r} x_{J_i}$$
with coefficients $w_J \in \mathbb{R}$. We can write $p$ as a hyperplane $p(x) = (\psi(x),w)$ in some feature space $\mathbb{R}^d$. In order to do this we consider the feature map
$$\psi(x) = \Big( \prod_{i=1}^{r} x_{J_i} \Big)_{J \in [n]^r,\, r \leq k}$$
and put $w = (w_J)_{J \in [n]^r,\, r \leq k}$.
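To make the coordinates concrete, here is a minimal sketch of this feature map (the name `poly_features` is illustrative, and we assume the empty index $r = 0$ contributes the constant monomial $1$):

```python
import itertools
import numpy as np

# One coordinate per multi-index J in [n]^r for every r <= k; the empty
# product (r = 0) yields the constant monomial 1.
def poly_features(x, k):
    n = len(x)
    feats = []
    for r in range(k + 1):
        for J in itertools.product(range(n), repeat=r):
            feats.append(np.prod([x[i] for i in J]))
    return np.array(feats)

x = np.array([2.0, 3.0])
print(poly_features(x, 2))  # [1, 2, 3, 4, 6, 6, 9]: monomials 1, x1, x2, x1^2, x1*x2, x2*x1, x2^2
```

Note that the dimension $d = \sum_{r \leq k} n^r$ grows exponentially in $k$, which is precisely what motivates the kernel trick below.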
§2. The Kernel Trick
A kernel on a domain $X$ is a map $K: X \times X \rightarrow \mathbb{R}$ for which there exists a feature map $\psi : X \rightarrow \mathcal{F}$ into some Hilbert space $\mathcal{F}$ such that for all $x,y \in X$
$$K(x,y) = (\psi(x), \psi(y))_{\mathcal{F}},$$
where $(-,-)_{\mathcal{F}}$ is the scalar product on $\mathcal{F}$.
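For the polynomial feature map above, the induced kernel has a closed form: by the multinomial expansion over ordered tuples, $(\psi(x), \psi(y)) = \sum_{r \leq k} (x,y)^r$, so $K$ can be evaluated without ever materializing the $\sum_{r \leq k} n^r$ feature coordinates. A small numerical sanity check (a sketch; function names are illustrative):

```python
import itertools
import numpy as np

# Closed-form kernel for the polynomial feature map: sum_{r <= k} (x, y)^r.
def K(x, y, k):
    s = float(np.dot(x, y))
    return sum(s ** r for r in range(k + 1))

# Same inner product computed coordinate-by-coordinate in feature space.
def explicit_inner_product(x, y, k):
    n = len(x)
    total = 0.0
    for r in range(k + 1):
        for J in itertools.product(range(n), repeat=r):
            total += np.prod([x[i] for i in J]) * np.prod([y[i] for i in J])
    return total

rng = np.random.default_rng(0)
x, y = rng.normal(size=3), rng.normal(size=3)
assert np.isclose(explicit_inner_product(x, y, k=3), K(x, y, k=3))
```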
Theorem (Representer Theorem): Let $\psi :X \rightarrow \mathcal{F}$ be a feature map into a Hilbert space $\mathcal{F}$. Consider the optimization problem for $x_1, \dots, x_m \in X$
$$\min_{w \in \mathcal{F}} \; f\big( (w, \psi(x_1)), \dots, (w, \psi(x_m)) \big) + R(\lVert w \rVert),$$
where $f:\mathbb{R}^m \rightarrow \mathbb{R}$ is an arbitrary function and $R: \mathbb{R}_{\geq 0} \rightarrow \mathbb{R}$ is monotonically nondecreasing. Then there exists a vector $\alpha \in \mathbb{R}^m$ such that $w = \sum_{i=1}^m \alpha_i \psi(x_i)$ is an optimal solution.
Proof: Let $w^*$ be an optimal solution to the optimization problem. Since $w^* \in \mathcal{F}$ and $\mathcal{F}$ is a Hilbert space we can write
$$w^* = \sum_{i=1}^m \alpha_i \psi(x_i) + u$$
with $\alpha_i \in \mathbb{R}$ and $(u, \psi(x_i)) = 0$ for all $i \leq m$. Let $w = w^* - u$. Since $(w, u) = 0$, the Pythagorean identity gives $\lVert w^* \rVert^2 = \lVert w \rVert^2 + \lVert u \rVert^2$ and hence $\lVert w \rVert \leq \lVert w^* \rVert$. By monotonicity of $R$ it follows that $R(\lVert w \rVert) \leq R(\lVert w^* \rVert)$. Further we have
$$(w, \psi(x_i)) = (w^* - u, \psi(x_i)) = (w^*, \psi(x_i))$$
for all $i \leq m$. It follows that for our minimization problem
$$f\big( (w, \psi(x_1)), \dots, (w, \psi(x_m)) \big) + R(\lVert w \rVert) \leq f\big( (w^*, \psi(x_1)), \dots, (w^*, \psi(x_m)) \big) + R(\lVert w^* \rVert),$$
which means that $w$ is an optimal solution. $\square$
Assume that $\psi$ is the feature map of a kernel $K$. We can rewrite the optimization problem in terms of kernel evaluations only. This is called the kernel trick. The Representer Theorem tells us that an optimal solution $w^*$ lies in the span
$$\operatorname{span}\{ \psi(x_1), \dots, \psi(x_m) \}.$$
This means that we can write $w = \sum_{j=1}^m \alpha_j \psi(x_j)$ and optimize over all $\alpha_j \in \mathbb{R}$ instead. We then have for all $i \leq m$
$$(w, \psi(x_i)) = \sum_{j=1}^m \alpha_j (\psi(x_j), \psi(x_i)) = \sum_{j=1}^m \alpha_j K(x_j, x_i).$$
In a similar way we find
$$\lVert w \rVert^2 = \sum_{j=1}^m \sum_{j'=1}^m \alpha_j \alpha_{j'} K(x_j, x_{j'}).$$
All together the optimization problem in question can be rewritten as
$$\min_{\alpha \in \mathbb{R}^m} \; f\Big( \sum_{j=1}^m \alpha_j K(x_j, x_1), \dots, \sum_{j=1}^m \alpha_j K(x_j, x_m) \Big) + R\Big( \sqrt{ \textstyle\sum_{j,j'} \alpha_j \alpha_{j'} K(x_j, x_{j'}) } \Big).$$
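In matrix form, with Gram matrix $G_{ij} = K(x_i, x_j)$, the quantities above become $(w, \psi(x_i)) = (G\alpha)_i$ and $\lVert w \rVert^2 = \alpha^T G \alpha$, so the optimization runs over $\alpha \in \mathbb{R}^m$ using kernel evaluations only. A sketch of this bookkeeping (the Gaussian kernel and the helper name `gram_matrix` are illustrative choices, not part of the text above):

```python
import numpy as np

# Build the m x m Gram matrix G[i, j] = K(x_i, x_j) for a given kernel.
def gram_matrix(xs, kernel):
    m = len(xs)
    return np.array([[kernel(xs[i], xs[j]) for j in range(m)] for i in range(m)])

rng = np.random.default_rng(1)
xs = rng.normal(size=(5, 2))
gauss = lambda x, y: np.exp(-np.sum((x - y) ** 2))  # an example kernel
G = gram_matrix(xs, gauss)

alpha = rng.normal(size=5)
margins = G @ alpha               # the m values (w, psi(x_i))
squared_norm = alpha @ G @ alpha  # ||w||^2
```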
§3. Implementing Soft-SVM with Kernels
Let us consider the Soft-SVM optimization problem in the feature space $\mathcal{F}$
$$\min_{w \in \mathcal{F}} \; \frac{\lambda}{2} \lVert w \rVert^2 + \frac{1}{m} \sum_{i=1}^m \max\{0,\, 1 - y_i (w, \psi(x_i))\},$$
where $\lambda > 0$ is the regularization parameter and $y_i$ is the label associated to observation $x_i$. Observe that each iterate $w^t$ of SGD is of the form $w^t = \sum_{j=1}^m \alpha_j^t \psi(x_j)$. Hence we can maintain the corresponding coefficients $\alpha_i^t$, $i \leq m$, instead. We write $K$ for the kernel induced by $\psi$. The Kernel Soft-SVM algorithm takes the following form.
Definition (SGD for Solving Soft-SVM with Kernels):
Input: a sample $\{(x_1, y_1), \dots, (x_m, y_m)\}$, a kernel $K$, $T \in \mathbb{N}$, $\lambda > 0$
- Initialize $\beta^1 =0$
- For $t = 1…T$
- $\alpha^t = \frac{1}{\lambda t} \beta^t$
- Choose $i \in [m]$ uniformly at random
- For all $j \neq i$
- $\beta_j ^{t+1}=\beta_j ^{t}$
- If $y_i \sum_{j=1}^m \alpha_j ^t K(x_j, x_i) < 1$
- $\beta_i ^{t+1}=\beta_i ^{t} + y_i$
- Else
- $\beta_i ^{t+1}=\beta_i ^{t}$
- Return $\overline{w} = \sum_{j=1}^m \overline{\alpha}_j \psi(x_j)$ where $\overline{\alpha} = \frac{1}{T} \sum_{t=1}^T \alpha^t$.
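A sketch of this pseudocode in Python/NumPy (the function name and the choice of a precomputed Gram matrix are implementation assumptions); it returns $\overline{\alpha}$, from which $\overline{w}$ is recovered as $\sum_j \overline{\alpha}_j \psi(x_j)$:

```python
import numpy as np

def kernel_soft_svm_sgd(K_matrix, y, lam, T, rng=None):
    """SGD for Soft-SVM with kernels, following the pseudocode above.

    K_matrix[i, j] = K(x_i, x_j); returns the averaged coefficients
    alpha_bar, with w_bar = sum_j alpha_bar[j] * psi(x_j)."""
    if rng is None:
        rng = np.random.default_rng()
    m = len(y)
    beta = np.zeros(m)       # beta^1 = 0
    alpha_sum = np.zeros(m)  # accumulates alpha^t for the final average
    for t in range(1, T + 1):
        alpha = beta / (lam * t)        # alpha^t = beta^t / (lambda * t)
        alpha_sum += alpha
        i = rng.integers(m)             # choose i in [m] uniformly at random
        # Update beta_i only when the margin condition is violated.
        if y[i] * (K_matrix[:, i] @ alpha) < 1:
            beta[i] += y[i]
    return alpha_sum / T                # alpha_bar = (1/T) * sum_t alpha^t
```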
The following lemma tells us that this algorithm is equivalent to running SGD in the feature space.
Lemma: Let $\hat{w}$ be the output of SGD in the feature space. Let $\overline{w}=\sum_{j=1}^m \overline{\alpha}_j \psi(x_j)$ be the output of the above algorithm. Then $\hat{w}=\overline{w}$.
Proof: It is enough to show that for all $t \leq T$
$$\theta^t = \sum_{j=1}^m \beta_j^t \psi(x_j),$$
where $\theta^t$ is the unscaled iterate of SGD run in the feature space. By definition $\alpha^t = \frac{1}{\lambda t} \beta^t$ and $w^t = \frac{1}{\lambda t} \theta^t$. It follows that
$$w^t = \frac{1}{\lambda t} \sum_{j=1}^m \beta_j^t \psi(x_j) = \sum_{j=1}^m \alpha_j^t \psi(x_j)$$
for all $t \leq T$, and averaging over $t$ gives $\hat{w} = \overline{w}$.
To show that $\theta^t = \sum_j \beta_j^t \psi(x_j)$ we use induction on $t$. For $t=1$ the claim is obviously true since $\theta^1 = 0$ and $\beta^1 = 0$. Assume the claim is true for $t \geq 1$. Then
$$y_i (w^t, \psi(x_i)) = y_i \sum_{j=1}^m \alpha_j^t (\psi(x_j), \psi(x_i)) = y_i \sum_{j=1}^m \alpha_j^t K(x_j, x_i),$$
which means that the conditions checked in both algorithms are the same. For the update rule of $\theta^t$ we have, in the case of an update,
$$\theta^{t+1} = \theta^t + y_i \psi(x_i) = \sum_{j=1}^m \beta_j^t \psi(x_j) + y_i \psi(x_i) = \sum_{j=1}^m \beta_j^{t+1} \psi(x_j)$$
(and $\theta^{t+1} = \theta^t$, $\beta^{t+1} = \beta^t$ otherwise), which means that the claim is true for $t+1$. This concludes the proof. $\square$
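The lemma can also be checked numerically: running explicit SGD on the embedded points $\psi(x) = (x, x^2)$ and the kernelized algorithm with the same random index choices should yield the same predictor. A sketch under these assumptions, reusing `kernel_soft_svm_sgd` from above and the toy data from §1:

```python
import numpy as np

# Explicit (feature-space) SGD for Soft-SVM, maintaining theta^t directly.
def feature_space_sgd(Psi, y, lam, T, rng):
    m, d = Psi.shape
    theta = np.zeros(d)
    w_sum = np.zeros(d)
    for t in range(1, T + 1):
        w = theta / (lam * t)   # w^t = theta^t / (lambda * t)
        w_sum += w
        i = rng.integers(m)
        if y[i] * (Psi[i] @ w) < 1:
            theta += y[i] * Psi[i]
    return w_sum / T            # w_hat = (1/T) * sum_t w^t

X = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y = np.where(np.abs(X) > 2, 1.0, -1.0)
Psi = np.stack([X, X ** 2], axis=1)  # psi(x) = (x, x^2)
G = Psi @ Psi.T                      # Gram matrix of the induced kernel

lam, T = 0.1, 200
w_hat = feature_space_sgd(Psi, y, lam, T, np.random.default_rng(42))
alpha_bar = kernel_soft_svm_sgd(G, y, lam, T, np.random.default_rng(42))
assert np.allclose(w_hat, alpha_bar @ Psi)  # w_hat == w_bar
```

Seeding both runs identically makes them draw the same indices $i$, so the assertion verifies exactly the equality $\hat{w} = \overline{w}$ proved above.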