
SeminarFromTheoryToAlgorithms

Website of the Seminar on the Second Part of the Book "Understanding Machine Learning", taught in Summer Term 2019 by Pascal Welke and Michael Kamp

Neural Networks

A neural network is a computational model based on a layered network of neurons, each capable of a simple computation. For a given input, the network propagates signals through its layers and outputs a result in its last layer. Each neuron performs a simple computation on its inputs and passes the result to the next layer. A feed-forward network is one whose underlying graph contains no cycles; this is the type of network we work with here.

A neuron’s function can be divided into two steps: first, it computes a weighted sum of its inputs (the outputs of the neurons in the previous layer, or the input values if the neuron is in the first layer); second, it applies an activation function to the result of the weighted sum. This activation function can vary (e.g. sign, threshold, sigmoid, identity).

There are different ways of setting up a network: different numbers of layers, different numbers of neurons in each layer, and different activation functions. Each such configuration is a different architecture. For each of them, there is a set of parameters corresponding to the input weights of its neurons. A network architecture therefore defines a hypothesis class, and each assignment of weights forms a hypothesis in this class.
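To make the two-step computation and the role of the weights concrete, here is a minimal Python sketch (not part of the original notes); the 3-2-1 architecture, the sign activation and the random weights are illustrative assumptions, and each concrete choice of weights is one hypothesis of the fixed architecture.

```python
import numpy as np

def sign(z):
    # sign activation mapping to {-1, +1}
    return np.where(z >= 0, 1.0, -1.0)

def forward(x, weights, biases, activation=sign):
    """Propagate the input x through the layers defined by (weights, biases)."""
    a = x
    for W, b in zip(weights, biases):
        a = activation(W @ a + b)  # step 1: weighted sum, step 2: activation
    return a

# Example architecture (arbitrary): 3 inputs -> 2 hidden neurons -> 1 output neuron.
# Every concrete choice of weights is one hypothesis of this fixed architecture.
rng = np.random.default_rng(0)
weights = [rng.normal(size=(2, 3)), rng.normal(size=(1, 2))]
biases = [rng.normal(size=2), rng.normal(size=1)]
print(forward(np.array([1.0, -1.0, 1.0]), weights, biases))
```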

Expressive Power - Boolean Functions

Claim

For every $n$ there is a network of depth 2 such that the hypothesis class $\mathcal{H}_{V,E,\mathrm{sign}}$ contains all functions from $\{\pm 1\}^n$ to $\{\pm 1\}$.

Proof: for every vector $u \in \{\pm 1\}^n$ for which $f$ outputs 1, we add a neuron in the hidden layer that checks if $x = u$. That is, it implements $g_u(x) = \mathrm{sign}(\langle x, u \rangle - n + 1)$, which equals 1 if and only if $x = u$. If any of these neurons is activated the network outputs a 1, otherwise $-1$: denoting by $u_1, \dots, u_k$ the vectors on which $f$ outputs 1, the output neuron can be written as $f(x) = \mathrm{sign}\left(\sum_{i=1}^{k} g_{u_i}(x) + k - 1\right)$.

The size of this network can be exponentially large in $n$, but it is capable of representing any Boolean function.
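The construction from the proof can be written down directly. The following sketch (illustrative, in Python/NumPy; the helper names are ours) builds one hidden neuron $g_u$ per vector with $f(u) = 1$ and the OR-like output neuron, and checks the resulting network on the parity function over three bits.

```python
import itertools
import numpy as np

def build_depth2(f, n):
    """Return a predictor realizing f: {-1,+1}^n -> {-1,+1} with a depth-2 sign network."""
    ones = [np.array(u) for u in itertools.product([-1.0, 1.0], repeat=n) if f(u) == 1]
    k = len(ones)

    def predict(x):
        x = np.asarray(x, dtype=float)
        # hidden layer: g_u(x) = sign(<x, u> - n + 1), which equals +1 iff x == u
        g = np.array([np.sign(u @ x - n + 1) for u in ones]) if k > 0 else np.array([-1.0])
        # output layer: sign(sum_i g_i + k - 1) implements an OR over the hidden units
        return np.sign(g.sum() + max(k, 1) - 1)

    return predict

# Example: parity over 3 bits, checked against the network on every input.
parity = lambda u: int(np.prod(u))
net = build_depth2(parity, 3)
assert all(net(u) == parity(u) for u in itertools.product([-1, 1], repeat=3))
```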

Theorem

For every $n$, let $s(n)$ be the minimal integer such that there exists a graph $(V, E)$ with $|V| = s(n)$ whose hypothesis class $\mathcal{H}_{V,E,\mathrm{sign}}$ contains all functions from $\{\pm 1\}^n$ to $\{\pm 1\}$. Then, $s(n)$ is exponential in $n$.

Proof: If the hypothesis class contains all such functions, it shatters the full set of input vectors $\{\pm 1\}^n$. Hence, its VC dimension is $2^n$. Shortly, we will show that the VC dimension of $\mathcal{H}_{V,E,\mathrm{sign}}$ is bounded by $O(|E| \log |E|)$, which (since $|E| \le |V|^2$) is polynomial in $|V|$. Then, we can state that $2^n$ is bounded by a polynomial in $|V|$, and therefore $|V| = s(n)$ must be exponential in $n$.

Theorem

Let $T : \mathbb{N} \to \mathbb{N}$ and, for every $n$, let $\mathcal{F}_n$ be the set of functions that can be implemented using a Turing machine with runtime bounded by $T(n)$. Then, there exist constants $b, c \in \mathbb{R}_+$ such that for every $n$ there is a graph $(V_n, E_n)$ of size at most $c \, T(n)^2 + b$ whose hypothesis class $\mathcal{H}_{V_n,E_n,\mathrm{sign}}$ contains $\mathcal{F}_n$.

The proof of this theorem uses the relationship between time complexity and circuit complexity: every layer of the neural network acts as one processing step, and each neuron is capable of implementing a NAND operation, so the network can simulate the Boolean circuit corresponding to the Turing machine's computation.
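As a small illustration of the last point, a single sign-neuron can compute NAND over $\{\pm 1\}$ inputs; the weights $(-1, -1)$ and bias $1.5$ below are one possible choice.

```python
import itertools
import numpy as np

def nand_neuron(x1, x2):
    # weights (-1, -1) and bias +1.5: the output is -1 only when both inputs are +1
    return np.sign(-x1 - x2 + 1.5)

for x1, x2 in itertools.product([-1, 1], repeat=2):
    expected = -1 if (x1 == 1 and x2 == 1) else 1
    assert nand_neuron(x1, x2) == expected
```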

Sample Complexity

Theorem

For a hypothesis class $\mathcal{H}_{V,E,\mathrm{sign}}$ with a single output neuron, its VC dimension is $O(|E| \log |E|)$.

Proof: For a subset $C$ of the domain of size $m$, the growth function is $\tau_{\mathcal{H}}(m) = \max_{|C| = m} |\mathcal{H}_C|$, where $\mathcal{H}_C$ is the restriction of the hypothesis class to functions from $C$ to $Y$, with $Y = \{\pm 1\}$ being the set of output values.

Each layer $t$ of the network is a mapping from $\mathbb{R}^{|V_{t-1}|}$ to $\{\pm 1\}^{|V_t|}$. The hypothesis class $\mathcal{H}$ can be written as a composition of hypothesis classes corresponding to each of the layers: $\mathcal{H} = \mathcal{H}^{(T)} \circ \dots \circ \mathcal{H}^{(1)}$. Each of these classes has a corresponding growth function, and the growth function of the composition is upper bounded by the product of the growth functions of its components: $\tau_{\mathcal{H}}(m) \le \prod_{t=1}^{T} \tau_{\mathcal{H}^{(t)}}(m)$.

Each layer $t$ contains $|V_t|$ neurons, and $\mathcal{H}^{(t,i)}$ is the set of functions the $i$-th neuron of layer $t$ can implement. Then $\mathcal{H}^{(t)} = \mathcal{H}^{(t,1)} \times \dots \times \mathcal{H}^{(t,|V_t|)}$. Since the growth function of a hypothesis class formed by a product of classes is upper bounded by the product of the growth functions of its factors: $\tau_{\mathcal{H}^{(t)}}(m) \le \prod_{i=1}^{|V_t|} \tau_{\mathcal{H}^{(t,i)}}(m)$.

Considering that the $i$-th neuron at layer $t$ has $d_{t,i}$ inputs, its class of functions is a class of halfspaces whose VC dimension is at most $d_{t,i} + 1$, so by Sauer's Lemma: $\tau_{\mathcal{H}^{(t,i)}}(m) \le \left(\frac{em}{d_{t,i}+1}\right)^{d_{t,i}+1} \le (em)^{d_{t,i}+1}$,

and considering that the total number of inputs over all neurons equals the number of edges, $\sum_{t,i} d_{t,i} = |E|$: $\tau_{\mathcal{H}}(m) \le \prod_{t,i} (em)^{d_{t,i}+1} = (em)^{\sum_{t,i}(d_{t,i}+1)} \le (em)^{|E| + |V|}$.

If we assume there are $m$ shattered points, then $\tau_{\mathcal{H}}(m) = 2^m$, and therefore $2^m \le (em)^{|E| + |V|}$,

which implies $m = O(|E| \log |E|)$ and proves our claim.
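As a rough numerical illustration (not part of the proof), one can compute, for a given network size, the largest $m$ that still satisfies $2^m \le (em)^{|E|+|V|}$ and compare it with $|E| \log |E|$; the choice $|V| = |E|$ below is arbitrary.

```python
# Rough numerical check (illustrative only): the largest m with 2^m <= (em)^(|E|+|V|)
# stays within a small factor of |E| * log|E|, in line with the bound above.
import math

def max_shatterable_bound(num_edges, num_vertices):
    """Largest m satisfying m*log(2) <= (|E|+|V|) * log(e*m)."""
    exponent = num_edges + num_vertices
    m = 1
    while m * math.log(2) <= exponent * math.log(math.e * m):
        m += 1
    return m - 1

for E in (10, 100, 1000, 10000):
    V = E  # arbitrary illustrative choice (for a connected network, |V| <= |E| + 1)
    m = max_shatterable_bound(E, V)
    print(E, m, round(m / (E * math.log(E)), 2))
```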

Learning Network Weights

Solving the problem of adjusting the input weights with ERM is NP-hard. Therefore, stochastic gradient descent (SGD) is often used. For one-layer networks its application is very simple. For deeper networks, however, calculating the gradient of the error with respect to the weights of neurons that are not in the last layer is not as straightforward. For this, backpropagation is used.
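To illustrate how simple the one-layer case is, here is a minimal SGD sketch for a single sigmoid neuron with squared error (not from the notes; the toy OR-like data, learning rate and number of epochs are arbitrary choices).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_single_neuron(samples, dim, lr=0.5, epochs=2000, seed=0):
    """SGD on a single sigmoid neuron, minimizing (1/2)(t - o)^2 per sample."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=dim)
    for _ in range(epochs):
        for x, t in samples:
            o = sigmoid(w @ x)
            w += lr * (t - o) * o * (1.0 - o) * x  # negative gradient of the squared error
    return w

# Toy usage: OR-like targets on two inputs plus a constant bias input.
data = [(np.array([x1, x2, 1.0]), float(x1 > 0 or x2 > 0))
        for x1 in (0.0, 1.0) for x2 in (0.0, 1.0)]
w = sgd_single_neuron(data, dim=3)
print([round(float(sigmoid(w @ x)), 2) for x, _ in data])
```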

The Backpropagation Algorithm

We define $a_{t,i}$ as the result of the weighted sum performed by the $i$-th neuron on layer $t$. Its output is the result of applying the transfer function $\sigma$ to this sum: $o_{t,i} = \sigma(a_{t,i})$. The input weights of the $i$-th neuron on layer $t$ form a vector $w_{t,i}$.

Calculating the gradient for a neuron $i$ in the last layer $T$ is simple. We assume that the error function is $E = \frac{1}{2} \sum_i (y_i - o_{T,i})^2$, where $y_i$ is the expected output. Then we need to adjust each weight in the vector $w_{T,i}$. For every weight $w_{T,i,j}$ (connecting neuron $j$ of layer $T-1$ to neuron $i$ of layer $T$), the chain rule gives: $\frac{\partial E}{\partial w_{T,i,j}} = \frac{\partial E}{\partial o_{T,i}} \frac{\partial o_{T,i}}{\partial a_{T,i}} \frac{\partial a_{T,i}}{\partial w_{T,i,j}} = -(y_i - o_{T,i}) \, \sigma'(a_{T,i}) \, o_{T-1,j}$.

In hidden layers, the weight of an input can affect the error through several of the next layer’s neurons, so the error gradient must take that into account. This is done using the concept of the error signal $\delta_{t,i}$. The error signal provided by the last layer corresponds (up to sign) to the first factor of the product shown previously, i.e. simply the difference between the expected result and the one obtained: $\delta_{T,i} = y_i - o_{T,i}$. For an intermediate layer, the error signal is a weighted sum of the next layer’s error signals, weighted by the connections from the neuron to the ones in the next layer and by the derivatives of their activations: $\delta_{t,i} = \sum_j w_{t+1,j,i} \, \sigma'(a_{t+1,j}) \, \delta_{t+1,j}$.

Then we can insert this into the chain rule and obtain the gradient for the hidden-layer weights: $\frac{\partial E}{\partial w_{t,i,j}} = -\delta_{t,i} \, \sigma'(a_{t,i}) \, o_{t-1,j}$.
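Putting the forward pass, the error signals and the weight updates together, the following is a minimal sketch for a network with one hidden layer and sigmoid activations; the 3-4-1 architecture, the learning rate and the toy input/target pair are illustrative assumptions, and biases are omitted for brevity.

```python
# A minimal sketch of the backpropagation equations above, for one hidden layer,
# sigmoid activations, and the squared error E = (1/2) * sum (y - o)^2.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sigmoid_prime(a):
    s = sigmoid(a)
    return s * (1.0 - s)

def backprop_step(x, y, W1, W2, lr=0.5):
    """One gradient-descent step on the pair (x, y); returns the updated weight matrices."""
    # forward pass: a = weighted sum, o = sigma(a)
    a1 = W1 @ x
    o1 = sigmoid(a1)
    a2 = W2 @ o1
    o2 = sigmoid(a2)

    # error signal of the last layer: delta_T = y - o_T
    delta2 = y - o2
    # error signal of the hidden layer: weighted sum of the next layer's signals,
    # weighted by the connecting weights and the next layer's sigma'(a)
    delta1 = W2.T @ (delta2 * sigmoid_prime(a2))

    # gradient step: -dE/dw_{t,i,j} = delta_{t,i} * sigma'(a_{t,i}) * o_{t-1,j}
    W2 = W2 + lr * np.outer(delta2 * sigmoid_prime(a2), o1)
    W1 = W1 + lr * np.outer(delta1 * sigmoid_prime(a1), x)
    return W1, W2

# Toy usage: fit a single input/target pair with a 3-4-1 network.
rng = np.random.default_rng(0)
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))
x, y = np.array([0.5, -0.2, 0.1]), np.array([0.8])
for _ in range(2000):
    W1, W2 = backprop_step(x, y, W1, W2)
print(sigmoid(W2 @ sigmoid(W1 @ x)))  # should be close to the target 0.8
```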

Questions

What is the benefit of implementing a conjunctive normal formula (CNF) or disjunctive normal formula (DNF) using a neural network?

Are feed-forward neural networks Turing-complete?

How is the growth function in the proof of the VC-dimension defined?