General guidelines for Deep Neural Networks

Note: This covers suggestions from Geoff Hinton's talk given at UBC which was recorded May 30, 2013. It does not cover bleeding edge techniques.

Below are suggestions for training and constructing a DNN (Deep Neural Network) given by Geoff Hinton. Each point below is further expanded on. [1]

  • Have a Deep Network.
  • Pretrain if you do not have a lot of labelled training data. If you do, skip it.
  • Initialize the weights to sensible values.
  • Use rectified linear units.
  • Have many more parameters than training examples.
  • Use dropout to regularize it instead of L1 and L2 regularization.
  • Use a convolutional frontend (optional).

Have a Deep Network

[Figure: an example of a shallow network.]

This means having more than 2 hidden layers, which is what many would consider a deep network [2]. There is, of course, a practical limit to the number of layers you add: additional layers introduce difficulties such as poor local optima and a lack of data, while also taking longer to train and run.

Why?

Deep Neural Networks are able to compute much more complex features from the input. A deep network, since it is computing a non-linear transformation (with tanh, sigmoid, rectified linear, etc. units), has much greater representational power than a shallow network. This representational power increases with each layer added.
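
To make this concrete, here is a minimal NumPy sketch of how a stack of fully connected layers composes one nonlinear transformation on top of another. The layer sizes and function names are made up for illustration, not taken from the talk:

```python
import numpy as np

def relu(x):
    # Rectified linear activation: max(0, x) applied elementwise.
    return np.maximum(0.0, x)

def forward(x, weights, biases):
    """Forward pass through a stack of fully connected layers.

    Each layer applies an affine transform followed by a nonlinearity,
    so every added layer composes another nonlinear function on top of
    the previous representation.
    """
    h = x
    for W, b in zip(weights, biases):
        h = relu(h @ W + b)
    return h

# Example: a 3-hidden-layer network on a toy 10-dimensional input.
rng = np.random.default_rng(0)
sizes = [10, 64, 64, 64]
weights = [rng.normal(0, 0.01, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]
print(forward(rng.normal(size=(5, 10)), weights, biases).shape)  # (5, 64)
```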

Side note: There is an interesting paper showing theoretically that an ANN with a single hidden layer but many units (large breadth) can have representational power similar to that of a deeper network; however, there is currently no known way to train such a network well. See [3].

Pretraining aka Greedy layer-wise training (when and where)

This is essentially what Hinton suggested: if you do not have many labelled training examples, perform greedy layer-wise pretraining. If you have many labelled samples, just train the full network stack as normal.
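
As a rough illustration of the greedy layer-wise idea, here is a simplified NumPy sketch that pretrains each layer as a tied-weight autoencoder on the previous layer's activations. This is an assumption-laden toy (real pretraining would typically use RBMs or denoising autoencoders and many more update steps), not the exact procedure from the talk:

```python
import numpy as np

def pretrain_layer(data, n_hidden, n_steps=100, lr=0.01, seed=0):
    """Greedily pretrain one layer as a tied-weight linear-decoder autoencoder.

    Simplified sketch: the layer is trained to reconstruct its own input
    with squared error, without using any labels.
    """
    rng = np.random.default_rng(seed)
    n_visible = data.shape[1]
    W = rng.normal(0, 0.01, (n_visible, n_hidden))
    for _ in range(n_steps):
        h = np.maximum(0.0, data @ W)          # encode with a ReLU
        recon = h @ W.T                        # decode with tied weights
        err = recon - data                     # reconstruction error
        # Gradient of 0.5 * ||recon - data||^2 w.r.t. W (ReLU mask included).
        grad = data.T @ ((err @ W) * (h > 0)) + err.T @ h
        W -= lr * grad / len(data)
    return W

def greedy_pretrain(data, layer_sizes):
    """Pretrain a stack of layers one at a time, bottom up."""
    weights, h = [], data
    for n_hidden in layer_sizes:
        W = pretrain_layer(h, n_hidden)
        weights.append(W)
        h = np.maximum(0.0, h @ W)             # feed activations to the next layer
    return weights                             # use these to initialize the full net

# Toy usage: pretrain a 3-layer stack on random "unlabelled" data.
X = np.random.default_rng(1).normal(size=(200, 20))
stack = greedy_pretrain(X, [32, 16, 8])
print([W.shape for W in stack])                # [(20, 32), (32, 16), (16, 8)]
```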

Why?

Pretraining allows you to initialize the network parameters to values that give better prediction performance than a similar DNN that was not pretrained. This point becomes moot if you have a large amount of labelled data.

Side note: An interesting paper shows that unsupervised pretraining encourages sparseness in DNNs; see [5].

Initialize weights to sensible values

How?

You can set the weights to small random values. The distribution of these small random weights depends on the nonlinearity you are using in the network [4]; if you use rectified linear units, for example, small positive values are a sensible choice.
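
As a sketch, one common recipe in the spirit of [4] looks like the following in NumPy; the fan-in/fan-out scaling and the small positive bias are my choices here, not prescriptions from the talk:

```python
import numpy as np

def init_layer(n_in, n_out, rng, positive_bias=0.1):
    """Initialize one layer's parameters to sensible small values.

    Weights: small random values whose scale shrinks with fan-in/fan-out,
    in the spirit of [4], so early activations and gradients stay
    well-behaved. Biases: a small positive constant (an assumption here)
    so that ReLU units start out active rather than stuck at zero.
    """
    scale = np.sqrt(2.0 / (n_in + n_out))      # keeps activation variance roughly constant
    W = rng.normal(0.0, scale, (n_in, n_out))
    b = np.full(n_out, positive_bias)
    return W, b

rng = np.random.default_rng(42)
layer_sizes = [784, 512, 512, 10]              # e.g. an MNIST-sized network
params = [init_layer(m, n, rng) for m, n in zip(layer_sizes[:-1], layer_sizes[1:])]
print([W.shape for W, _ in params])            # [(784, 512), (512, 512), (512, 10)]
```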

Why?

It stops the backpropagated gradients from vanishing or exploding, a problem that arises in deep networks when backpropagation is applied.

Rectified Linear Units

[Figure: graph of the rectified linear activation function.]

What?

$f(x) = \max(0, x)$, where $\max(0, x)$ evaluates to $x$ if $x > 0$ and to $0$ otherwise.

Rectified linear units are a drop-in replacement for the traditional nonlinear activation functions. ReLUs (Rectified Linear Units) output values in the range [0, ∞).

Why?

It makes calculating the gradient during backpropagation trivial: the derivative is 0 where x < 0 and 1 where x > 0. This speeds up the training of the network.
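
A minimal NumPy sketch of the forward value and its gradient:

```python
import numpy as np

def relu(x):
    # Forward pass: f(x) = max(0, x).
    return np.maximum(0.0, x)

def relu_grad(x):
    # Backward pass: derivative is 0 where x < 0 and 1 where x > 0
    # (we use 0 at x == 0 by convention).
    return (x > 0).astype(x.dtype)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0.  0.  0.  0.5 2. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```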

The paper mentioned above in the pretraining section [5] shows that if ReLUs are used, a DNN can be trained efficiently without any pretraining.

edit: As pointed out by andrewff on reddit, I missed a few other important points about ReLUs. See below:

ReLUs are more biologically plausible than the other activation functions, since they model a biological neuron's response in its area of operation, while the sigmoid and tanh activation functions are biologically implausible: a sigmoid has a steady state of around $\frac{1}{2}$, so after initializing with small weights, sigmoid units fire at half their saturation potential [5].

[Figure: biologically motivated units. Left: firing of a neuron from biological data. Right: traditional activation functions. [5]]

Number of Parameters >> Number of Training examples

Ensure that the total number of parameters (a single weight in your network counts as one parameter) exceeds the number of training examples you have by a large amount. Hinton's suggestion is that, however much data you have, you should use a neural net large enough to over-fit, and at that point you should strongly regularize it. An example would be having 1,000 training examples and 1 million parameters.
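
A quick back-of-the-envelope check (the layer sizes and training-set size here are made up) shows how easily a modest network gets into this regime:

```python
# Count parameters of a fully connected network and compare to the
# number of training examples.
layer_sizes = [784, 1000, 1000, 10]
n_train = 1000

n_params = sum(m * n + n                      # weights plus biases per layer
               for m, n in zip(layer_sizes[:-1], layer_sizes[1:]))
print(n_params)                               # 1796010 parameters
print(n_params / n_train)                     # ~1796 parameters per training example
```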

The brain operates in this regime of having many more parameters than training cases. "Synapses are much cheaper than experiences!"

Dropout for regularization

What?

Dropout is the technique of randomly omitting hidden units within a hidden layer each time a new training example is fed through the network. You are essentially randomly subsampling the hidden layer, and the different resulting architectures all share weights.
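
Here is a minimal NumPy sketch of the idea, using the common "inverted dropout" formulation; scaling the surviving units at training time is my choice here, whereas the original formulation instead halves the weights at test time:

```python
import numpy as np

def dropout(h, drop_prob, rng, training=True):
    """Randomly omit hidden units for one training example (or minibatch).

    "Inverted" dropout: surviving activations are scaled up at training
    time so nothing needs rescaling at test time. A fresh mask is drawn
    for every example fed through the network.
    """
    if not training or drop_prob == 0.0:
        return h                               # at test time, use all units
    mask = rng.random(h.shape) >= drop_prob    # keep each unit with prob 1 - drop_prob
    return h * mask / (1.0 - drop_prob)

rng = np.random.default_rng(0)
h = np.ones((1, 10))                           # activations of one hidden layer
print(dropout(h, drop_prob=0.5, rng=rng))      # roughly half the units are zeroed
```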

Why?

This is (an approximation to) a form of model averaging, and it produces a very strongly regularized model. The reason it works so well is the weight sharing amongst all the different subsampled architectures: each sub-model only gets one training example, so instead of what usually happens in L1 and L2 regularization, where the weights are pulled towards zero, they are pulled towards sensible values, the values the other sub-models want. Neat.

Convolution (optional)

If your data contains any spatial structure, such as voice, images, or video, then use a convolutional frontend.

What?

Convolution is an operation applied between two signals whose result is a measure of similarity. This is used in image or voice recognition by comparing two 'signals' to each other: one signal is the image to be classified and the other is a learned filter that recognizes one particular feature, which in image recognition could be an edge (vertical, horizontal, etc.). This filter, in the case of images, slides around our original image, and at each position a convolution operation is taken. Each resulting single pixel represents how similar that sub-patch of pixels was to our filter, and the full grid of these responses is called a feature map.
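
Here is a minimal NumPy sketch of sliding a small filter over an image and recording a similarity score at each position (strictly speaking this computes a cross-correlation, which is what most deep learning libraries compute and still call convolution); the edge filter and toy image are made up for illustration:

```python
import numpy as np

def convolve2d_valid(image, kernel):
    """Slide a small filter over an image, producing one feature map
    ("valid" convolution, no padding)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)  # one "pixel" of the feature map
    return out

# A vertical-edge filter applied to a toy image with a bright right half.
image = np.zeros((6, 6))
image[:, 3:] = 1.0
vertical_edge = np.array([[-1.0, 1.0],
                          [-1.0, 1.0]])
print(convolve2d_valid(image, vertical_edge))   # strong response along the edge
```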

Why?

This spatial (or, for audio, temporal) structure is important in things like voice or image recognition. Imagine I said 'bat' out loud but swapped the 'b' and 't' around; now I have the word 'tab', which has a completely different meaning. The position of the data matters in tasks such as these and provides more information about what the data is. Convolutional frontends build feature maps of our data that encode this spatial structure.

[1] Geoff Hinton - Recent Developments in Deep Learning.

[2] Exploring Strategies for Training Deep Neural Networks.

[3] Scaling Learning Algorithms towards AI.

[4] Understanding the difficulty of training deep feedforward neural networks.

[5] Deep Sparse Rectifier Neural Networks.