09_Notes for Neural Networks in Practice
Carpe Tu Black Whistle

Backpropagation as an algorithm has a lot of details and can be a little tricky to implement.

Unrolling Parameters

With neural networks, we are working with sets of matrices: $\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, \dots$ and $D^{(1)}, D^{(2)}, D^{(3)}, \dots$

In order to use optimization functions such as “fminunc()”, we will want to “unroll” all the elements of these matrices and put them into one long vector:

thetaVector = [Theta1(:); Theta2(:); Theta3(:)];
deltaVector = [D1(:); D2(:); D3(:)];

If Theta1 is 10x11, Theta2 is 10x11, and Theta3 is 1x11, then we can get back our original matrices from the “unrolled” version as follows:

Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
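
A quick sanity check of this round trip (the matrix contents below are arbitrary random values, used only to show that unrolling and reshaping are inverses of each other):

Theta1 = rand(10,11); Theta2 = rand(10,11); Theta3 = rand(1,11);
thetaVector = [Theta1(:); Theta2(:); Theta3(:)];   % 231 x 1 column vector
T1 = reshape(thetaVector(1:110), 10, 11);          % recover the first matrix
isequal(T1, Theta1)                                 % ans = 1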


Gradient Checking

Use numerical computation to approximate the partial derivatives, in order to verify that backpropagation is implemented correctly.

We can approximate the derivative of our cost function with:

$$\dfrac{\partial}{\partial\Theta} J(\Theta) \approx \dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$

With multiple theta matrices, we can approximate the derivative with respect to $\Theta_j$ as follows:

$$\dfrac{\partial}{\partial\Theta_j} J(\Theta) \approx \dfrac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$$

A small value for $\epsilon$ (epsilon), such as $\epsilon = 10^{-4}$, guarantees that the math works out properly. If the value for $\epsilon$ is too small, we can end up with numerical problems.

Implementation

In MATLAB or Octave:

epsilon = 1e-4;
gradApprox = zeros(size(theta));            % numerical gradient, same shape as theta
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) = thetaPlus(i) + epsilon;    % perturb the i-th parameter upward
  thetaMinus = theta;
  thetaMinus(i) = thetaMinus(i) - epsilon;  % perturb the i-th parameter downward
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * epsilon);
end;

Check

Check that gradApprox ≈ deltaVector, the unrolled gradient computed by backpropagation.
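
A minimal sketch of this comparison, assuming deltaVector holds the unrolled backpropagation gradients and gradApprox was computed as above (the 1e-9 threshold on the relative difference is a common rule of thumb, not a fixed requirement):

gradApprox = gradApprox(:);    % make sure both gradients are column vectors
deltaVector = deltaVector(:);
diff = norm(gradApprox - deltaVector) / norm(gradApprox + deltaVector);
if diff < 1e-9
  disp('Backpropagation looks correct.');
else
  disp('Backpropagation may have a bug; inspect the gradients.');
end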

Once you have verified that backpropagation is correct, stop computing gradApprox and turn gradient checking off before training.

Computing gradApprox is much slower than the backpropagation algorithm.

Summary

  • Implement backpropagation to compute deltaVector (the unrolled gradients).
  • Implement numerical gradient checking to compute gradApprox.
  • Make sure the two give similar values.
  • Turn off gradient checking and use the backpropagation code for learning.

Random Initialization

Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly.

Instead, randomly initialize the weights for our $\Theta$ matrices using the following method.

Random initialization: symmetry breaking

Initialize each $\Theta^{(l)}_{ij}$ to a random value in $[-\epsilon, \epsilon]$.

Code Implementation

If Theta1 is 10x11, Theta2 is 10x11, and Theta3 is 1x11:

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;

This initializes each $\Theta^{(l)}_{ij}$ to a random value between $-\epsilon$ and $\epsilon$.

rand(x,y) is just a function in Octave that will initialize a matrix of random real numbers between 0 and 1.
(Note: the epsilon used above is unrelated to the epsilon from Gradient Checking)
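
One common heuristic is to scale the initialization range to the layer sizes; a sketch, where L_in is the number of units feeding into the layer and L_out the number of units it produces (the sizes below are illustrative):

L_in = 10; L_out = 10;                         % e.g. for a 10x11 Theta matrix
INIT_EPSILON = sqrt(6) / sqrt(L_in + L_out);   % keeps initial values in a reasonable range
Theta = rand(L_out, L_in + 1) * (2 * INIT_EPSILON) - INIT_EPSILON;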

Putting it Together

Architecture Setting

First, pick a network architecture: choose the layout of your neural network, including how many hidden units in each layer and how many layers in total you want to have. (A small example follows the list below.)

  • Number of input units = dimension of features
  • Number of output units = number of classes
  • Number of hidden units per layer = usually the more the better (must be balanced with the cost of computation, which increases with more hidden units)
  • Defaults: 1 hidden layer. If you have more than 1 hidden layer, then it is recommended that you have the same number of units in every hidden layer.
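
As a concrete sketch, the architecture fixes the dimensions of every Theta matrix (the sizes here are illustrative assumptions: 400 input features, one hidden layer of 25 units, and 10 output classes):

input_layer_size  = 400;   % e.g. 20x20 pixel images unrolled into 400 features
hidden_layer_size = 25;    % one hidden layer with 25 units
num_labels        = 10;    % 10 output classes
% Each Theta maps layer l to layer l+1 and has an extra column for the bias unit:
% size(Theta1) = 25 x 401, size(Theta2) = 10 x 26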

Training a Neural Network

  1. Randomly initialize the weights
  2. Implement forward propagation to get $h_\Theta(x^{(i)})$ for any $x^{(i)}$
  3. Implement the cost function
  4. Implement backpropagation to compute partial derivatives
  5. Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
  6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.

When we perform forward propagation and backpropagation, we loop over every training example:

for i = 1:m,
   % Perform forward propagation and backpropagation using example (x(i), y(i)).
   % (Get activations a(l) and delta terms d(l) for l = 2, ..., L.)
end;
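
A runnable sketch of that loop for a single hidden layer with sigmoid activations, assuming X is m x n, Y holds one-hot labels (m x K), and Theta1/Theta2 are already initialized (variable names are illustrative):

sigmoid = @(z) 1 ./ (1 + exp(-z));
m = size(X, 1);
Delta1 = zeros(size(Theta1));      % gradient accumulators
Delta2 = zeros(size(Theta2));
for i = 1:m,
  % Forward propagation for example i.
  a1 = [1; X(i, :)'];              % input activations plus bias unit
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];           % hidden activations plus bias unit
  a3 = sigmoid(Theta2 * a2);       % output activations h_Theta(x(i))
  % Backpropagation of the delta terms.
  d3 = a3 - Y(i, :)';              % output-layer delta
  d2 = (Theta2' * d3) .* [1; sigmoid(z2) .* (1 - sigmoid(z2))];
  d2 = d2(2:end);                  % drop the bias unit's delta
  % Accumulate gradient contributions.
  Delta1 = Delta1 + d2 * a1';
  Delta2 = Delta2 + d3 * a2';
end;
D1 = Delta1 / m;                   % unregularized partial derivatives
D2 = Delta2 / m;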

Ideally, you want $h_\Theta(x^{(i)}) \approx y^{(i)}$. This will minimize our cost function. However, keep in mind that $J(\Theta)$ is not convex, and thus we can end up in a local minimum instead.
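
As a sketch of step 6 with a built-in optimizer, assuming a hypothetical nnCostFunction(nn_params, ...) that reshapes the unrolled parameters internally and returns both the cost and the unrolled gradient (X, y, and lambda are assumed to be defined):

% Unroll the randomly initialized parameters into a single vector.
initial_nn_params = [Theta1(:); Theta2(:)];

% Hypothetical cost function wrapper: returns [J, grad] for the unrolled params.
costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                                   num_labels, X, y, lambda);

options = optimset('GradObj', 'on', 'MaxIter', 50);
[nn_params, cost] = fminunc(costFunction, initial_nn_params, options);

% Reshape the optimized vector back into the weight matrices.
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, input_layer_size + 1);
Theta2 = reshape(nn_params((1 + hidden_layer_size * (input_layer_size + 1)):end), ...
                 num_labels, hidden_layer_size + 1);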