Building a Neural Network using Numpy

In this project, we implement an L-layered deep neural network from scratch using Numpy and train it on the MNIST dataset, achieving roughly 88% accuracy on the training data and 87% on the test data. The MNIST dataset contains scanned images of handwritten digits, along with their correct classification labels (the digits 0 to 9).

Data Preparation

The MNIST dataset we use here is 'mnist.pkl.gz', which is divided into training, validation and test data. The following function load_data() unpacks the file and extracts these three sets.
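A minimal sketch of such a loader is given below, assuming the standard mnist.pkl.gz layout (three pickled tuples of images and labels, saved under Python 2):

import gzip
import pickle
import numpy as np

def load_data():
    # the file holds three (images, labels) tuples: training, validation, test
    # encoding='latin1' is needed because the file was pickled under Python 2
    with gzip.open('mnist.pkl.gz', 'rb') as f:
        training_data, validation_data, test_data = pickle.load(f, encoding='latin1')
    return training_data, validation_data, test_data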

Let's see how the data looks:

The target variable is converted to a one-hot matrix using the function one_hot.
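One possible implementation, assuming the labels arrive as a 1-D integer array and there are 10 classes, is:

def one_hot(labels, num_classes=10):
    # returns a (num_classes, m) matrix with a single 1 per column
    m = labels.shape[0]
    encoded = np.zeros((num_classes, m))
    encoded[labels, np.arange(m)] = 1
    return encoded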

The following function data_wrapper() converts the dataset into the desired shape and also converts the ground truth labels to a one-hot matrix.
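The exact return signature may differ, but a sketch along these lines produces inputs of shape (784, m) and one-hot labels of shape (10, m):

def data_wrapper():
    # transpose the images so that each column is one sample,
    # and one-hot encode the labels
    tr_d, va_d, te_d = load_data()
    train_x = tr_d[0].T            # shape (784, 50000)
    train_y = one_hot(tr_d[1])     # shape (10, 50000)
    val_x = va_d[0].T              # shape (784, 10000)
    val_y = one_hot(va_d[1])       # shape (10, 10000)
    test_x = te_d[0].T             # shape (784, 10000)
    test_y = one_hot(te_d[1])      # shape (10, 10000)
    return train_x, train_y, val_x, val_y, test_x, test_y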

We can see that data_wrapper has converted the training and validation data into numpy arrays of the desired shapes. Let's convert the actual labels into a dataframe to check that the one-hot conversion is correct.

Now let us visualise the dataset.
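One simple way to do this (assuming matplotlib is installed and train_x, train_y come from the data_wrapper sketch above) is to reshape a single column back into a 28x28 image:

import matplotlib.pyplot as plt

index = 0   # change this to look at other samples
plt.imshow(train_x[:, index].reshape(28, 28), cmap='gray')
plt.title('Label: ' + str(np.argmax(train_y[:, index])))
plt.show()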

Feedforward

sigmoid

This is one of the activation functions. It takes the cumulative input to the layer, the matrix Z, as input. Upon application of the sigmoid function, the output matrix H is calculated. Also, Z is stored as the variable sigmoid_memory since it will be used later in backpropagation. You use np.exp() here in the following way: the exponential gets applied to all the elements of Z.
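A minimal sketch of sigmoid along these lines:

def sigmoid(Z):
    # element-wise sigmoid; Z is cached for backpropagation
    H = 1 / (1 + np.exp(-Z))
    sigmoid_memory = Z
    return H, sigmoid_memory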

relu

This is one of the activation functions. It takes the cumulative input to the layer, the matrix Z, as input. Upon application of the relu function, the output matrix H is calculated. Also, Z is stored as relu_memory, which will be used later in backpropagation. You use np.maximum() here in the following way.
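A minimal sketch of relu:

def relu(Z):
    # element-wise ReLU; Z is cached for backpropagation
    H = np.maximum(0, Z)
    relu_memory = Z
    return H, relu_memory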

softmax

This is the activation of the last layer. It takes the cumulative input to the layer, the matrix Z, as input. Upon application of the softmax function, the output matrix H is calculated. Also, Z is stored as softmax_memory, which will be used later in backpropagation. You use np.exp() and np.sum() here in the following way: the exponential gets applied to all the elements of Z.
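A minimal sketch of softmax, assuming each column of Z corresponds to one sample so the normalising sum is taken along axis 0:

def softmax(Z):
    # column-wise softmax: exponentiate, then normalise each column
    exp_Z = np.exp(Z)
    H = exp_Z / np.sum(exp_Z, axis=0, keepdims=True)
    softmax_memory = Z
    return H, softmax_memory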

initialize_parameters

Let's now create a function initialize_parameters which initializes the weights and biases of the various layers. One way to initialize is to set all the parameters to 0. This is not considered a good strategy, as all the neurons would behave the same way and it would defeat the purpose of deep networks. Hence, we initialize the weights randomly to very small values, but not zeros. The biases are initialized to 0. Note that the initialize_parameters function initializes the parameters for all the layers in one for loop.

The input to this function is a list named dimensions. The length of the list is the number of layers in the network + 1 (the plus one is for the input layer; the rest are hidden and output layers). The first element of this list is the dimensionality, or length, of the input (784 for the MNIST dataset). The rest of the list contains the number of neurons in the corresponding (hidden and output) layers.

For example, dimensions = [784, 3, 7, 10] specifies a network for the MNIST dataset with two hidden layers (of 3 and 7 neurons) and a 10-dimensional softmax output.

Also, notice that the parameters are returned in a dictionary. This will make it easier to implement the feedforward through a layer and the backprop through a layer at once.
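A sketch under these conventions, where W of layer l has shape (dimensions[l], dimensions[l-1]) and b has shape (dimensions[l], 1); the 0.1 scaling of the random weights and the random seed are assumptions:

def initialize_parameters(dimensions):
    # dimensions: list of layer sizes, e.g. [784, 3, 7, 10]
    np.random.seed(2)
    parameters = {}
    L = len(dimensions)            # number of layers + 1
    for l in range(1, L):
        # small random weights, zero biases
        parameters['W' + str(l)] = np.random.randn(dimensions[l], dimensions[l-1]) * 0.1
        parameters['b' + str(l)] = np.zeros((dimensions[l], 1))
    return parameters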

layer_forward

The function layer_forward implements the forward propagation for a certain layer 'l'. It calculates the cumulative input Z to the layer and uses it to calculate the output H of the layer. It takes H_prev, W, b and the activation function as inputs, and stores linear_memory and activation_memory in the variable memory, which will be used later in backpropagation.


You first calculate Z using the forward propagation equation $$Z^{l} = W^{l}H^{l-1} + b^{l}$$ and store linear_memory (H_prev, W, b); you then calculate H and activation_memory (Z) by applying the appropriate activation function (sigmoid, relu or softmax) to Z.


Note that $$H^{l-1}$$ is referred to here as H_prev. You might want to use np.dot() to carry out the matrix multiplication.
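Putting this together, layer_forward might look like the following sketch (the activation is assumed to be passed in as a string):

def layer_forward(H_prev, W, b, activation='relu'):
    # cumulative input Z, then the chosen activation
    Z = np.dot(W, H_prev) + b
    linear_memory = (H_prev, W, b)

    if activation == 'sigmoid':
        H, activation_memory = sigmoid(Z)
    elif activation == 'softmax':
        H, activation_memory = softmax(Z)
    else:                          # 'relu'
        H, activation_memory = relu(Z)

    memory = (linear_memory, activation_memory)
    return H, memory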

You should get:
array([[1. , 1. , 1. , 1. , 1. ],
[0.99908895, 0.99330715, 0.99999969, 1. , 0.99987661],
[0.73105858, 0.5 , 0.99330715, 0.9999546 , 0.88079708]])

L_layer_forward

L_layer_forward performs one forward pass through the whole network for all the training samples (note that we are feeding all training examples in one single batch). Use the layer_forward function you have created above to perform the feedforward for layers 1 to 'L-1' in the for loop with the activation relu. The last layer, which has a different activation, softmax, is calculated outside the loop. Notice that the memory of each layer is appended to memories. These will be used in reverse order during backpropagation.
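A sketch, assuming the parameters dictionary holds 'W1' to 'WL' and 'b1' to 'bL':

def L_layer_forward(X, parameters):
    # relu for layers 1..L-1, softmax for the last layer;
    # the memory of every layer is collected for backpropagation
    memories = []
    H = X
    L = len(parameters) // 2       # number of layers

    for l in range(1, L):
        H_prev = H
        H, memory = layer_forward(H_prev, parameters['W' + str(l)],
                                  parameters['b' + str(l)], activation='relu')
        memories.append(memory)

    HL, memory = layer_forward(H, parameters['W' + str(L)],
                               parameters['b' + str(L)], activation='softmax')
    memories.append(memory)

    return HL, memories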

You should get:

(784, 10)
[[0.10106734 0.10045152 0.09927757 0.10216656 0.1 ]
[0.10567625 0.10230873 0.10170271 0.11250099 0.1 ]
[0.09824287 0.0992886 0.09967128 0.09609693 0.1 ]
[0.10028288 0.10013048 0.09998149 0.10046076 0.1 ]
[0.09883601 0.09953443 0.09931419 0.097355 0.1 ]
[0.10668575 0.10270912 0.10180736 0.11483609 0.1 ]
[0.09832513 0.09932275 0.09954792 0.09627089 0.1 ]
[0.09747092 0.09896735 0.0995387 0.09447277 0.1 ]
[0.09489069 0.09788255 0.09929998 0.08915178 0.1 ]
[0.09852217 0.09940447 0.09985881 0.09668824 0.1 ]]

Loss

compute_loss

The next step is to compute the loss function after every forward pass to keep checking whether it is decreasing with training.
compute_loss here calculates the cross-entropy loss. You may want to use np.log(), np.sum() and np.multiply() here. Do not forget that it is the average loss across all the data points in the batch. It takes the output of the last layer HL and the ground truth labels Y as inputs and returns the loss.
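A sketch of the averaged cross-entropy, assuming HL and Y both have shape (10, m):

def compute_loss(HL, Y):
    # average cross-entropy over the m samples in the batch
    m = Y.shape[1]
    loss = -(1 / m) * np.sum(np.multiply(Y, np.log(HL)))
    loss = np.squeeze(loss)        # make sure the loss is a plain scalar
    return loss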

You should get:

[[0.4359949 0.02592623 0.54966248 0.43532239 0.4203678 ]
[0.33033482 0.20464863 0.61927097 0.29965467 0.26682728]
[0.62113383 0.52914209 0.13457995 0.51357812 0.18443987]
[0.78533515 0.85397529 0.49423684 0.84656149 0.07964548]
[0.50524609 0.0652865 0.42812233 0.09653092 0.12715997]
[0.59674531 0.226012 0.10694568 0.22030621 0.34982629]
[0.46778748 0.20174323 0.64040673 0.48306984 0.50523672]
[0.38689265 0.79363745 0.58000418 0.1622986 0.70075235]
[0.96455108 0.50000836 0.88952006 0.34161365 0.56714413]
[0.42754596 0.43674726 0.77655918 0.53560417 0.95374223]]
[[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 1.]
[0. 0. 0. 0. 0.]
[1. 0. 1. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 1. 0. 0. 0.]
[0. 0. 0. 1. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]
[0. 0. 0. 0. 0.]]
0.8964600261334037

Backpropagation

Let's now get to the next step, backpropagation, starting with sigmoid_backward.

sigmoid_backward

You might remember that we created the sigmoid function, which calculated the activation for forward propagation. Now, we need the activation backward, which helps in calculating dZ from dH. Notice that it takes dH and sigmoid_memory as inputs. sigmoid_memory is the Z which we calculated during forward propagation. You use np.exp() here in the following way.
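Since the derivative of the sigmoid is sigmoid(Z) * (1 - sigmoid(Z)), a minimal sketch is:

def sigmoid_backward(dH, sigmoid_memory):
    # dZ = dH * sigmoid'(Z), element-wise
    Z = sigmoid_memory
    H = 1 / (1 + np.exp(-Z))
    dZ = dH * H * (1 - H)
    return dZ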

relu_backward

You might remember that we created the relu function, which calculated the activation for forward propagation. Now, we need the activation backward, which helps in calculating dZ from dH. Notice that it takes dH and relu_memory as inputs. relu_memory is the Z which we calculated during forward propagation.
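The ReLU gradient passes dH through wherever Z is positive and is zero elsewhere, so a sketch is:

def relu_backward(dH, relu_memory):
    Z = relu_memory
    dZ = np.array(dH, copy=True)   # work on a copy of dH
    dZ[Z <= 0] = 0                 # gradient is zero where the unit was inactive
    return dZ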

layer_backward

layer_backward is a complementary function to layer_forward. Just as layer_forward calculates H using W, H_prev and b, layer_backward uses dH to calculate dW, dH_prev and db. You have already studied the formulae in backpropagation. To calculate dZ, use the sigmoid_backward and relu_backward functions. You might need to use np.dot() and np.sum() for the rest. Remember to choose the axis correctly when computing db.
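A sketch following the standard backprop equations, with m the number of samples in the batch:

def layer_backward(dH, memory, activation='relu'):
    linear_memory, activation_memory = memory
    H_prev, W, b = linear_memory
    m = H_prev.shape[1]

    # dZ from dH via the activation's backward function
    if activation == 'sigmoid':
        dZ = sigmoid_backward(dH, activation_memory)
    else:                          # 'relu'
        dZ = relu_backward(dH, activation_memory)

    dW = (1 / m) * np.dot(dZ, H_prev.T)
    db = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dH_prev = np.dot(W.T, dZ)

    return dH_prev, dW, db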

You should get:
dH_prev is
[[5.6417525 0.66855959 6.86974666 5.46611139 4.92177244]
[2.17997451 0.12963116 2.74831239 2.17661196 2.10183901]]
dW is
[[1.67565336 1.56891359]
[1.39137819 1.4143854 ]
[1.3597389 1.43013369]]
db is
[[0.37345476]
[0.34414727]
[0.29074635]]

L_layer_backward

L_layer_backward performs backpropagation for the whole network. Recall that the backpropagation for the last layer, i.e. the softmax layer, is different from the rest, and hence it is handled outside the reversed for loop. You need to use the function layer_backward inside the loop with the activation function relu.
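For a softmax output trained with cross-entropy, dZ of the last layer simplifies to HL - Y, which is why that layer is handled separately. A sketch:

def L_layer_backward(HL, Y, memories):
    gradients = {}
    L = len(memories)              # number of layers
    m = HL.shape[1]

    # last (softmax) layer: dZ = HL - Y
    linear_memory, activation_memory = memories[L - 1]
    H_prev, W, b = linear_memory
    dZ = HL - Y
    gradients['dW' + str(L)] = (1 / m) * np.dot(dZ, H_prev.T)
    gradients['db' + str(L)] = (1 / m) * np.sum(dZ, axis=1, keepdims=True)
    dH_prev = np.dot(W.T, dZ)

    # remaining layers, in reverse, with relu
    for l in reversed(range(1, L)):
        dH_prev, dW, db = layer_backward(dH_prev, memories[l - 1], activation='relu')
        gradients['dW' + str(l)] = dW
        gradients['db' + str(l)] = db

    return gradients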

You should get:

dW3 is
[[ 0.02003701 0.0019043 0.01011729 0.0145757 0.00146444 0.00059863 0. ]
[ 0.02154547 0.00203519 0.01085648 0.01567075 0.00156469 0.00060533 0. ]
[-0.01718407 -0.00273711 -0.00499101 -0.00912135 -0.00207365 0.00059996 0. ]
[-0.01141498 -0.00158622 -0.00607049 -0.00924709 -0.00119619 0.00060381 0. ]
[ 0.01943173 0.0018421 0.00984543 0.01416368 0.00141676 0.00059682 0. ]
[ 0.01045447 0.00063974 0.00637621 0.00863306 0.00050118 0.00060441 0. ]
[-0.06338911 -0.00747251 -0.0242169 -0.03835708 -0.00581131 0.0006034 0. ]
[ 0.01911373 0.001805 0.00703101 0.0120636 0.00138836 -0.00140535 0. ]
[-0.01801603 0.0017357 -0.01489228 -0.02026076 0.00133528 0.00060264 0. ]
[ 0.0194218 0.00183381 0.00594427 0.01187949 0.00141043 -0.00340965 0. ]]
db3 is
[[ 0.10031756]
[ 0.00460183]
[-0.00142942]
[-0.0997827 ]
[ 0.09872663]
[ 0.00536378]
[-0.10124784]
[-0.00191121]
[-0.00359044]
[-0.00104818]]
dW2 is
[[ 4.94428956e-05 1.13215514e-02 5.44180380e-02]
[-4.81267081e-05 -2.96999448e-05 -1.81899582e-02]
[ 5.63424333e-05 4.77190073e-03 4.04810232e-02]
[ 1.49767478e-04 -1.89780927e-03 -7.91231369e-03]
[ 1.97866094e-04 1.22107085e-04 2.64140566e-02]
[ 0.00000000e+00 -3.75805770e-04 1.63906102e-05]
[ 0.00000000e+00 0.00000000e+00 0.00000000e+00]]
db2 is
[[ 0.013979 ]
[-0.01329383]
[ 0.01275707]
[-0.01052957]
[ 0.03179224]
[-0.00039877]
[ 0. ]]

Parameter Updates

Now that we have calculated the gradients, let's do the last step, which is updating the weights and biases.
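A sketch of a plain gradient descent update, assuming the dictionaries use the 'W1'/'dW1' naming introduced above:

def update_parameters(parameters, gradients, learning_rate):
    L = len(parameters) // 2       # number of layers
    for l in range(1, L + 1):
        # step each weight matrix and bias vector against its gradient
        parameters['W' + str(l)] -= learning_rate * gradients['dW' + str(l)]
        parameters['b' + str(l)] -= learning_rate * gradients['db' + str(l)]
    return parameters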

Having defined the bits and pieces of the feedforward and the backpropagation, let's now combine them all to form a model. The list dimensions specifies the number of neurons in each layer. For a neural network with 1 hidden layer of 45 neurons, you would specify the dimensions as follows:
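dimensions = [784, 45, 10]    # 784 inputs, one hidden layer of 45 neurons, 10 output classes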

Model

L_layer_model

This is a composite function which takes the training data X, the ground truth labels Y, the dimensions as stated above, the learning_rate, the number of iterations num_iterations, and a flag print_loss indicating whether you want the loss to be printed. You need to use the final functions we have written for the feedforward, computing the loss, backpropagation and updating the parameters.
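A sketch of how these pieces fit together (the default values and the print frequency are assumptions):

def L_layer_model(X, Y, dimensions, learning_rate=0.03, num_iterations=3000, print_loss=False):
    # full-batch training: forward pass, loss, backward pass, parameter update
    np.random.seed(2)
    parameters = initialize_parameters(dimensions)

    for i in range(num_iterations):
        HL, memories = L_layer_forward(X, parameters)
        loss = compute_loss(HL, Y)
        gradients = L_layer_backward(HL, Y, memories)
        parameters = update_parameters(parameters, gradients, learning_rate)

        if print_loss and i % 100 == 0:
            print('Loss after iteration %i: %f' % (i, loss))

    return parameters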

Since it'll take a lot of time to train the model on 50,000 data points, we take a subset of 5,000 images.

Now, let's call the function L_layer_model on the dataset we have created. This will take 10-20 minutes to run.

Let's see the accuracy we get on the training data.
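Accuracy can be computed with a small helper like the one sketched below (the name predict and its signature are assumptions); it takes the class with the highest softmax output in each column and compares it with the one-hot label:

def predict(X, Y, parameters):
    HL, _ = L_layer_forward(X, parameters)
    predictions = np.argmax(HL, axis=0)   # predicted digit per sample
    labels = np.argmax(Y, axis=0)         # true digit per sample
    accuracy = np.mean(predictions == labels)
    return accuracy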

We get ~88% accuracy on the training data. Let's see the accuracy on the test data.

It is ~87%. You can train the model even longer and get better results. You can also try changing the network structure.
Below, by changing the index, you can see which digits are incorrectly identified by the neural network.