Introduction To Deep Learning
Assignment 0
September 2023
Note: This assignment is not going to be graded. However, you do need to form groups of three people
and submit your work on Brightspace.
Your submission should consist of a report in PDF format (at most 3 pages) and the neatly organized code
(Jupyter notebooks) that you used in your experiments, so that we can reproduce your results. You can find
more information about report writing in the Brightspace → Assignments section. Do not compress the
files (.zip/.7z/.rar); submit every file individually.
This assignment is mainly intended to get you accustomed to the requirements of this course and to Brightspace.
If your report has any clear issues, we will provide feedback so that you are better prepared for the graded
assignments.
Introduction
The objective of this assignment is to develop and evaluate several algorithms for classifying images of
handwritten digits, and to design your own neural network from scratch for the XOR problem. You will
work with a simplified version of the famous MNIST data set: a collection of 2707 digits, each a 16x16 image
represented as a 256-dimensional vector. The data is split into a training set (1707 images) and a test set
(1000 images). These data sets are stored in 4 files: train_in.csv, train_out.csv, test_in.csv,
test_out.csv, where in and out refer to the input records (images) and the corresponding digits (class
labels), respectively. These files are stored in data.zip.
You can find more information about the original problem of handwritten digit recognition, more data
sets, and an overview of the accuracies of the best classifiers (about 99.7%!) at
http://yann.lecun.com/exdb/mnist/.
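As a starting point, here is a minimal loading sketch with NumPy; the data/ folder name, the comma delimiter, and the absence of header rows are assumptions about how you unpack data.zip, not something prescribed by the assignment:

    import numpy as np

    # Images: one 256-dimensional row per digit; labels: one integer per row.
    X_train = np.loadtxt("data/train_in.csv", delimiter=",")
    y_train = np.loadtxt("data/train_out.csv", delimiter=",").astype(int)
    X_test = np.loadtxt("data/test_in.csv", delimiter=",")
    y_test = np.loadtxt("data/test_out.csv", delimiter=",").astype(int)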
3. Use the class centers (means) computed in part 1 to implement a Nearest Mean classifier. Apply your
classifier to all points from the training set and calculate the percentage of correctly classified digits.
Do the same with the test set, using the centers that were calculated from the training set. (A minimal
sketch of this classifier is given after this list.)
4. A less naive distance-based approach is the KNN (K-Nearest-Neighbor) classifier (you can either implement
it yourself or use the one from the sklearn package). Repeat the same procedure as in part 3
using this method. Then, for both classifiers, generate a confusion matrix, which should provide deeper
insight into the classes that are difficult to separate. A confusion matrix is here a 10-by-10 matrix
(c_ij), where c_ij contains the percentage (or count) of digits i that are classified as j. Which digits are
most difficult to classify correctly? Again, for calculating and visualising confusion matrices you may
use the sklearn package. Describe your findings, and compare the performance of your classifiers on
the train and test sets. (A sketch of the KNN baseline follows the nearest mean sketch below.)
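A minimal sketch of the nearest mean classifier from part 3, assuming the NumPy arrays X_train, y_train, X_test, y_test from the loading sketch above:

    import numpy as np

    def fit_centers(X, y, n_classes=10):
        """Compute the mean image (center) of each digit class."""
        return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

    def nearest_mean_predict(X, centers):
        """Assign each image to the class whose center is closest in Euclidean distance."""
        # dists[i, c] = distance from image i to center c
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    # Usage (with the arrays from the loading sketch):
    # centers = fit_centers(X_train, y_train)
    # train_acc = (nearest_mean_predict(X_train, centers) == y_train).mean()
    # test_acc = (nearest_mean_predict(X_test, centers) == y_test).mean()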
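Similarly, a sketch of the KNN baseline and confusion matrix from part 4, using scikit-learn with the same (assumed) arrays; k = 5 is only an illustrative choice:

    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

    # Fit KNN on the training images (k is a hyperparameter you may want to vary).
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)

    # Rows correspond to the true digit, columns to the predicted digit.
    pred = knn.predict(X_test)
    cm = confusion_matrix(y_test, pred)
    ConfusionMatrixDisplay(cm).plot()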
Task 2: Implement a Multi-Class Perceptron Training Algorithm
Implement (from scratch) a multi-class perceptron training algorithm ("perceptron learning rule" from slide
34 of the second lecture) and use it to train a single-layer perceptron with 10 nodes (one per digit), each node
having 256+1 inputs (the 256 pixel values plus a bias) and 1 output. Train your network on the train set and
evaluate it on both the train and the test set, in the same way as you did in the previous task. As your algorithm
is non-deterministic (results depend on how you initialize the weights), repeat your experiments a few times
to get a feeling for the reliability of your accuracy estimates.
Try to make your code efficient. In particular, try to limit the number of loops by using matrix multiplication
whenever possible. For example, append to your train and test data a column of ones that will represent
the bias. The weights of your network can then be stored in a matrix W of size 257x10, and the output of the
network on all inputs is just the dot product of the two matrices Train and W, where Train denotes the matrix
of all input vectors (one per row), augmented with 1's (biases). To find the output node with the strongest
activation, use the numpy argmax() function. An efficient implementation of your algorithm shouldn't take
more than a few seconds to converge on the training set (yes, the training set consists of patterns that are
linearly separable, so the perceptron algorithm will converge). A minimal sketch is given below.
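One possible implementation sketch, assuming the same NumPy arrays as before with integer labels; the initialization scale, maximum number of epochs, and stopping criterion are illustrative assumptions, not prescribed by the assignment:

    import numpy as np

    rng = np.random.default_rng()

    def train_perceptron(X, y, n_classes=10, max_epochs=100):
        """Multi-class perceptron: one weight column per digit, bias as an extra input."""
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append bias column -> shape (N, 257)
        W = rng.normal(scale=0.01, size=(Xb.shape[1], n_classes))  # random initialization
        for epoch in range(max_epochs):
            errors = 0
            for xi, yi in zip(Xb, y):
                pred = np.argmax(xi @ W)                # strongest activation wins
                if pred != yi:
                    W[:, yi] += xi                      # push the correct class up
                    W[:, pred] -= xi                    # push the wrongly predicted class down
                    errors += 1
            if errors == 0:                             # reached because the data are linearly separable
                break
        return W

    def perceptron_predict(X, W):
        Xb = np.hstack([X, np.ones((X.shape[0], 1))])
        return np.argmax(Xb @ W, axis=1)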
How does the accuracy of this single-layer multi-class perceptron compare to that of the distance-based
methods in Task 1?
Task 3: Implement the XOR network and the Gradient Descent Algorithm
This is probably the last time in your life that you are asked to implement a neural network from scratch –
therefore, have fun! Proceed as follows:
1. Implement the function xor_net(inputs, weights) that simulates a network with two inputs, two hidden
nodes and one output node. The vector weights denotes the 9 weights (tunable parameters): each non-input
node has three incoming weights, one from the bias node (which has constant value 1) and two
from the nodes of the preceding layer (the two input nodes for a hidden node, the two hidden nodes
for the output node). Assume that all non-input nodes use the sigmoid activation function.
2. Implement the error function mse(weights), which returns the mean squared error made by your network on the 4
possible input vectors (0, 0), (0, 1), (1, 0), (1, 1) and the corresponding targets 0, 1, 1, 0.
3. Implement the gradient of the mse(weights) function, grdmse(weights). Note that the vector of values
returned by grdmse(weights) should have the same length as the input vector weights: it should be the
vector of partial derivatives of the mse function with respect to each element of the weights vector.
4. Finally, implement the gradient descent algorithm:
(a) Initialize weights to some random values,
(b) Iterate: weights = weights − η · grdmse(weights),
where η is a small positive constant (called the "step size" or "learning rate"). A combined sketch of
these four steps is given after this list.
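A combined sketch of the four steps above, as one possible NumPy implementation; the gradient is approximated here with finite differences as a stand-in (deriving the analytic gradient is part of the exercise), and the learning rate and number of iterations are arbitrary example values:

    import numpy as np

    rng = np.random.default_rng()

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def xor_net(inputs, weights):
        """2-2-1 network; weights has 9 entries: 3 per hidden node, 3 for the output node."""
        x = np.append(inputs, 1.0)                  # two inputs plus the bias
        h1 = sigmoid(x @ weights[0:3])
        h2 = sigmoid(x @ weights[3:6])
        h = np.array([h1, h2, 1.0])                 # hidden activations plus the bias
        return sigmoid(h @ weights[6:9])

    XOR_INPUTS = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
    XOR_TARGETS = np.array([0, 1, 1, 0])

    def mse(weights):
        preds = np.array([xor_net(x, weights) for x in XOR_INPUTS])
        return np.mean((preds - XOR_TARGETS) ** 2)

    def grdmse(weights, eps=1e-6):
        """Finite-difference approximation of the gradient (the analytic version is the exercise)."""
        grad = np.zeros_like(weights)
        for i in range(len(weights)):
            w_plus, w_minus = weights.copy(), weights.copy()
            w_plus[i] += eps
            w_minus[i] -= eps
            grad[i] = (mse(w_plus) - mse(w_minus)) / (2 * eps)
        return grad

    # Gradient descent loop (eta and the number of iterations are arbitrary examples).
    weights = rng.normal(size=9)
    eta = 1.0
    for step in range(10000):
        weights = weights - eta * grdmse(weights)

    misclassified = sum((xor_net(x, weights) > 0.5) != t for x, t in zip(XOR_INPUTS, XOR_TARGETS))
    print(mse(weights), misclassified)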
Use your program to train the network on the XOR data. During training, monitor two values: the
MSE obtained by your network on the training set, and the number of misclassified inputs. (The network
returns a value between 0 and 1; we may agree that values bigger than 0.5 are interpreted as "1", otherwise
as "0".) Run your program several times, using various initialization strategies and values of the learning
rate. Additionally, try the "lazy approach": just keep generating random weights for the network, testing whether it
computes the XOR function, and stop as soon as you have found such weights. To get an idea of how many
sets of weights have to be tried before finding a good one, repeat this experiment several times. Describe
your work and findings in the report. (A small sketch of the lazy approach is given below.)
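A small sketch of the lazy approach, reusing xor_net, XOR_INPUTS, XOR_TARGETS and rng from the sketch above; the sampling range for the random weights is an arbitrary assumption:

    def computes_xor(weights):
        """Check whether the network classifies all four XOR patterns correctly (threshold 0.5)."""
        return all((xor_net(x, weights) > 0.5) == bool(t) for x, t in zip(XOR_INPUTS, XOR_TARGETS))

    tries = 0
    while True:
        tries += 1
        candidate = rng.uniform(-10, 10, size=9)    # arbitrary sampling range
        if computes_xor(candidate):
            break
    print(f"found a working set of weights after {tries} tries")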
You may also experiment with alternative activation functions, e.g., the hyperbolic tangent (tanh) or a linear
rectifier, relu(x) = max(0, x). How do they affect the training process of your network, and how would you
explain these differences?
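If you try these, the alternative activations are one-liners in NumPy, intended as drop-in replacements for sigmoid inside xor_net:

    import numpy as np

    def tanh(z):
        """Hyperbolic tangent activation; note its output range is (-1, 1)."""
        return np.tanh(z)

    def relu(z):
        """Linear rectifier: max(0, z), applied element-wise."""
        return np.maximum(0.0, z)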