UDACITY SDCE Nanodegree: Term 1, Project 2: Traffic Sign Classifier
The purpose of this project was to use deep neural networks, specifically convolutional neural networks (ConvNets), to classify traffic signs. It is implemented in TensorFlow in a Python notebook environment.
The traffic sign classifier is trained on the German Traffic Sign Dataset. There are a total of 39209 training samples and 12630 testing samples from 43 classes. The images are of size 32x32x3. Examples of each class are visualized below:
The image samples are distributed per class as shown below:
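The per-class counts behind a distribution plot like this can be computed directly from the label array. This is a minimal sketch, assuming the labels are integers from 0 to 42; the toy array `y_toy` stands in for the real training labels and the plotting step is omitted.

```python
import numpy as np

N_CLASSES = 43  # number of traffic sign classes in the dataset


def class_distribution(labels, n_classes=N_CLASSES):
    """Count samples per class; classes with no samples get a count of 0."""
    return np.bincount(np.asarray(labels), minlength=n_classes)


# Toy labels standing in for the real training labels (hypothetical data)
y_toy = np.array([0, 0, 1, 2, 2, 2])
counts = class_distribution(y_toy)
print(counts[:3])    # [2 1 3]
print(counts.sum())  # 6
```

On the real labels, `counts.max() / counts.min()` gives a quick summary of how unbalanced the classes are.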
In this project we were asked to preprocess the images. The main preprocessing technique I used was converting the images to grayscale. I chose this approach because many signs share similar color patterns, so color offers little advantage except in a few cases. Furthermore, Sermanet and LeCun found that grayscale input produced better results. In my experiments I also tried shifting the data to zero mean by subtracting the per-dimension mean of the training set from all the data. Even though this approach is recommended in theory, I found that it lowered the test accuracy by 3–4%. Hence, I kept grayscale conversion as the only preprocessing step. I did not use any data augmentation and ended up with a test accuracy of 95%. Augmenting the dataset would very likely improve the results further, as also shown by Sermanet and LeCun.
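Both preprocessing options can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the notebook's code: the grayscale conversion uses the ITU-R BT.601 luma weights (the original may have used OpenCV or a plain channel average), and `to_grayscale` and `zero_center` are hypothetical helper names.

```python
import numpy as np

# ITU-R BT.601 luma weights (assumed conversion; other weightings are possible)
LUMA_WEIGHTS = np.array([0.299, 0.587, 0.114], dtype=np.float32)


def to_grayscale(images):
    """Convert RGB images of shape (N, 32, 32, 3) to grayscale (N, 32, 32, 1)."""
    gray = images.astype(np.float32) @ LUMA_WEIGHTS  # weighted sum over the channel axis
    return gray[..., np.newaxis]  # keep a channel axis so the ConvNet input stays 4-D


def zero_center(train, *others):
    """Subtract the per-pixel mean of the training set from every array.

    This is the zero-mean variant that was tried but not kept.
    """
    mean = train.mean(axis=0)  # computed on the training set only
    return [train - mean] + [x - mean for x in others]


rgb = np.random.randint(0, 256, size=(4, 32, 32, 3), dtype=np.uint8)
gray = to_grayscale(rgb)
print(gray.shape)  # (4, 32, 32, 1)
```

Computing the mean on the training set only, and reusing it for validation and test data, keeps information from the held-out sets out of the preprocessing.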
Initially I held out 30% of the training data for validation. I then reduced this to 20% to increase the size of the training set. Rather than randomly selecting 20% of all the training data, I selected 20% of the samples of each class, so that the validation set is representative and covers every class.
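A per-class (stratified) hold-out split can be sketched as follows. It assumes features `X` and integer labels `y` stored as NumPy arrays; how the samples were picked within each class is not stated, so the random shuffle inside each class is an assumption. If scikit-learn is available, `train_test_split(X, y, test_size=0.2, stratify=y)` gives the same kind of split.

```python
import numpy as np


def stratified_split(X, y, val_fraction=0.2, seed=0):
    """Hold out `val_fraction` of the samples of each class for validation."""
    rng = np.random.default_rng(seed)
    train_idx, val_idx = [], []
    for c in np.unique(y):
        idx = np.flatnonzero(y == c)  # indices of all samples of class c
        rng.shuffle(idx)              # random order within the class (assumption)
        n_val = int(round(val_fraction * len(idx)))
        val_idx.append(idx[:n_val])
        train_idx.append(idx[n_val:])
    train_idx = np.concatenate(train_idx)
    val_idx = np.concatenate(val_idx)
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]


# Toy example: 3 classes with 10 samples each, so 2 validation samples per class
y = np.repeat(np.arange(3), 10)
X = np.arange(30)
X_tr, y_tr, X_val, y_val = stratified_split(X, y)
print(len(y_tr), len(y_val), np.bincount(y_val))  # 24 6 [2 2 2]
```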
I tested three architectures: LeNet from the lab (the lenet function) and two deeper networks (the signRecNet and signRecNet_deeper functions). The first deeper network, signRecNet, has 4 convolutional layers, two on each side of a pooling layer (two with 5x5 filters and two with 3x3 filters). The second, signRecNet_deeper, which I ended up using, has 5 convolutional layers, all with 3x3 filters, and two pooling layers; this reduces the resolution of the input image more gradually and collects features at more levels of the hierarchy. All architectures had the same number of neurons in the fully connected layers.
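The effect of the filter size on the resolution can be checked with a little arithmetic. The sketch below traces the feature-map size through a layer stack, assuming stride-1 "valid" (unpadded) convolutions and 2x2 max pooling with stride 2, as in the LeNet lab. The exact layer order of signRecNet_deeper is not given here, so `DEEPER` is a hypothetical arrangement of five 3x3 convolutions and two pooling layers with 8, 16 and 32 filters.

```python
def trace_shapes(layers, size=32, channels=1):
    """Return (layer type, height/width, channels) after each layer.

    layers: list of ("conv", kernel, filters) or ("pool",) tuples.
    Assumes stride-1 'valid' convolutions and 2x2/stride-2 pooling.
    """
    shapes = []
    for layer in layers:
        if layer[0] == "conv":
            _, k, filters = layer
            size, channels = size - k + 1, filters  # a valid conv shrinks by k - 1
        else:
            size //= 2                              # 2x2 pooling halves the size
        shapes.append((layer[0], size, channels))
    return shapes


LENET = [("conv", 5, 6), ("pool",), ("conv", 5, 16), ("pool",)]
# Hypothetical layout of a five-layer 3x3 network with two pooling layers
DEEPER = [("conv", 3, 8), ("conv", 3, 8), ("pool",),
          ("conv", 3, 16), ("conv", 3, 16), ("pool",), ("conv", 3, 32)]

print([s for _, s, _ in trace_shapes(LENET)])   # [28, 14, 10, 5]
print([s for _, s, _ in trace_shapes(DEEPER)])  # [30, 28, 14, 12, 10, 5, 3]
```

Each 3x3 layer removes only two pixels, so the resolution shrinks in smaller steps and more feature layers are computed before the map becomes small.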
The final architecture I used is as follows:
The initial training experiments followed a grid search approach, starting from a learning rate of 0.001 and a batch size of 256. These values were chosen because they are typical for training ConvNets for image recognition. The larger batch size (compared with 64 in the LeNet lab) leads to less noisy weight updates, since each gradient is averaged over more samples. In most runs I trained for 50 epochs; in the final rounds I increased this to 100 epochs to see whether any improvement could be made. The model objective appears to saturate fairly quickly, so the additional epochs did not help. This may be expected with the larger batch size. Finally, I used the AdamOptimizer, as in the LeNet lab, since it generally performs better than standard SGD.
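To show how the learning rate, batch size, epoch count and Adam update fit together, here is a NumPy sketch of mini-batch Adam on a toy linear least-squares problem. It is not the ConvNet training code: the model and data are hypothetical stand-ins, and the update is the textbook Adam rule with the default hyperparameters of TensorFlow's `tf.train.AdamOptimizer` (beta1=0.9, beta2=0.999, epsilon=1e-8); TensorFlow's implementation applies epsilon slightly differently.

```python
import numpy as np


def adam_minibatch_fit(X, y, lr=0.001, batch_size=256, epochs=50,
                       beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    """Fit linear weights w minimizing mean((X @ w - y) ** 2) with mini-batch Adam."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    m = np.zeros_like(w)  # running mean of the gradient (first moment)
    v = np.zeros_like(w)  # running mean of the squared gradient (second moment)
    t = 0
    losses = []
    for _ in range(epochs):
        order = rng.permutation(len(X))  # reshuffle the samples every epoch
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w - y[idx]
            grad = 2 * X[idx].T @ err / len(idx)  # gradient of the batch MSE
            t += 1
            m = beta1 * m + (1 - beta1) * grad
            v = beta2 * v + (1 - beta2) * grad ** 2
            m_hat = m / (1 - beta1 ** t)          # bias-corrected moments
            v_hat = v / (1 - beta2 ** t)
            w -= lr * m_hat / (np.sqrt(v_hat) + eps)
        losses.append(np.mean((X @ w - y) ** 2))  # full-data loss after each epoch
    return w, losses


# Toy regression problem (hypothetical data, not traffic signs)
rng = np.random.default_rng(1)
X = rng.normal(size=(2048, 3))
y = X @ np.array([0.2, -0.1, 0.05])
w, losses = adam_minibatch_fit(X, y)
print(len(losses), losses[0], losses[-1])
```

With 2048 samples and a batch size of 256 there are only 8 updates per epoch; a larger batch averages the gradient over more samples but also makes fewer updates per pass over the data.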
Overall, tuning the ConvNet was a process of trial and error. I started with the LeNet architecture and varied the number of filters per layer, performing a grid search over 8, 16 and 32 filters. I noticed that as the number of filters increased, the validation and test accuracies decreased, which may indicate overfitting. With grayscale preprocessing, LeNet reached an accuracy of around 88–92%.
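A grid over 8, 16 and 32 filters for LeNet's two convolutional layers gives nine configurations. The sketch below enumerates them and counts trainable parameters, one rough proxy for the capacity (and hence the overfitting risk) that grows with the number of filters. It assumes the grid covers both conv layers independently and the LeNet lab layout: 32x32 grayscale input, 5x5 valid convolutions, 2x2 pooling, dense layers of 120 and 84 units and 43 outputs. Training each configuration is left out.

```python
from itertools import product

FILTER_OPTIONS = (8, 16, 32)  # values searched for each conv layer (assumed grid)


def lenet_param_count(f1, f2, in_channels=1, k=5, fc1=120, fc2=84, n_classes=43):
    """Trainable parameters of a LeNet-style net with f1 and f2 conv filters."""
    conv1 = k * k * in_channels * f1 + f1  # weights + biases
    conv2 = k * k * f1 * f2 + f2
    flat = 5 * 5 * f2                      # 32 -> 28 -> 14 -> 10 -> 5 spatial size
    dense = (flat * fc1 + fc1) + (fc1 * fc2 + fc2) + (fc2 * n_classes + n_classes)
    return conv1 + conv2 + dense


for f1, f2 in product(FILTER_OPTIONS, repeat=2):
    print(f"conv1={f1:2d} conv2={f2:2d} params={lenet_param_count(f1, f2):,}")
```

As a sanity check, the classic LeNet-5 layout (6 and 16 filters, 10 outputs) gives 61,706 parameters. Most of the growth comes from the first dense layer, whose input size is proportional to the number of filters in the last conv layer.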
For further experiments I tried different architectures.
1) First I used the ConvNet defined in the signRecNet function, which consists of four convolutional layers. The first two convolutional layers have 5x5 filters and the last two have 3x3 filters. This produced an accuracy of 93–94% on the test set.
2) Then I made the network deeper (the architecture defined in the signRecNet_deeper function). To decrease the image resolution more gradually, I changed all the convolutional filter sizes to 3x3. This gave an additional performance increase of 1–2%.
In both cases the number of convolution filters progressively increases from 8 to 16 to 32. I also used dropout layers, the first with a 50% keep rate between … In all the experiments I used the same number of neurons in the fully connected layers (except the final output layer) as in the LeNet lab. A fully connected stage better suited to this problem could also be explored, but I did not attempt this for this project.
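In TensorFlow 1.x, `tf.nn.dropout(x, keep_prob)` zeroes each activation with probability 1 - keep_prob during training and scales the surviving ones by 1/keep_prob, so their expected value is unchanged; at evaluation time keep_prob is set to 1. The NumPy sketch below mimics that behaviour for a 50% keep rate; it is an illustration, and the 120-unit layer size is only an example.

```python
import numpy as np


def dropout(x, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob, scale by 1/keep_prob."""
    if keep_prob >= 1.0:
        return x  # evaluation mode: no units are dropped
    mask = rng.random(x.shape) < keep_prob
    return np.where(mask, x / keep_prob, 0.0)


rng = np.random.default_rng(0)
activations = np.ones((1000, 120))          # e.g. output of a 120-unit dense layer
train_out = dropout(activations, 0.5, rng)  # roughly half of the units are zeroed
print(train_out.mean())                     # close to 1.0 thanks to the 1/keep_prob scaling
print(np.array_equal(dropout(activations, 1.0, rng), activations))  # True
```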
Testing on New Images
The images were chosen randomly from the web and rescaled to 32x32; their size and orientation were also chosen randomly. The first image is a drawn sign rather than a photo of a real scene; however, its features are clearly visible, so a well-trained network should still be able to predict its class correctly. The third image, although at first glance similar to those in the dataset, has a different scale. The final image shows a sign that is not in the dataset at all, so it is interesting to see what the network predicts. The other two images look similar to those in the dataset.
The model achieves an accuracy of 60% on the 5 new images, i.e. it predicts the correct class for three of them. This is considerably lower than the accuracy on the validation (94%) and test (93%) sets, although with only 5 images each misclassification changes the accuracy by 20 percentage points. Some possible reasons for the errors are discussed below.
The model has a top-1 error of 40%. The top-5 error is also 40%, because for both misclassified images the correct class is not among the top 5 predictions.
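Top-1 and top-5 figures like these come from the softmax probabilities of each image. In TensorFlow this is usually done with `tf.nn.softmax` and `tf.nn.top_k`; the NumPy version below shows the same computation. The logits are hypothetical values chosen so that three of five images are correct and the true class of the other two lies outside the top 5, matching the 60% figures above; they are not the model's real outputs.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def top_k_accuracy(probs, labels, k):
    """Fraction of samples whose true label is among the k most probable classes."""
    top_k = np.argsort(probs, axis=1)[:, ::-1][:, :k]  # class indices, most probable first
    return float(np.mean([label in row for label, row in zip(labels, top_k)]))


# Hypothetical logits for 5 new images over 43 classes
logits = np.zeros((5, 43))
labels = np.array([0, 1, 2, 3, 4])
logits[[0, 1, 3], [0, 1, 3]] = 10.0                # images 1, 2 and 4: true class scores highest
logits[[2, 4], 20:25] = [9.0, 8.0, 7.0, 6.0, 5.0]  # images 3 and 5: true class outside the top 5
probs = softmax(logits)
print(top_k_accuracy(probs, labels, 1), top_k_accuracy(probs, labels, 5))  # 0.6 0.6
```

Looking at `probs.max(axis=1)` for each image is a simple way to see how confident the network is, including on the images it gets wrong.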
Furthermore, the model appears to be quite confident in all of its predictions, even the wrong ones. On closer inspection, however, it is possible to see why it was fooled by images 3 and 5. Image 3 has a different scale from the speed limit signs in the dataset: the sign outline reaches the border of the image, whereas the training images contain a margin of background around the sign. In image 5 the sign is triangular, whereas the roundabout signs in the dataset are circular. The network appears to be strongly influenced by the sign shape. In addition, the arrows within the sign resemble the arrow of the left-turn sign, which is likely why that class was selected.
Through this project I first became familiar with TensorFlow, and then developed and fine-tuned a deep learning model for traffic sign classification that achieves reasonable accuracy. There are many ways to improve performance: first, presenting more samples to the network through data augmentation; second, trying deeper models with a more gradual reduction of the image resolution; third, applying more preprocessing techniques such as data whitening and image brightness normalization. You can find the GitHub repo for my project below. In addition, for the purposes of this project I have also created a quick guide to TensorFlow, which you can see below.
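Sermanet and LeCun enlarged their training set with small random perturbations in position, scale and rotation. As a starting point for the data augmentation suggested above, the sketch below implements only random translations of up to 2 pixels, filling the exposed border by repeating edge pixels. The function names, the 2-pixel limit and the edge padding are assumptions for illustration; scaling and rotation are omitted.

```python
import numpy as np


def random_shift(image, max_shift=2, rng=None):
    """Translate an (H, W, C) image by up to max_shift pixels vertically and horizontally."""
    rng = np.random.default_rng() if rng is None else rng
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    h, w = image.shape[:2]
    # Pad by repeating edge pixels, then crop an H x W window offset by (dy, dx)
    padded = np.pad(image, ((max_shift, max_shift), (max_shift, max_shift), (0, 0)), mode="edge")
    top, left = max_shift - dy, max_shift - dx
    return padded[top:top + h, left:left + w]


def augment(X, y, copies=1, max_shift=2, seed=0):
    """Return the dataset plus `copies` randomly shifted versions of every image."""
    rng = np.random.default_rng(seed)
    extra = [np.stack([random_shift(img, max_shift, rng) for img in X]) for _ in range(copies)]
    return np.concatenate([X] + extra), np.concatenate([y] * (copies + 1))


X_demo = np.random.default_rng(0).random((4, 32, 32, 1))  # hypothetical grayscale batch
y_demo = np.arange(4)
X_aug, y_aug = augment(X_demo, y_demo, copies=2)
print(X_aug.shape, y_aug.shape)  # (12, 32, 32, 1) (12,)
```

Generating the extra copies per class, rather than uniformly, would also be a way to even out the class imbalance.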
P. Sermanet and Y. LeCun, “Traffic Sign Recognition with Multi-Scale Convolutional Networks,” in Proc. International Joint Conference on Neural Networks (IJCNN), 2011.