UDACITY SDCE Nanodegree: Term 1 - Project 5: Vehicle Detection!

Christos Kyrkou
6 min read · Mar 21, 2017

In this project I developed an algorithmic pipeline capable of detecting and tracking vehicles.

The goals / steps of this project are the following:

  • Perform a Histogram of Oriented Gradients (HOG) feature extraction on a labeled training set of images and train a Linear SVM classifier
  • Optionally, you can also apply a color transform and append binned color features, as well as histograms of color, to your HOG feature vector.
  • Note: for those first two steps don’t forget to normalize your features and randomize a selection for training and testing.
  • Implement a sliding-window technique and use your trained classifier to search for vehicles in images.
  • Run your pipeline on a video stream (start with the test_video.mp4 and later implement on full project_video.mp4) and create a heat map of recurring detections frame by frame to reject outliers and follow detected vehicles.
  • Estimate a bounding box for vehicles detected.

Histogram of Oriented Gradients (HOG)

I started by loading in all the vehicle and non-vehicle images (KITTI and GTI). A total of ~8000 images per class were used. I split the data into training and test sets, using 20% of each class for testing, in order to get an indication of the false positive and false negative rates and see how the classifier responds to each class individually. Here is an example of each of the vehicle and non-vehicle classes:

Images of Car and non-Car Samples
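As a rough illustration, here is a minimal sketch of the loading and splitting step. The directory layout and file paths are hypothetical, not taken from my actual code:

```python
# Minimal sketch of the data loading and 80/20 split (paths are hypothetical).
import glob
import cv2
from sklearn.model_selection import train_test_split

car_paths = glob.glob('vehicles/**/*.png', recursive=True)        # KITTI + GTI car crops
notcar_paths = glob.glob('non-vehicles/**/*.png', recursive=True)

# OpenCV reads BGR; convert to RGB so all later processing is consistent.
cars = [cv2.cvtColor(cv2.imread(p), cv2.COLOR_BGR2RGB) for p in car_paths]
notcars = [cv2.cvtColor(cv2.imread(p), cv2.COLOR_BGR2RGB) for p in notcar_paths]

# Split each class separately so per-class error rates can be inspected.
cars_train, cars_test = train_test_split(cars, test_size=0.2, random_state=42)
notcars_train, notcars_test = train_test_split(notcars, test_size=0.2, random_state=42)
```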

I then explored different color spaces and different skimage.hog() parameters (orientations, pixels_per_cell, and cells_per_block). I also tried different classifiers; that comparison is covered in the Classifier Training section below, and a linear SVM was selected for the final solution. With regards to the feature space, I ended up using the YCrCb color space with a combination of HOG on all three color channels and color histogram features. The spatially binned color features resulted in more false positives, so I dropped them from the feature set. I also experimented with other color spaces, but YCrCb produced more robust results. The final HOG parameters are orientations=12, pixels_per_cell=(10, 10), and cells_per_block=(2, 2). An example using random images from each of the two classes is displayed below:

YCrCb channel representations and HOG transformations for a non-car image (left) and a car image (right)
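The sketch below illustrates this final feature extractor under the stated parameters; the function and variable names are illustrative rather than from the project code:

```python
# Sketch of the final feature extractor: HOG on all three YCrCb channels
# plus per-channel color histograms (spatial binning deliberately omitted).
import numpy as np
import cv2
from skimage.feature import hog

def extract_features(img_rgb, orientations=12, pix_per_cell=10,
                     cell_per_block=2, hist_bins=32):
    ycrcb = cv2.cvtColor(img_rgb, cv2.COLOR_RGB2YCrCb)
    hog_features = []
    for ch in range(3):
        hog_features.append(hog(ycrcb[:, :, ch],
                                orientations=orientations,
                                pixels_per_cell=(pix_per_cell, pix_per_cell),
                                cells_per_block=(cell_per_block, cell_per_block),
                                feature_vector=True))
    # Color histograms preserve some color information that raw HOG discards.
    hist_features = [np.histogram(ycrcb[:, :, ch], bins=hist_bins, range=(0, 256))[0]
                     for ch in range(3)]
    return np.concatenate(hog_features + hist_features)
```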

I tried various combinations of parameters for the HOG transform. First I started by using only the HOG transform on a grayscale image, without spatial binning or histograms. Even though this approach was fast, it produced a lot of false positives. Hence, I tried using all three channels for the HOG feature extraction. Of the different color transformations (RGB, HSV, LUV, HLS, YUV, YCrCb), YCrCb produced the highest accuracy on the test set and so was chosen as the final color transformation. However, after further testing on the test video the performance was still not acceptable, as differently colored cars were not detected reliably. Intuitively, it was clear that some color information needed to be preserved in order to have some robustness whenever HOG features are missed due to a specific car's color palette and contrast.

First, I tried using both spatial binning and color histograms. Even though the test set accuracy increased, the false positive rate on the test images also increased. I suspect this is because spatial binning may be too specific: it does not abstract the information as well and so may end up overfitting the classifier. Hence, for the final set of experiments I used only color histograms with HOG, and the results indeed improved.

Finally, I also experimented with the HOG parameters. Honestly, I did not notice much difference in accuracy for orientations between 8 and 12 or pixels_per_cell sizes between 8 and 10, so I ended up using orientations=12, pixels_per_cell=(10, 10), and cells_per_block=(2, 2), which marginally produced better results. One major difference appeared when using larger orientation values and larger cells_per_block sizes: in both cases the test time increased considerably due to the larger vector size, and the accuracy decreased, mainly because the feature space became much larger (>12,000 dimensions) than the actual training set size (~12,000 samples), which is not a desirable property for SVMs. The final extracted features are normalized using sklearn.preprocessing.StandardScaler(). The fitted scaler and trained classifier were saved in respective pickle files and loaded for use at run time. Results of the normalization process for random samples are shown below:

Initial image feature vectors (left) — normalization results for image features (right)
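A minimal sketch of this normalization and persistence step, reusing the extract_features helper from the sketch above (the pickle file name is an assumption):

```python
# Sketch of feature normalization and persistence (file name is hypothetical).
import pickle
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.vstack([extract_features(img)
               for img in cars_train + notcars_train]).astype(np.float64)
scaler = StandardScaler().fit(X)     # zero mean, unit variance per feature
X_scaled = scaler.transform(X)

with open('scaler.p', 'wb') as f:    # reloaded at run time by the pipeline
    pickle.dump(scaler, f)
```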

Classifier Training

I tried different classifiers (linear SVM, polynomial SVM, Naïve Bayes, and decision trees). The Naïve Bayes and decision tree classifiers consistently produced lower accuracy than the SVM (the code is commented out in train_classifier.py), while the polynomial SVM, although more accurate, was much slower at test time than the linear SVM. As a result, I used a linear SVM classifier, which achieved the best trade-off between speed and accuracy. I also played around with the C value of the linear SVM, varying it from 0.01 to 100, but did not notice any major changes in performance. Here are the detailed results of the classifier training.
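A minimal sketch of this training step, continuing from the scaled features above (the 1/0 labels and the evaluation snippet are my assumptions, not the project code):

```python
# Sketch of the linear SVM training step (1 = car, 0 = non-car).
import numpy as np
from sklearn.svm import LinearSVC

y_train = np.hstack([np.ones(len(cars_train)), np.zeros(len(notcars_train))])
svc = LinearSVC(C=1.0)               # varying C between 0.01 and 100 changed little
svc.fit(X_scaled, y_train)

# Evaluate on the held-out 20% per class to inspect false positives/negatives.
X_test = scaler.transform(
    np.vstack([extract_features(img) for img in cars_test + notcars_test]))
y_test = np.hstack([np.ones(len(cars_test)), np.zeros(len(notcars_test))])
print('Test accuracy: %.4f' % svc.score(X_test, y_test))
```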

Sliding Window Search

The sliding window search is implemented by calling the search_window routine successively for different image scales. For each scale, windows with an overlap of 0 are extracted and resized to 64x64 for classification. Choosing the actual sizes to search was a trial-and-error process. At first, with a single scale, the detection process was faster but many cars were missed because they appeared bigger or smaller than the searched size. In order to detect cars of multiple sizes, multiple scales needed to be searched. Four scales provided a good trade-off between processing speed (~6 seconds per frame) and the range of car sizes detected.

The windows searched at different scales
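A simplified sketch of the multi-scale search is shown below; the window sizes, the vertical search band, and the test image path are assumptions, and the real search_window routine may be organized differently:

```python
# Simplified multi-scale sliding-window search: windows at each scale are
# resized to 64x64 (the classifier's training resolution) and classified.
# Reuses extract_features, scaler, and svc from the sketches above.
import cv2
import numpy as np

def search_scale(img_rgb, window, y_start, y_stop, svc, scaler):
    hot_windows = []
    step = window                                # overlap of 0 between windows
    for y in range(y_start, y_stop - window + 1, step):
        for x in range(0, img_rgb.shape[1] - window + 1, step):
            patch = cv2.resize(img_rgb[y:y + window, x:x + window], (64, 64))
            feats = scaler.transform(extract_features(patch).reshape(1, -1))
            if svc.predict(feats)[0] == 1:
                hot_windows.append(((x, y), (x + window, y + window)))
    return hot_windows

# Hypothetical scales and road band; four scales balanced speed and coverage.
frame = cv2.cvtColor(cv2.imread('test_images/test1.jpg'), cv2.COLOR_BGR2RGB)
hot_windows = []
for window in (64, 96, 128, 192):
    hot_windows += search_scale(frame, window, 400, 656, svc, scaler)
```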

In the end, I searched on four scales using YCrCb 3-channel HOG features plus histograms of color in the feature vector, which provided a good result. Here are some detection examples:

Test Image Results

Heat-Map Optimizations

In order to track detections in the spatial as well as the temporal dimension, I implemented a rolling-window approach that tracks the heat-maps and filters out false positives. First, I process the first two frames with heat-map thresholds of 0 and 1 respectively. After that, I keep track of the last 3 frames' detections and heat-maps to eliminate false detections: for a region to be retained, it must have been detected at least 4 times within the rolling window. I then assumed each remaining blob corresponded to a vehicle and constructed bounding boxes to cover the area of each blob. Below are six successive frames with their corresponding heat-maps and the bounding boxes produced by the three-frame rolling window:

Successive Frames and Heat-Maps
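The sketch below shows the gist of this rolling-window filtering, using scipy's label to turn the thresholded heat-map into vehicle blobs. The threshold and frame count follow the description above, but the names are illustrative and the lower thresholds for the first two frames are omitted for brevity:

```python
# Rolling-window heat-map filtering: accumulate heat over the last three
# frames, threshold, then label the surviving blobs as vehicles.
from collections import deque
import numpy as np
from scipy.ndimage import label

history = deque(maxlen=3)                  # heat-maps of the last 3 frames

def heat_filter(frame_shape, hot_windows, threshold=4):
    # Note: the first two frames use thresholds 0 and 1 in the actual pipeline.
    heat = np.zeros(frame_shape[:2], dtype=np.float32)
    for (x1, y1), (x2, y2) in hot_windows:
        heat[y1:y2, x1:x2] += 1            # each detection adds one unit of heat
    history.append(heat)
    accum = np.sum(np.stack(history), axis=0)
    accum[accum < threshold] = 0           # keep regions detected >= 4 times
    return label(accum)                    # (labelled blobs, number of cars)

def blob_bboxes(labelled, n_blobs):
    # One tight bounding box per labelled blob.
    boxes = []
    for car in range(1, n_blobs + 1):
        ys, xs = np.nonzero(labelled == car)
        boxes.append(((xs.min(), ys.min()), (xs.max(), ys.max())))
    return boxes

labelled, n_cars = heat_filter(frame.shape, hot_windows)
bboxes = blob_bboxes(labelled, n_cars)
```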

Discussion

Overall, I observed that my approach worked well on the test video. There were some false detections, but nothing persistent; those could be removed by training with a bigger training set or by using more advanced methods such as convolutional neural networks. Also, in some cases a car is not detected as frequently, which can happen when its size is smaller than the sizes searched by the different scales. The major drawback of the approach is the processing time: as it stands, it is far too slow for real-time execution. Some form of GPU acceleration could be explored, or the image resolution could be reduced. Another approach would be to use a cascade of classifiers with increasing feature complexity, which can eliminate many regions very quickly. Overall, this project was very interesting and the techniques we learned have a very wide application spectrum. Below is a video demonstrating my approach.
