UDACITY SDCE Nanodegree Term 3 — Project 2: Advanced Deep Learning and Semantic Segmentation
The objective of this project is to label pixels corresponding to road in images using a Fully Convolutional Network (FCN).
Introduction
This module was a collaboration between UDACITY and NVIDIA's Deep Learning Institute. It covers semantic segmentation and inference optimization.
Semantic segmentation identifies free space on the road at pixel-level granularity, which improves the vehicle's decision-making ability. Inference optimizations accelerate the speed at which neural networks run, which is crucial for computationally intensive models like semantic segmentation networks.
The objective is to build and train fully convolutional networks (FCNs) that output an entire image instead of a single classification. For this purpose we implement three special techniques that FCNs use: 1x1 convolutions, upsampling via transposed convolutions, and skip connections.
Overview
Starting from canonical models like VGG, which are trained on millions of images and therefore provide strong features for visual understanding, the goal is to build a fully convolutional network for semantic segmentation that identifies free space on the road. We apply techniques such as 1x1 convolutions, transposed convolutions, and skip connections to create a model that classifies each pixel as road or not road. Inference is then accelerated using optimizations such as layer fusion, quantization, and reduced precision.
Transposed Convolution
Transposed convolutions upsample the previous layer to a higher resolution or dimension. Upsampling is a classic signal-processing technique that is often accompanied by interpolation. We can use a transposed convolution to transfer patches of data onto a sparse, larger matrix, and then fill in the sparse area of the matrix based on the transferred information.
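As a minimal sketch of this idea (assuming the TF 1.x layers API used in this era of the course; the tensor shapes here are hypothetical), a stride-2 transposed convolution doubles the spatial resolution of a feature map, with learned kernel weights playing the role of the interpolation:

```python
import tensorflow as tf

# Hypothetical input: a batch of 8x8 feature maps with 256 channels.
x = tf.placeholder(tf.float32, shape=(None, 8, 8, 256))

# A stride-2 transposed convolution doubles the spatial resolution:
# (batch, 8, 8, 256) -> (batch, 16, 16, 128). The kernel weights are
# learned, so the "interpolation" is trained rather than fixed.
upsampled = tf.layers.conv2d_transpose(
    x, filters=128, kernel_size=4, strides=2, padding='same')

print(upsampled.get_shape())  # (?, 16, 16, 128)
```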
Semantic Segmentation
The semantic segmentation network has an hour-glass structure and is composed of an encoder (feature extractor) and a decoder (image reconstruction). The encoder is the VGG16 model pretrained on ImageNet for classification, with its fully connected layers replaced by 1-by-1 convolutions. The decoder is composed of transposed convolutions that upscale the feature maps and produce the segmentation image.

To build the decoder we upsample the encoder output back to the original image size. The shape of the tensor after the final transposed convolution layer is 4-dimensional: (batch_size, original_height, original_width, num_classes).

Next we add skip connections to the model. To do this we combine the output of two layers: the output of the current layer and the output of a layer further back in the network, typically a pooling layer. The sketch below shows the full decoder.
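A minimal sketch of this decoder, assuming the three VGG16 encoder tensors are available as `vgg_layer3_out`, `vgg_layer4_out`, and `vgg_layer7_out` (hypothetical names; your graph may expose them differently) and using the TF 1.x layers API:

```python
import tensorflow as tf

def decoder(vgg_layer3_out, vgg_layer4_out, vgg_layer7_out, num_classes):
    """FCN-8-style decoder: 1x1 convolutions, transposed convolutions,
    and two skip connections from the VGG pooling layers."""
    # Replace fully connected layers with 1x1 convolutions so that
    # spatial information is preserved.
    conv7 = tf.layers.conv2d(vgg_layer7_out, num_classes, 1, padding='same')
    conv4 = tf.layers.conv2d(vgg_layer4_out, num_classes, 1, padding='same')
    conv3 = tf.layers.conv2d(vgg_layer3_out, num_classes, 1, padding='same')

    # Upsample by 2 and add the skip connection from the pool4 branch.
    up1 = tf.layers.conv2d_transpose(conv7, num_classes, 4,
                                     strides=2, padding='same')
    skip1 = tf.add(up1, conv4)

    # Upsample by 2 again and add the skip connection from pool3.
    up2 = tf.layers.conv2d_transpose(skip1, num_classes, 4,
                                     strides=2, padding='same')
    skip2 = tf.add(up2, conv3)

    # Final upsampling by 8 back to the input resolution; the output is
    # (batch_size, original_height, original_width, num_classes).
    return tf.layers.conv2d_transpose(skip2, num_classes, 16,
                                      strides=8, padding='same')
```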
The final step is to define a loss, so that we can approach training an FCN just like we would train a normal classification CNN. In the case of an FCN, the goal is to assign each pixel to the appropriate class, and we already happen to know a great loss function for this setup: cross entropy loss! Remember the output tensor is 4D, so we have to reshape it to 2D first. That's it, we now have an end-to-end model for semantic segmentation. Time to get training!
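Concretely, the reshape-then-cross-entropy step might look like the sketch below (hypothetical tensor names; TF 1.x API, with Adam chosen here as an illustrative optimizer):

```python
import tensorflow as tf

def optimize(nn_last_layer, correct_label, learning_rate, num_classes):
    # Flatten the 4D tensors (batch, height, width, num_classes) into 2D
    # (pixels, num_classes) so every pixel becomes a classification sample.
    logits = tf.reshape(nn_last_layer, (-1, num_classes))
    labels = tf.reshape(correct_label, (-1, num_classes))

    # Standard cross entropy loss, exactly as in image classification.
    cross_entropy_loss = tf.reduce_mean(
        tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))

    train_op = tf.train.AdamOptimizer(learning_rate).minimize(cross_entropy_loss)
    return logits, train_op, cross_entropy_loss
```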
Training was carried out on Amazon Web Services using the KITTI Road Dataset. The deep learning framework used was TensorFlow, with Python and SciPy for data manipulation. The final network was able to identify road segments in the images. Examples are shown in the video below.