Detecting Objects in Images with Machine Learning Tools

Object detection determines whether an object of interest is present in an image or video frame. It is a necessary task for embedded vision systems because it lets them interact more intelligently with their host environment and makes them more responsive to, and aware of, their surroundings.

The detection of visual objects is a perceptual and cognitive task fundamental to vision and intelligence. It is useful across a wide range of embedded applications, including robotics, surveillance and census systems, human-computer interaction, intelligent transport systems, and military systems.

Object Detection Process

Visual object detection must determine whether an object of interest is present in an image or video frame regardless of its size, orientation, and the environmental conditions in which it is found. This high degree of variability makes it difficult to describe an object analytically with a step-by-step algorithm. Hence, object detection is typically treated as a machine learning pattern recognition problem in which the goal is to classify a given image as object or non-object. Several methods are used to perform object detection, the most notable of which are:

  • Knowledge-based methods: These techniques are based on rules that codify the human knowledge about the object of interest and its characteristics.
  • Feature invariant techniques: These methods consist of finding structural object features that remain invariant regardless of pose, lighting conditions, or viewpoint.
  • Template matching methods: Several standard patterns/models are stored to describe the object as a whole or as different components. The correlations between the stored models and input are computed to perform detection.
  • Appearance-based methods: In contrast to template matching methods, the models are learned from examples of objects and non-objects through supervised machine learning algorithms (Support Vector Machines, Neural Networks, etc.), which find relationships between data instances and classes to capture the variability of visual appearance.
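
To make the correlation step of template matching concrete, here is a minimal sketch in plain Python. It assumes grayscale images stored as lists of rows and uses zero-mean normalized cross-correlation as the similarity score; that score is one common choice rather than something the methods above prescribe, and the function names are hypothetical.

```python
import math


def normalized_correlation(patch, template):
    """Zero-mean normalized cross-correlation of two equally sized 2-D patches.

    Both arguments are lists of rows of numbers (an assumed representation).
    The result lies in [-1, 1]; 1 means a perfect match up to gain and offset.
    """
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    if len(a) != len(b):
        raise ValueError("patch and template must have the same size")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    numerator = sum(x * y for x, y in zip(da, db))
    denominator = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    # A flat patch has zero variance; treat it as "no correlation".
    return numerator / denominator if denominator else 0.0


def best_match(image, template):
    """Slide the template over the image and return (score, (row, col))."""
    t_h, t_w = len(template), len(template[0])
    best_score, best_pos = -2.0, (0, 0)
    for r in range(len(image) - t_h + 1):
        for c in range(len(image[0]) - t_w + 1):
            patch = [row[c:c + t_w] for row in image[r:r + t_h]]
            score = normalized_correlation(patch, template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_score, best_pos
```

A detector built this way would typically compare the best score against a threshold to decide between object and non-object; the threshold value is application specific.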

The above methods are interrelated and can be used together to provide higher and more robust detection accuracy. Appearance-based methods are the most popular detection approaches, but they are often combined with knowledge-based or feature invariant techniques to improve detection performance. Appearance-based methods obtain good results because they generalize well, provided that the training set captures the variability in object appearance and the chosen features are sufficiently descriptive. Moreover, they incur a lower computational cost than other methods.

Sliding-Window-based Object Detection

The overall visual object detection process begins by receiving an input image or video frame from a camera or other suitable image source. The frame is then searched for possible objects of interest. The search extracts smaller regions of m × n pixels, called search windows, which a classification algorithm processes to decide whether they belong to the object-of-interest class. The search window size is chosen to match the size of the object of interest, so the classification algorithm learns to categorize search windows of one particular size.

However, the object of interest may appear in the frame at a size larger than the search window, in which case the classification algorithm will not be able to detect it. To account for this, an object detection system may either increase the size of the search window, or decrease the size of the input image (downscaling), which effectively shrinks the object of interest, and then reexamine the downscaled image with the same search window size. The latter is often preferred because it is more efficient: the former requires training many classifiers, one for each window size, and processing large images as the window size increases, whereas the latter requires training only a single classifier for the targeted window size.

Downscaling is done in steps to account for various object sizes, down to the size of the search window, and each step maps old pixel coordinates to new ones using a scaling factor. Hence, many downscaled images are produced from a single input frame, each in turn producing a number of search windows, which increases the amount of data that the classification algorithm must process.

Search windows can be extracted at every pixel location in the image (exhaustively) or every few pixels. The distance between successive search windows is called the window step. The window step is application specific and depends on the size of the object of interest. Small objects can appear within a few pixels of each other, so a small window step is usually chosen for them, whereas for larger search windows the window step can be increased.
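
The search described above can be sketched in a few lines of plain Python. This is an illustrative sketch under stated assumptions, not a reference implementation: images are lists of rows, downscaling uses nearest-neighbor coordinate mapping, and the function names and the default scaling factor of 1.25 are hypothetical choices.

```python
def pyramid_levels(width, height, win_w, win_h, scale=1.25):
    """List (factor, scaled_width, scaled_height) for each downscaling step.

    Downscaling stops once the image no longer fits a single search window.
    """
    if scale <= 1.0:
        raise ValueError("scale must be greater than 1")
    levels = []
    factor = 1.0
    while True:
        w, h = int(width / factor), int(height / factor)
        if w < win_w or h < win_h:
            break
        levels.append((factor, w, h))
        factor *= scale
    return levels


def window_positions(width, height, win_w, win_h, step):
    """Top-left (x, y) corners of all search windows for a given window step."""
    return [(x, y)
            for y in range(0, height - win_h + 1, step)
            for x in range(0, width - win_w + 1, step)]


def downscale(image, factor):
    """Nearest-neighbor downscaling: new (x, y) reads old (x*factor, y*factor)."""
    old_h, old_w = len(image), len(image[0])
    new_h, new_w = int(old_h / factor), int(old_w / factor)
    return [[image[min(int(y * factor), old_h - 1)][min(int(x * factor), old_w - 1)]
             for x in range(new_w)]
            for y in range(new_h)]


def count_windows(width, height, win_w, win_h, step, scale=1.25):
    """Total number of search windows the classifier must process per frame."""
    return sum(len(window_positions(w, h, win_w, win_h, step))
               for _, w, h in pyramid_levels(width, height, win_w, win_h, scale))
```

For example, `count_windows(64, 48, 24, 24, step=8, scale=2.0)` counts 24 windows at the original size plus 2 at half size. A smaller window step or a scaling factor closer to 1 raises that total, which is the data growth the text refers to.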

Each window extracted from the image is processed to account for different lighting conditions and other environmental variations, or to extract meaningful features that are used for classification. These features can be shape, color, intensity, or the responses of various filters and feature extraction algorithms (edges, local binary patterns, Haar wavelets, histograms, etc.). Using features makes the detection process more robust because they provide a more representative description of the object and reduce the within-class variability. However, adding feature extraction and preprocessing steps can reduce classification speed even though accuracy can improve.
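
As a rough illustration of both steps, the sketch below normalizes a window to zero mean and unit variance to reduce sensitivity to lighting, and computes a basic 3 × 3 local binary pattern histogram as a feature vector. The neighbor ordering, the `>=` comparison, and the function names are assumptions made for this example; practical LBP variants differ in these details.

```python
def normalize_window(window):
    """Scale pixel values to zero mean and unit variance (lighting compensation)."""
    values = [v for row in window for v in row]
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    std = variance ** 0.5 or 1.0  # avoid division by zero on flat windows
    return [[(v - mean) / std for v in row] for row in window]


# Offsets of the 8 neighbors, clockwise starting at the top-left pixel.
_NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]


def lbp_code(window, r, c):
    """8-bit local binary pattern of the pixel at (r, c).

    Bit i is set when the i-th neighbor is at least as bright as the center.
    """
    center = window[r][c]
    code = 0
    for bit, (dr, dc) in enumerate(_NEIGHBORS):
        if window[r + dr][c + dc] >= center:
            code |= 1 << bit
    return code


def lbp_histogram(window):
    """256-bin histogram of LBP codes over all interior pixels of the window."""
    histogram = [0] * 256
    for r in range(1, len(window) - 1):
        for c in range(1, len(window[0]) - 1):
            histogram[lbp_code(window, r, c)] += 1
    return histogram
```

Because each LBP code depends only on brightness comparisons, adding a constant to every pixel leaves the histogram unchanged, which is one way features reduce variability within the object class. The extra pass over every window is also where the speed cost mentioned above comes from.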


It is important to consider the metrics used to measure the performance of an object detection system. Such a system is characterized by how accurately it classifies data and by how many image frames it can process per second. Thus, the two commonly used performance metrics are detection accuracy and frames per second (FPS). Detection accuracy is usually measured on a given test set, where the expected outcome for each sample is compared with the actual outcome of the detection system. The detection accuracy is the percentage of samples for which the two match. FPS concerns the throughput of a system and is the maximum number of video or image frames of a given size that the detection system can process in one second. A minimum of 30 FPS is often required for an object detection system to be capable of real-time video processing. However, depending on the application, higher frame rates may be necessary, which demands higher system performance. This is typically the case if other image processing and recognition algorithms have to coexist with detection, or if multiple video feeds from different sources need to be processed.
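
Both metrics can be computed with a few lines of Python. The sketch below is illustrative: the function names are hypothetical, labels are assumed to be directly comparable values (for example 1 for object and 0 for non-object), and `process_frame` stands for whatever detection routine is being timed.

```python
import time


def detection_accuracy(expected, actual):
    """Percentage of test samples whose actual outcome matches the expected one."""
    if not expected or len(expected) != len(actual):
        raise ValueError("expected and actual must be non-empty and of equal length")
    matches = sum(e == a for e, a in zip(expected, actual))
    return 100.0 * matches / len(expected)


def measure_fps(process_frame, frames):
    """Frames processed per second when running process_frame over all frames."""
    start = time.perf_counter()
    for frame in frames:
        process_frame(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")
```

For instance, `detection_accuracy([1, 0, 1, 1], [1, 0, 0, 1])` gives 75.0. A 30 FPS target leaves 1/30 of a second, roughly 33 ms, for all processing of a single frame, so any other algorithm sharing the pipeline has to fit in the same budget.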

PhD in Computer Engineering, Self-Driving Car Engineering Nanodegree, Computer Vision, Visual Perception and Computing
