DeepCamera: Following Targets Without Bounding Boxes… End-to-end active vision

Christos Kyrkou
5 min read · Mar 13, 2021

This work proposes a supervised learning technique to control active cameras with Deep Convolutional Neural Networks to go directly from visual information to camera movement.

Active Camera Systems

Active vision systems (i.e., movable cameras with controllable parameters such as pan and tilt) can provide extended coverage, flexibility, and cost-efficiency compared to static vision systems running in the cloud. This is particularly appealing for rapid deployment of cameras or for temporary installations at particular events. The cost saving over buying and installing a hard-wired system for temporary or remote locations can be substantial. In addition, automated vision processing is necessary since there is a limit to the number of cameras that a human operator can monitor and control, so it is important to provide the operator with only relevant, aggregated data and information about certain events. By leveraging recent advances in deep learning through convolutional neural networks, we can enable advanced perception through efficient, optimized ConvNets. We can also enable direct control of cameras for active vision without relying on hand-crafted pipelines that chain together detection, tracking, and control.

Active Mobile and Autonomous Vision Systems

Visual Active Monitoring

The objective of active vision, in contrast to a static camera setting, is to change the control parameters of a camera in order to maximize a performance objective, such as keeping targets in its field-of-view (FoV), and so improve the overall surveillance and monitoring capabilities. To achieve this we need to obtain the displacement between the center of mass of one or more targets and the camera image center. We assume that all targets have equal importance and that the goal is to monitor as many targets as possible. Traditionally, the active monitoring task has been handled through a pipeline of modules such as detection, filtering, tracking, and control, which is difficult to optimize jointly. In this work we frame active visual monitoring as an imitation learning problem to be solved in a supervised manner using deep learning, going directly from visual information to camera movement.
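As a rough illustration of the displacement described above, the following sketch computes the offset between the targets' center of mass and the image center. The function name `camera_displacement`, the use of point positions for targets, and the normalization to [-1, 1] per axis are assumptions made for this example, not details taken from the papers.

```python
from typing import List, Tuple


def camera_displacement(targets: List[Tuple[float, float]],
                        width: int, height: int) -> Tuple[float, float]:
    """Offset from the image center to the targets' center of mass.

    Each target is an (x, y) pixel position and all targets are weighted
    equally. The offset is normalized by half the image size (assumption),
    so a target on the image border maps to -1 or 1.
    """
    if not targets:
        return 0.0, 0.0  # nothing to follow, so no movement
    cx = sum(x for x, _ in targets) / len(targets)
    cy = sum(y for _, y in targets) / len(targets)
    dx = (cx - width / 2) / (width / 2)
    dy = (cy - height / 2) / (height / 2)
    return dx, dy


# Two targets whose center of mass lies right of and below the image center.
dx, dy = camera_displacement([(100, 120), (300, 200)], width=320, height=240)
```

A positive `dx` here means the targets sit to the right of the center, so the camera would need to pan right to re-center them.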

Visual Active Monitoring Problem

Imitation Learning

We tackle this problem using behavioral cloning, an imitation learning approach that focuses on learning the expert’s policy using supervised learning:

1. Collect demonstrations from an expert.

2. Treat the demonstrations as i.i.d. state-action pairs: (s0, a0), (s1, a1), …

3. Learn a policy π using supervised learning by minimizing the loss function L(a, π(s)).

Such an approach is useful when it is easier for an expert to demonstrate the desired behaviour than to specify a reward function that would generate the same behaviour or to directly learn the policy.
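The three behavioral cloning steps can be sketched on a toy problem. This is a minimal illustration under strong assumptions: the state is a single number (a target offset) rather than an image, the hypothetical `expert_policy` is a simple proportional controller, and the learned policy is a line fitted by gradient descent rather than a CNN.

```python
import random


def expert_policy(state: float) -> float:
    """Hypothetical expert: move proportionally toward the target offset."""
    return 0.8 * state


# 1. Collect demonstrations from the expert.
rng = random.Random(0)
states = [rng.uniform(-1.0, 1.0) for _ in range(200)]
demos = [(s, expert_policy(s)) for s in states]

# 2. Treat the demonstrations as i.i.d. (state, action) pairs, and
# 3. fit a policy pi(s) = w * s + b by minimizing the mean squared error
#    L(a, pi(s)) = (pi(s) - a)^2 with plain gradient descent.
w, b, lr = 0.0, 0.0, 0.1
for _ in range(500):
    grad_w = sum(2 * (w * s + b - a) * s for s, a in demos) / len(demos)
    grad_b = sum(2 * (w * s + b - a) for s, a in demos) / len(demos)
    w -= lr * grad_w
    b -= lr * grad_b


def learned_policy(state: float) -> float:
    """Policy recovered purely from the expert's demonstrations."""
    return w * state + b
```

No reward function appears anywhere in the sketch: the learner only sees what the expert did in each state, which is the property that makes the approach attractive for camera control.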

Training Data Generation

Proper training and testing data are necessary to train a deep CNN regressor for the visual active monitoring task. For this reason, a framework was developed that allows for i) simulating the behaviour of active cameras using real-world images, ii) capturing and storing multiple frame sequences with ground-truth data that can be used for bounding box, density, and camera control tasks, and iii) evaluating the performance of active vision algorithms in a realistic environment under similar conditions and in controlled experiments.
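One plausible way to simulate an active camera on real-world images is to treat the camera view as a smaller window that moves inside a larger recorded frame. The sketch below illustrates that idea only; the class `SimulatedCamera`, its fields, and the `max_step` parameter are hypothetical and are not taken from the framework described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SimulatedCamera:
    """A view window that pans and tilts inside a larger recorded frame."""
    frame_w: int
    frame_h: int
    view_w: int
    view_h: int
    x: int = 0  # left edge of the view window, in frame coordinates
    y: int = 0  # top edge of the view window, in frame coordinates

    def move(self, dx: float, dy: float, max_step: int = 20) -> None:
        """Shift the window by a normalized command in [-1, 1] per axis."""
        new_x = self.x + round(dx * max_step)
        new_y = self.y + round(dy * max_step)
        # Keep the window inside the recorded frame.
        self.x = min(max(new_x, 0), self.frame_w - self.view_w)
        self.y = min(max(new_y, 0), self.frame_h - self.view_h)

    def crop(self, frame: List[List[int]]) -> List[List[int]]:
        """Return the pixels the simulated camera currently sees."""
        rows = frame[self.y:self.y + self.view_h]
        return [row[self.x:self.x + self.view_w] for row in rows]

    def ground_truth(self, targets: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
        """Targets inside the window, converted to view coordinates."""
        return [(tx - self.x, ty - self.y) for tx, ty in targets
                if self.x <= tx < self.x + self.view_w
                and self.y <= ty < self.y + self.view_h]


cam = SimulatedCamera(frame_w=768, frame_h=576, view_w=320, view_h=240)
cam.move(1.0, 0.5)  # pan right at full speed, tilt down at half speed
visible = cam.ground_truth([(30, 40), (500, 500)])
```

Because the full frame and all target positions are known, each cropped view can be stored together with its ground truth, which is what makes the sequences usable for supervised training.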

Active Camera Deep Controller Network (ACDCNet)

The active camera deep controller network was designed to be computationally efficient. It consists of two parts:

Convolutional Feature Extractor: The layers perform feature extraction and were chosen empirically through a series of experiments that varied the layer configuration. There are 7 major blocks, each composed of a convolutional layer with Leaky ReLU activation (α = 0.3) and a batch-normalization layer, with dropout applied. The first 3 layers downsample the image to reduce the computational cost and have a small number of filters; overall, no layer has more than 128 filters.

Controller Subnetwork: The controller subnetwork is composed of both convolutional and fully-connected layers. The idea is that the convolutional layers condense the information from the feature extractor, and the final convolution then converts the feature map into a vector.

Overall, the network has a total of 386,000 parameters and requires 4MB, making it lightweight enough to run even on low-end CPUs.
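To see why capping the number of filters keeps the network small, the parameter count of a stack of convolutional blocks can be worked out by hand. The filter list and 3×3 kernels below are hypothetical and are NOT the published ACDCNet configuration, so the total will not match the figure quoted above; the sketch only shows how such a count is obtained.

```python
def conv_params(kernel: int, c_in: int, c_out: int, bias: bool = True) -> int:
    """Trainable weights of a 2D convolution with a square kernel."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)


def batchnorm_params(channels: int) -> int:
    """Trainable scale and shift per channel (running statistics excluded)."""
    return 2 * channels


# Hypothetical 7-block layout with at most 128 filters per layer.
filters = [16, 16, 32, 64, 64, 128, 128]

total = 0
c_in = 3  # RGB input image
for c_out in filters:
    total += conv_params(3, c_in, c_out) + batchnorm_params(c_out)
    c_in = c_out
```

The last two blocks dominate the total because the count grows with the product of input and output channels, which is why a cap on the filter count matters more than the number of blocks.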

Active Camera Deep Controller Network (ACDCNet)

The output of the controller subnetwork is further processed through a clipped linear activation that bounds the output to the range [-1, 1] to estimate the motion in the horizontal and vertical directions more effectively. The third output neuron, which regresses the number of targets, uses a ReLU function to discard negative numbers.
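The activations mentioned for the network can be written out in a few lines. This is a plain-Python sketch of the functions themselves; the helper `controller_outputs` and the ordering of the three output neurons are assumptions made for illustration.

```python
from typing import Sequence, Tuple


def leaky_relu(x: float, alpha: float = 0.3) -> float:
    """Hidden-layer activation: keep positives, scale negatives by alpha."""
    return x if x > 0 else alpha * x


def clipped_linear(x: float, low: float = -1.0, high: float = 1.0) -> float:
    """Identity inside [low, high], saturating at the bounds."""
    return max(low, min(high, x))


def relu(x: float) -> float:
    """Discard negative values, used here for the target count."""
    return max(0.0, x)


def controller_outputs(raw: Sequence[float]) -> Tuple[float, float, float]:
    """Map three raw output neurons to (horizontal, vertical, target count)."""
    dx_raw, dy_raw, count_raw = raw
    return clipped_linear(dx_raw), clipped_linear(dy_raw), relu(count_raw)


# An over-large horizontal command saturates; a negative count becomes zero.
outputs = controller_outputs([1.7, -0.4, -0.2])
```

Bounding the motion outputs keeps a single bad prediction from commanding an arbitrarily large camera movement.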


We want to investigate how many targets are monitored by different approaches over time. Since there is no available dataset for this task, we repurpose the PETS2009 surveillance dataset. For comparison, we use methods commonly used to build active monitoring pipelines: the SVM person detector that is available in OpenCV and commonly used in low-power applications, and the YOLO detection network. In both cases we employ tracking with Kalman filtering, which is commonly used in relevant works, for the video experiments. In addition, we compare with an oracle approach that knows the localization information for each target.
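The quantity being compared, targets monitored over time, can be sketched as a per-frame count of targets that fall inside the camera's view, averaged over the sequence. The function names and the `(left, top, width, height)` window representation are assumptions made for this example; the exact metric definition is given in the papers.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Window = Tuple[float, float, float, float]  # (left, top, width, height)


def targets_in_view(targets: List[Point], window: Window) -> int:
    """Count targets whose position lies inside the view window."""
    left, top, w, h = window
    return sum(1 for x, y in targets
               if left <= x < left + w and top <= y < top + h)


def mean_targets_monitored(frames: List[List[Point]],
                           windows: List[Window]) -> float:
    """Average number of targets in view per frame over a sequence."""
    counts = [targets_in_view(t, w) for t, w in zip(frames, windows)]
    return sum(counts) / len(counts) if counts else 0.0


# Two frames with three targets each; the second view is wider.
frames = [[(10, 10), (50, 50), (200, 200)],
          [(20, 10), (60, 50), (210, 200)]]
windows = [(0, 0, 100, 100), (0, 0, 250, 250)]
score = mean_targets_monitored(frames, windows)
```

Each control method produces its own sequence of view windows for the same footage, so the scores are directly comparable across methods.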

As we can see, the oracle approach with ground-truth information follows 3 targets per frame. The tracking components for the YOLO and SVM detectors manage to increase the performance of each respective detection method, an effect that was not observed in the previous experiment. The proposed CNN regressor manages to slightly beat the other methods and comes close to the oracle. From the results we can see that the worse performance of YOLO and SVM can be attributed to their overreliance on bounding boxes, where missed detections and inaccurate localizations degrade performance.

Results on Image Sequences

Compared to an explicit decomposition of the problem into stages such as people detection, state estimation, and control, the proposed end-to-end system optimizes all processing steps simultaneously and learns the best features to associate with camera control for visual active tracking purposes.


The objective of this work was to go directly from visual information to camera movement. The problem is formulated in a way that it can be solved by deep learning in an end-to-end manner. A CNN tailored for active monitoring tasks for smart cameras takes raw video frames as input, outputs the change in camera movement along the pan and tilt axes, and counts the number of targets. Overall, this is a less complex approach with reduced computational cost after training that can lead to simpler smart camera system implementations.

More information can be found in the papers:


[1] Christos Kyrkou, “Imitation-Based Active Camera Control with Deep Convolutional Neural Network”, IEEE International Conference on Image Processing Applications and Systems, December 2020.

[2] Christos Kyrkou, “C³Net: end-to-end deep learning for efficient real-time visual active camera control”, Journal of Real-Time Image Processing, 2021.



Christos Kyrkou

Research Lecturer at the KIOS Research & Innovation Center of Excellence at the University of Cyprus. Research in the areas of Computer Vision and Deep Learning