Tuesday, 17.12.2013, 11:30
Understanding human attention has interested researchers for decades. Early work comes
from different fields of psychology and separates the cognitive process into
several steps. Different models of attention in static scenes have emerged and
evolved into models of dynamic saliency. Alongside these, there are extensive cinematographic
theories on how a scene should be watched or filmed. There is also a long-term
research interest in view selection for static and dynamic scenes, with different
methods proposing how to place a camera in a scene and how to move it.
The central contribution of this research is a novel approach to video saliency
modeling. We propose a model that can effectively predict human attention in
any given video. The model is learned from human examples, so our second
contribution is an effective method for large-scale collection of gaze data. We adapt
our model to multiple-camera scenarios by proposing an approach for view selection
based on fixed cameras. As the last contribution, we propose a method to shift human
attention by inlaying artificial objects into a video.
Our model for video saliency is based on modeling gaze as attention shifts between
consecutive video frames. This differs from analyzing each frame independently,
as was often done before, and allows us to maintain temporal stability of the saliency
maps. We incorporate static, motion, and semantic features from the video to propagate
a saliency map from one frame to the next. We show that this model more closely matches
the behavior of the human eye.
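
The idea of propagating saliency between consecutive frames can be illustrated with a minimal sketch. It is not the learned model described here: the function name `propagate_saliency`, the candidate representation, the hand-set `weights`, and the Gaussian shift term with `sigma` are all assumptions made for illustration. Each candidate location in the next frame receives attention in proportion to its combined static, motion, and semantic score and to how easily attention can shift to it from the previous frame's salient locations.

```python
import math

def propagate_saliency(prev, candidates, weights=(1.0, 1.0, 1.0), sigma=50.0):
    """Propagate a sparse saliency map from frame t to frame t+1.

    prev:       list of ((x, y), probability) pairs for frame t.
    candidates: list of dicts for frame t+1 with keys
                'pos', 'static', 'motion', 'semantic' (hypothetical layout).
    Returns a list of ((x, y), probability) pairs for frame t+1.
    """
    scores = []
    for cand in candidates:
        cx, cy = cand["pos"]
        # Feature term: weighted sum of the three cue types.
        # The weights are hand-set here; a real model would learn them.
        feature = (weights[0] * cand["static"]
                   + weights[1] * cand["motion"]
                   + weights[2] * cand["semantic"])
        # Shift term: attention arriving from previous locations,
        # with short gaze shifts favored over long ones (assumed Gaussian).
        arrival = sum(
            p * math.exp(-((cx - px) ** 2 + (cy - py) ** 2) / (2 * sigma ** 2))
            for (px, py), p in prev
        )
        scores.append(feature * arrival)

    total = sum(scores)
    if total == 0:
        # No evidence at all: fall back to a uniform map.
        return [(c["pos"], 1.0 / len(candidates)) for c in candidates]
    return [(c["pos"], s / total) for c, s in zip(candidates, scores)]


prev_map = [((100.0, 100.0), 1.0)]
next_candidates = [
    {"pos": (110.0, 100.0), "static": 0.5, "motion": 0.5, "semantic": 0.0},
    {"pos": (400.0, 100.0), "static": 0.5, "motion": 0.5, "semantic": 0.0},
]
next_map = propagate_saliency(prev_map, next_candidates)
```

Because the output of one step is the input to the next, the resulting maps change gradually over time, which is the temporal stability that per-frame analysis lacks.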
Since our saliency model is learned from a large database of human gaze tracks, we
additionally propose a method to collect them from any number of participants. The
method employs crowdsourcing and makes it possible to record gaze locations on any
number of frames of any video. Unlike traditional gaze-tracking methods,
our method does not require any special equipment, and participants are not restricted
by geography or culture.
As an approach to multiple-camera setups, we propose a method for efficient viewpoint
selection from any set of cameras that view the same scene. Since placing a camera
at a specified location usually requires 3D data about the scene, our method works with
fixed cameras instead. It ranks the cameras according to the visibility of the actions
happening in the scene. After the best view is selected, the video saliency method can
be applied to the resulting set of frames.
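
Ranking fixed cameras by action visibility can be sketched as follows. The per-frame visibility scores are assumed to be given (how they are computed is not specified here), and the function name `rank_cameras` and the use of a simple mean are hypothetical choices for illustration only.

```python
def rank_cameras(visibility):
    """Rank cameras from best to worst view of the action.

    visibility: dict mapping a camera id to a list of per-frame
                visibility scores (higher means the action is seen better).
                The scores themselves are an assumed input.
    Returns camera ids sorted by mean visibility, best first.
    """
    # Cameras with no scored frames are skipped rather than ranked.
    mean_score = {
        cam: sum(scores) / len(scores)
        for cam, scores in visibility.items()
        if scores
    }
    return sorted(mean_score, key=mean_score.get, reverse=True)


ranking = rank_cameras({
    "cam_a": [0.2, 0.4],
    "cam_b": [0.9, 0.7],
    "cam_c": [0.5, 0.5],
})
best_view = ranking[0]
```

The frames from `best_view` would then be the input to the saliency step.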
We further wish to edit the input video and shift the viewer's attention. To do
so, we propose a user-friendly system for seamless inlaying of any 3D object into any
video. We model the video as a single image, ask the user to place the object at the
desired location, and then render it back into the video.
To verify the proposed methods, we test them on known video datasets and on real-life
videos. We compare our results quantitatively to state-of-the-art methods and
outperform them. Additionally, we present qualitative tests showing that our results are
more visually appealing than those of previous approaches.