Pixel Club: Video Saliency and its Applications in Single and Multi-camera Setups

Speaker:
Dmitry Rudoy (Technion)
Date:
Tuesday, 17.12.2013, 11:30
Place:
EE Meyer Building 1061

Understanding human attention has interested researchers for decades. Early work came from different fields of psychology and separated the cognitive process into several steps. Models of attention in static scenes emerged and later evolved into models of dynamic saliency. In parallel, there are extensive cinematographic theories on how a scene should be watched or filmed, and there is a long-standing research interest in view selection for static and dynamic scenes, with different methods proposing how to place a camera in a scene and how to move it.

The central contribution of this research is a novel approach to video saliency modeling. We propose a model that can effectively predict human attention in any particular video. The system is learned from human examples, so our second contribution is an effective method for large-scale collection of gaze data. We adapt our model to multi-camera scenarios by proposing an approach for view selection based on fixed cameras. As the last contribution, we propose a method to shift human attention by inlaying artificial objects into a video.

Our model for video saliency is based on modeling gaze as attention shifts between consecutive video frames. This differs from analyzing each frame independently, as was often done before, and allows us to maintain the temporal stability of the saliency maps. We incorporate static, motion and semantic features from the video to propagate a saliency map from one frame to the next. We show that this model better matches the behavior of the human eye.

Since our saliency model is learned from a large database of human gaze tracks, we additionally propose a method to collect them from any number of participants. The method employs crowdsourcing and allows recording gaze locations on any number of frames of any video. Unlike traditional gaze tracking methods, our method does not require any special equipment, and participants are not limited by geography or culture.

As an approach to multi-camera setups, we propose a method for efficient viewpoint selection from any set of cameras that view the same scene. Because placing a camera at a specified location usually requires 3D knowledge of the scene, our method works with fixed cameras. It is capable of ranking the cameras according to the visibility of the actions happening in the scene. After the best view is selected, the video saliency method can be applied to the resulting set of frames.

We further wish to edit the input video so as to shift viewers' attention. To do so, we propose a user-friendly system for seamless inlaying of any 3D object into any video. We model the video as a single image, ask the user to add the object in the desired place, and then render it back into the video.

To verify the proposed methods, we test them on known video datasets and on real-life videos. We compare our results quantitatively to state-of-the-art methods and outperform them. Additionally, we present qualitative tests showing that our results are more visually appealing than those of previous approaches.
