Technical Report PHD-2013-01

Title: Understanding Events in Video
Authors: Gal Lavee
Supervisors: Professor Ehud Rivlin
Abstract: Video events are the high-level semantic concepts that humans perceive when observing a video sequence. Understanding these concepts is the highest-level task in computer vision. It relies on adequate solutions to many lower-level tasks such as edge detection, optical flow estimation, object recognition, object classification and tracking. The maturity of many solutions to these low-level problems has spurred additional interest in using them for higher-level tasks such as video event understanding. In this thesis we map the diverse literature in the research domain of video event understanding. First we construct a taxonomy of this research domain and apply it to categorize many leading works. The terminology of the video event understanding research domain is often confusing and ambiguous. Terms such as “events”, “actions”, “activities”, and “behaviors” are used in different ways across the literature. In this thesis we provide an in-depth discussion of this ambiguity and suggest a terminology, applied throughout the remainder of the thesis, that allows unification and comparison of the various works in this research domain. Our contribution to the research domain of video event understanding focuses on events defined by complex temporal relationships among their sub-event building blocks. We explore the representative power of the Petri Net formalism to model these events. Our early work describes an approach for modeling scenes in which Petri Net place nodes represent the states that scene objects may take on, Petri Net transition nodes represent changes in the properties of scene objects, and Petri Net tokens represent the scene objects themselves. Recognition of events is achieved deterministically by tracking the properties of scene objects and propagating their representative tokens throughout the Petri Net model of the scene.
This approach allows for some variance in the duration of sub-events via the use of stochastic timed transitions. Our later work focused on constructing a Petri Net model of an event that is robust to the various kinds of uncertainty inherent to surveillance video data. In this approach a Petri Net modeling the temporal constraints of the event is constructed by a domain expert. The Petri Net is laid out as a “plan” where token(s) are advanced from an initial place node to a final “recognized” place node as external sub-events are observed in a manner that is consistent with the definition of the event. To deal with the fact that sub-events are, in general, only observed up to a particular certainty, we define a transformation from the Petri Net definition into a probabilistic model. Within this probabilistic model, well-studied approaches, such as the Particle Filter algorithm, afford elegant reasoning under uncertainty. Furthermore, online reasoning, which is required by many of our motivating scenarios, is also enabled. In many areas of the video event understanding domain, particularly surveillance applications, we are often interested in differentiating between similar events that differ only by the configuration of their constituent sub-events. Since these events exist within the same scene, they are limited by the same physical (or context) constraints. These context constraints are independent of the constraints that define the temporal ordering of sub-events. Our most recent work has focused on applying this intuition and constructing event models that model context and non-context constraints separately. This separation affords simpler event models, reduces the complexity of the probabilistic inference, and ultimately improves both recognition performance and efficiency.
One main contribution of this thesis to the research domain of video event understanding is the representation scheme that decouples context constraints from the temporal constraints that define the event. Another contribution is the set of proposed recognition algorithms, which are built on top of the Petri Net representation of the event domain. These algorithms are generalized to cope with the uncertainty inherent in video and afford elegant probabilistic reasoning that can be updated as new information becomes available.
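
As an illustration of the “plan” layout described in the abstract, the following is a minimal Python sketch of deterministic token propagation through a Petri Net. It is not the thesis's implementation: the class names, the sub-event labels (person_enters, person_leaves_bag, person_exits), and the place names are hypothetical, and the sketch leaves out stochastic timed transitions and the probabilistic (Particle Filter) reasoning entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Transition:
    name: str       # sub-event label that triggers this transition (hypothetical)
    inputs: tuple   # places that must each hold a token (each listed once)
    outputs: tuple  # places that receive a token when the transition fires


@dataclass
class PetriNet:
    marking: dict                                   # place name -> token count
    transitions: list = field(default_factory=list)

    def enabled(self, t):
        # A transition is enabled when every input place holds a token.
        return all(self.marking.get(p, 0) > 0 for p in t.inputs)

    def observe(self, sub_event):
        """Fire the first enabled transition labeled sub_event.

        Returns True if a transition fired, False otherwise.
        """
        for t in self.transitions:
            if t.name == sub_event and self.enabled(t):
                for p in t.inputs:
                    self.marking[p] -= 1
                for p in t.outputs:
                    self.marking[p] = self.marking.get(p, 0) + 1
                return True
        return False


def build_plan():
    # Hypothetical three-step event: the token starts in "start" and must
    # reach "recognized" by observing the sub-events in order.
    return PetriNet(
        marking={"start": 1},
        transitions=[
            Transition("person_enters", ("start",), ("inside",)),
            Transition("person_leaves_bag", ("inside",), ("bag_left",)),
            Transition("person_exits", ("bag_left",), ("recognized",)),
        ],
    )


def recognized(net):
    # The event counts as recognized once a token reaches the final place.
    return net.marking.get("recognized", 0) > 0
```

Calling observe with the three sub-events in order advances the token to the “recognized” place. A sub-event observed out of order fires nothing and leaves the marking unchanged, which is how this sketch enforces the temporal ordering.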
Copyright: The above paper is copyright by the Technion, Author(s), or others. Please contact the author(s) for more information.

Remark: Any link to this technical report should be to this page, rather than to the URL of the PDF files directly. The latter URLs may change without notice.

Computer science department, Technion