Behavior Classification by Eigen-decomposition of Periodic Motions
In this research we present a new framework for motion-based classification of moving nonrigid objects. The technique is based on the analysis of changing appearance of moving objects. We show how periodic motions can be represented by a small number of eigenshapes that capture the whole dynamic mechanism of the motion. Spectral decomposition of a silhouette of a moving object serves as a basis for behavior classification by principle component analysis.
Introduction
Futurism is a movement in art, music, and literature that began in Italy at about 1909 and marked especially by an effort to give formal expression to the dynamic energy and movement of mechanical processes. A typical example is the ‘Dynamism of a Dog on a Leash’ by Giacomo Balla, who lived during the years 1871–1958 in Italy, see
Fig.1In this painting one could see how the artist captures in one still image the periodic walking motion of a dog on a leash.

Fig.1. ‘Dynamism of a Dog on a Leash’ 1912, by Giacomo Balla, Albright-Knox Art Gallery, Buffalo.
Following a similar philosophy, we show how periodic motions can be represented by a small number of eigenshapes that capture the whole dynamic mechanism of periodic motions. Singular value decomposition of a silhouette of an object serves as a basis for behavior classification by principle component analysis. Fig. 2 present a running horse video sequence and its eigenshape decomposition. One can see the similarity between the first eigenshapes Fig. 2(c and d), and another futurism style painting “The Red Horseman” by Carlo Carra —Fig. 2(e).
![]()
(a)

(b)

(c) (d) (e)
Fig.2. (a) Running horse video sequence, (b) first 10 eigenshapes, (c,d) first and second eigenshapes enlarged, (e) ‘The Red Horseman’, 1914, by Carlo Carra, Civico Museo d’Arte Contemporanea, Milan.
Our approach
Our basic assumption is that for any given class of moving objects, like humans, dogs, cats, and birds, the apparent object view in every phase of its motion can be encoded as a combination of several basic body views or configurations. Assuming that a living creature exhibits a pseudo-periodic motion, one motion period can be used as a comparable information unit. The basic views are then extracted from a large training set. An observed sequence of object views collected from one motion period is projected onto this basis. This forms a parameterized representation of the object’s motion that can be used for classification.
Segmentation and tracking
Our approach is based on the analysis of deformations of the moving body. Therefore the accuracy of the segmentation and tracking algorithm in finding the target outline is crucial for the quality of the final result. This rules out a number of available or easy-to-build tracking systems that provide only a center of mass or a bounding box around the target and calls for more precise and usually more sophisticated solutions. Therefore, we decided to use the geodesic active contour approach
and specifically the ‘fast geodesic active contour’ method . The segmentation problem is expressed as a geometric energy minimization.Fig. 3
shows some results of moving object segmentation and tracking using the proposed method. Contours can be represented in various ways. Here, in order to have a unified coordinate system and be able to apply a simple algebraic tool, we use the implicit representation of a simple closed curve as its binary image. That is, the contour is given by an image for which the exterior of the contour is black while its interior is white.The boundary contour of the moving non-rigid object is computed efficiently and accurately by the fast geodesic active contours. After normalization, the implicit representation of a sequence of silhouette contours given by their corresponding binary images, is used for generating eigenshapes for the given motion. Singular value decomposition produces the eigenshapes that are used to analyze the sequence. We show examples of object as well as behavior classification based on the eigendecomposition of the sequence.

Fig.3. Non-rigid moving object segmentation and tracking
Periodicity analysis
We assume that the majority of non-rigid moving objects are self-propelled alive creatures, whose motion is almost periodic. Thus, one motion period, like a step of a walking man or a rabbit hop, can be used as a natural unit of motion. Extracted motion characteristics can by normalized by the period size. We address the problem using global characteristics of motion such as moving object contour deformations and the trajectory of the center of mass. By running frequency analysis on such 1-D contour metrics as the contour area, velocity of the center of mass, principal axes orientation, etc. we can detect the basic period of the motion. Figs. 4 and 5 present global motion characteristics derived from segmented moving objects in two sequences. One can clearly observe the common dominant frequency in all three graphs.
Fig.4.Global motion
characteristics measured for walking man sequence.

The period can also be estimated in a straightforward manner by looking for the frame where the external object contour best matches the object contour in the current frame. Fig. 6 shows the deformations of a walking man contour during one motion period (step). Samples from two different steps are presented and each vertical pair of frames is phase synchronized. One can clearly see the similarity between the corresponding contours. An automated contour matching can be performed in a number of ways, e.g. by comparing contour signatures or by looking at the correlation between the object silhouettes in different frames as we chose to do in our work. Fig. 7 shows four graphs of inter-frame silhouette correlation values measured for four different starting frames taken within one motion period.

Fig.6. Deformations of a walking man contour during one motion period (step). Two steps synchronized in phase are shown. One can see the similarity between contours in corresponding phases.
Fig. 7. Inter-frame correlation between object silhouettes. Four graphs show the correlation measured for four initial frames.
Frame sequence alignment
One of the most desirable features of any classification system is the invariance to a set of possible input transformations. As the input in our case is not a static image, but a sequence of images, the system should be robust to both spatial and temporal variations. In the section below we describe the methods for spatial and temporal alignment of input video sequence bringing it to a canonical, comparable form allowing later classification.
Spatial alignment
We want to compare every object to a single pattern regardless of its size, viewing distance and other imaging conditions. This is achieved by cropping a square bounding box around the center of mass of the tracked target silhouette and re-scaling it to a predefined size (see Fig. 8).

Fig.8. Scale alignment. A minimal square bounding box around the center of the segmented object silhouette is cropped and re-scaled to form a 50 x 50 binary image.
Temporal alignment
A good estimate of the motion period allows us to compensate for motion speed variations by re-sampling each period subsequence to a predefined duration. This can be done by interpolation between the binary silhouette images themselves or between their parameterized representation as explained below. Fig. 9 presents an original and re-sampled one-period subsequence after scaling from 11 to 10 frames. Temporal shift is another issue that has to be addressed in order to align the phase of the observed one-cycle sample and the models stored in the training base.


Fig.9. Temporal alignment. Top: original 11 frames of one period subsequence. Bottom: re-sampled 10 frames sequence.
Let us assume that in the training set all the sequences are accurately aligned. Then we find the temporal shift of a test sequence by looking for the starting frame that best matches the generalized (averaged) starting frame of the training samples, as they all look alike. Fig. 10 shows (a) the reference starting frame taken as an average over the temporally aligned training set, (b) a re-sampled single-period test sequence and, (c) the correlation between the reference starting frame and the test sequence frames. The maximal correlation is achieved at the seventh frame, therefore the test sequence is aligned by cyclically shifting it 7 frames to the left.

(a) (b)

(c)
Fig.10. Temporal shift alignment: (a) average starting frame of all the training set sequences, (b) temporally shifted single-cycle test sequence, (c) correlation between the reference starting frame and the test sequence frames.
Parameterization

Fig.11. The principal basis for the ‘dogs and cats’ training set formed of 20 first principal component vectors.
In order to reduce the dimensionality of the problem we first project the object image in every frame onto a low dimensional base. The base is chosen to represent all possible appearances of objects that belong to a certain class, like humans, four-leg animals, etc. Let n be number of frames in the training base of a certain class of objects and M be a training samples matrix, where each column corresponds to a spatially aligned image of a moving object written as a binary vector. In our experiments we use 5050 normalized images, therefore, M is a 2500
x n matrix. The correlation matrix MMT is decomposed using Singular Value Decomposition as MMT =USV T, where U is an orthogonal matrix of principal directions and the S is a diagonal matrix of singular values. In practice, the decomposition is performed on MTM, which is computationally more efficient. The principal basis {Ui , i=1, . . . , k} for the training set is then taken as k columns of U corresponding to the largest singular values in S. Fig. 11 presents a principal basis for the training set formed of 800 sample images collected from more than 60 sequences showing dogs and cats in motion. The basis is built by taking the k = 20 first principal component vectors. We build such representative bases for every class of objects. Then, we can distinguish between various object classes by finding the basis that best represents a given object image in the distance to the feature space (DTFS) sense. Fig. 12 shows the distances from more than 1000 various images of people, dogs and cats to the feature space of people and to that of dogs and cats. In all cases, images of people were closer to the people feature space than to the animals’ feature space and vise a versa. This allows us to distinguish between these two classes.

Fig.12. Distances to the ‘people’ and ‘dogs and cats’ feature spaces from more than 1000 various images of people, dogs and cats.
Fig. 13
shows several normalized moving object images from the original sequence and their reconstruction from a parameterized representation by back-projection to the image space.


5.23 4.66 4.01 3.96 3.42 4.45 4.64 6.10 5.99 6.94 5.89
Fig. 13 Image sequence parameterization. Top: 11 normalized target images of the original sequence. Bottom: the same images after the parameterization using the principal basis and back-projecting to the image basis. The numbers are the norms of the differences between the original and the back-projected images.
The numbers below are the norms of differences between the original and the back-projected images. These norms can be used as the DTFS estimation. Now, we can use these parameterized representations to distinguish between different types of motion. The reference base for the activity recognition consists of temporally aligned one-period subsequences. The moving object silhouette in every frame of these subsequences is represented by its projection to the principal basis. In the following experiment we processed a number of sequences of dogs and cats in various types of locomotion. From these sequences we extracted 33 samples of walking dogs, 9 samples of running dogs, 9 samples with galloping dogs and 14 samples of walking cats. Let S200xm be the matrix of projected single-period subsequences, where m is the number of samples and the SVD of the correlation matrix is given by SST = USSSVS. In Fig. 14 we depict the resulting feature points projected for visualization to the 3-D space using the three first principal directions {USi : i =1, . . . , 3}, taken as the column vectors of US corresponding to the three largest eigen-values in SS. One can easily observe four separable clusters corresponding to the four groups. Another experiment was done over the ‘people’ class of images. Fig. 15 presents feature points corresponding to several sequences showing people walking and running parallel to the image plane and running at oblique angle to the camera. Again, all three groups lie in separable clusters. The classification can be performed, for example, using the k-nearest-neighbor algorithm. We conducted the ‘leave one out’ test for the dogs set above, classifying every sample by taking them out from the training set one at a time. The three-nearest-neighbors strategy resulted in 100% success rate.

Fig. 14. Feature points extracted from the sequences with walking, running and galloping dogs and walking cats and projected to the 3D space for visualization.

Fig. 15. Feature points extracted from the sequences showing people walking and running parallel to the image plane and at 45◦
angle to the camera. Feature points are projected to the 3-D space for visualization.Publication
Behavior Classification by Eigen-decomposition of Periodic Motions. Pattern Recognition vol.38, (7), pp. 1033 1043, July 2005. R.Goldenberg, R.Kimmel, E.Rivlin and M.Rudzsky.