Blog post - July ??, 2018

Localizing actions in videos using objects


This blog post aims to provide an intuitive overview of our ICCV oral paper "Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions" [1]. Similar to our earlier blog post on pointly-supervised action localization, we aim to find the spatio-temporal location of actions in videos. This work attacks the problem from a different angle: the main focus is to investigate whether it is possible to localize actions purely using knowledge about objects. In other words, the aim of the paper is to localize actions without training examples, by way of transfer learning from objects. This paper is the first to fully investigate zero-shot action localization.

But why would we research whether it is possible to localize actions purely based on objects? Intuitively, we can conceptualize that objects provide a strict limitation on which actions are possible. Try surfing on a bicycle. Furthermore, actions are inherently what an actor does to act upon the environment, and objects are the tools that enable us to change the environment. Within the human visual system, we quickly learn to look beyond motion when learning to recognize and perform activities. As conceptual skills grow in children, they learn to categorize activities based on a complex interplay of spatial relations among objects [2]. Similarly, we are eventually able to quickly decompose and distinguish activities by examining the objects in the scene [3]. So within the human visual system, static objects serve as building blocks towards learning to recognize dynamic actions.

Given the role of objects for actions in the human visual system and empowered by leaps in progress in object detection [4], we feel the time is ripe to investigate to what extent object knowledge enables automatic action localization.

Scoring and linking boxes without training examples

In our zero-shot formulation, we are given a set of test videos V and a set of action class names Z. We aim to classify each video to its correct class and to discover the spatio-temporal tubes encapsulating each action in all videos. To that end, we propose a spatial-aware embedding: action tubes scored by the interactions between actors and local objects. We present our embeddings in three steps: (i) gathering prior knowledge on actions, actors, objects, and their interactions, (ii) computing spatial-aware embedding scores for bounding boxes, and (iii) linking boxes into action tubes.

Prior knowledge. We aim to transfer knowledge about objects to actions. Therefore, we first need detectors to detect objects and actors locally in videos. Here, we use Faster R-CNN [4], pre-trained on MS-COCO. This network provides 80 detectors: 79 object detectors and an actor detector. Second, we need to match object names with action names. For this, we employ word2vec embeddings [5]. This results in a 500-dimensional representation per word, where we use the cosine similarity between the average word vectors for objects and actions as the similarity measure.
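The semantic matching above can be sketched as follows. This is a minimal sketch with toy 3-dimensional vectors standing in for the 500-dimensional word2vec embeddings; the function names and example vectors are our own.

```python
import numpy as np

def phrase_vector(name, embeddings):
    """Average the word vectors of a (possibly multi-word) name."""
    return np.mean([embeddings[w] for w in name.split()], axis=0)

def semantic_similarity(object_name, action_name, embeddings):
    """Cosine similarity between the averaged word vectors of an
    object name and an action name."""
    a = phrase_vector(object_name, embeddings)
    b = phrase_vector(action_name, embeddings)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-d embeddings; the actual word2vec vectors are 500-dimensional.
emb = {
    "horse": np.array([1.0, 0.2, 0.0]),
    "riding": np.array([0.9, 0.3, 0.1]),
    "a": np.array([0.1, 0.1, 0.1]),
}
print(semantic_similarity("horse", "riding a horse", emb))
```

With real word2vec vectors, the same routine ranks all 79 object names against each action name to select the most relevant objects.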

Third, we aim to incorporate the notion of spatial awareness. Intuitively, this notion states that objects and actors have a preferred spatial relation when performing an action. E.g., a ball is near the feet of a person kicking it, and an umbrella is above a person walking in the rain. Here, we divide the spatial relation between object and actor into a 9-dimensional grid, representing the preposition in front of and the eight basic prepositions around the actor, i.e. left of, right of, above, below, and the four corners (e.g. above left). We gather these prior statistics from the same source as used to train the detectors themselves, namely MS-COCO: for each object, we relate its box annotations to the actor boxes in the training images and average the resulting relations. Spatial-aware examples for skateboard, bicycle, and traffic light are shown below.

Figure 1. Examples of preferred spatial relations of objects relative to actors. In line with our intuition, skateboards are typically on or below the actor, while bicycles are typically to the left or right of actors and traffic lights are above the actors.
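The 9-cell binning can be sketched as follows; the exact binning used in the paper may differ, and the box coordinates, relation names, and function name here are illustrative.

```python
def spatial_relation(actor_box, object_box):
    """Bin an object's position relative to an actor into one of nine
    cells: in front of (overlapping the actor) or one of the eight
    surrounding directions. Boxes are (x1, y1, x2, y2); in image
    coordinates y grows downward, so smaller y means "above"."""
    ax1, ay1, ax2, ay2 = actor_box
    cx = (object_box[0] + object_box[2]) / 2.0  # object center x
    cy = (object_box[1] + object_box[3]) / 2.0  # object center y
    col = "left of" if cx < ax1 else "right of" if cx > ax2 else ""
    row = "above" if cy < ay1 else "below" if cy > ay2 else ""
    if not col and not row:
        return "in front of"
    return (row + " " + col).strip()

# A skateboard centered just under the actor:
print(spatial_relation((40, 20, 80, 120), (50, 125, 75, 135)))
```

Averaging these per-object relations over all MS-COCO training annotations yields the prior distributions visualized in Figure 1.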

Scoring action boxes. We exploit our sources of prior knowledge to compute a score for the detected bounding boxes in all frames of each test video V. Given a bounding box b in frame F of video V, we define a score function that incorporates the presence of (i) actors, (ii) relevant local objects, and (iii) the preferred spatial relation between actors and objects. Below we show how all elements are combined into a single score.

Figure 2. Our spatial-aware score function for a single frame, where the green box is determined to be most relevant for the action Riding a horse, due to a high combined score for actor likelihood, relevant object likelihood, and expected spatial awareness.

The first part of the score function states that the score is proportional to the actor likelihood in the box. Naturally, an action requires an actor. The second part constitutes a sum over all objects. The score for an object is first proportional to its semantic similarity to action Z. Second, the score is proportional to its presence, which is the maximum score over all neighbouring boxes Fn, with that score proportional to both the object likelihood and the spatial-awareness match. The spatial-awareness match is computed using the Jensen-Shannon divergence between the object-actor relations in the frame and the precomputed prior spatial relations between object and actor.
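To make this concrete, here is a minimal sketch of the score function, assuming 1 − JSD (base 2, so bounded by 1) as the spatial-awareness match and a simple product-and-sum combination; the paper's exact weighting may differ, and all function and variable names are our own.

```python
import numpy as np

def jsd(p, q):
    """Jensen-Shannon divergence (base 2, bounded by 1)."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log2(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def box_score(actor_prob, objects, similarity, prior_relation):
    """Score one actor box. `objects` maps an object name to a list of
    (object likelihood, observed 9-cell relation) pairs over neighbouring
    boxes; `similarity` maps it to its word2vec similarity with the
    action; `prior_relation` holds its preferred 9-cell distribution."""
    total = 0.0
    for name, detections in objects.items():
        presence = max(obj_prob * (1.0 - jsd(rel, prior_relation[name]))
                       for obj_prob, rel in detections)
        total += similarity[name] * presence
    return actor_prob * total
```

A box thus scores highly only when the actor likelihood, a relevant object's likelihood, and the expected spatial relation all agree, as in the green box of Figure 2.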

Linking action boxes. We apply the spatial-aware embedding score function to the boxes of all actor detections in each frame of a video. We then form tubes by linking boxes over time, selecting boxes that both have a high score from our spatial-aware embedding and a high overlap with each other across consecutive frames. This maximization problem is solved using dynamic programming with the Viterbi algorithm. Once we have a tube from the optimization, we remove all boxes from that tube and compute the next tube from the remaining boxes. We use the average box score as the final score for each tube.
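The linking step can be sketched as the following dynamic program, assuming one box is kept per frame and an unweighted sum of box scores and consecutive-frame overlaps; the paper's exact objective may weight these terms differently.

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    def area(r):
        return (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def link_boxes(frame_boxes, frame_scores):
    """Link one box per frame into a tube with the Viterbi algorithm,
    maximizing the summed box scores plus the overlap between boxes in
    consecutive frames. Returns the chosen box index per frame."""
    dp = [list(frame_scores[0])]   # best path score ending in each box
    back = []                      # backpointers per frame
    for t in range(1, len(frame_boxes)):
        row, ptr = [], []
        for j, b in enumerate(frame_boxes[t]):
            cands = [dp[-1][i] + iou(a, b)
                     for i, a in enumerate(frame_boxes[t - 1])]
            best = max(range(len(cands)), key=cands.__getitem__)
            row.append(frame_scores[t][j] + cands[best])
            ptr.append(best)
        dp.append(row)
        back.append(ptr)
    # Trace the best path backwards.
    j = max(range(len(dp[-1])), key=dp[-1].__getitem__)
    tube = [j]
    for ptr in reversed(back):
        j = ptr[j]
        tube.append(j)
    tube.reverse()
    return tube
```

Removing the linked boxes and re-running this routine yields the second-best tube, and so on, as described above.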

The result of the box scoring and linking is a set of tubes for an action in a set of test videos. This set can be ranked using the tube scores, which provides the final zero-shot action localization.

On global versus local objects

The zero-shot action tube algorithm outlined is based on the notion of localized objects that are relevant to the action. There is however a class of objects that are not relevant for action localization, but are relevant for action discrimination. To better understand this, let's examine a video frame for the action Kicking.

Figure 3. Visual example of the importance of both local and global objects. Local objects, such as the ball in this video frame, provide information about where an action is occurring. Global objects, such as a goal, commercial sign, or corner post, do not directly pinpoint actions, but do provide contextual information about which actions are likely to occur.

For the location of Kicking within the video, the ball object is intuitively critical. There are however many more objects in the scene, e.g. a goal, commercial sign, and corner post. While these objects do not directly pinpoint which actor is Kicking, they do provide contextual information. Contextual information from global objects does not help to localize actions within videos, but does help to discriminate action tubes from different videos. We therefore seek to also incorporate global object information in our approach. In short, we incorporate this information by computing object scores over whole videos using a network pre-trained on 13k ImageNet objects [6]. These objects are again matched with actions using word2vec. For an action, this results in a global score per video, which we add to the score of the computed tubes within the video.
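Combining tube scores with global context could look as follows; this is a sketch, the exact weighting in the paper may differ, and the function and parameter names are illustrative.

```python
def add_global_context(tube_scores, video_object_probs, similarity):
    """Add a video-level context score from global objects to every tube
    score in a video. `video_object_probs` maps an object name to its
    probability over the whole video (from the ImageNet-trained network);
    `similarity` maps it to its word2vec similarity with the action."""
    context = sum(similarity.get(name, 0.0) * prob
                  for name, prob in video_object_probs.items())
    return [score + context for score in tube_scores]
```

Since the context term is constant per video, it leaves the ranking of tubes within a video untouched but reorders tubes across videos, which is exactly the discriminative role of global objects described above.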

Main experimental results

The influence of local objects and spatial awareness
To evaluate our approach, we first perform several ablation studies on the core components: (i) the influence of local objects, (ii) the influence of spatial awareness, and (iii) how many objects are required. Here we show the results on the UCF Sports dataset for both zero-shot action localization and classification. An overview of the results is shown below.

Table 1. On UCF Sports we compare our spatial-aware object embedding to two other embeddings: using only the actors, and using actors with objects while ignoring their spatial relations. Our spatial-aware embedding is preferred for both localization (one object per action) and classification (five objects per action).

The results above show that when only actor information is employed, random scores are achieved for both localization and classification. This is as expected, since all actions are actor-centric in current action localization datasets. When local object information is added, scores improve, and they improve further when incorporating spatial awareness. This shows that both elements positively impact zero-shot action localization performance. Interestingly, we find that one object is preferred for localization, while five objects are preferred for classification. We found that when more objects are used, the discrimination between actions increases slightly, at the cost of precise action locations. Since a loss in localization accuracy does not hurt classification, more objects are preferred for classification.

Qualitative analysis
The results above provide an indication that spatial-aware object embeddings allow for zero-shot action localization, but when does the localization go right and when does it go wrong? Below we show four localization results for different actions in UCF Sports.

Figure 4. For Skateboarding and Riding a horse, relevant objects (blue) aid our localization (red). For Swinging on a bar and Kicking, incorrectly selected objects result in incorrect localizations. We expect that including more object detectors into our embedding will further improve results.

For the two leftmost examples, we see that the correct action localization is provided for Skateboarding and Riding a horse, due to both the detection of relevant objects and the expected spatial awareness. For the two rightmost examples, the localization fails. For the action Kicking, the top object is tie according to the word2vec embeddings, while we expected the object ball to be most relevant. This is due to semantic ambiguity: among our detectors the word tie represents the visual object bowtie, but it can also mean a draw in a football match. As a result, the referee is selected as the localization for that video instead of the person kicking the ball. We found that this was because the actor probability was higher for the referee, which triggers the first part of our box scoring. For the action Swinging on a bar, we observe similar problems. The person swinging is missed here too, because the person detector cannot fully deal with the large variety of poses displayed by the person swinging in circles. We conclude that a more suitable semantic mapping between objects and actions could boost zero-shot action localization results further.

Comparative evaluation
Here, we also show some results of our approach compared to the state-of-the-art in action localization. We show again the results on UCF Sports, where we include global information as described earlier. The localization results for various overlap thresholds are shown below.

Table 2. Comparison to the state-of-the-art for zero-shot action localization on UCF Sports. The only other zero-shot action localization approach is [7], which we outperform considerably. We also compare with several supervised alternatives. We are competitive, especially at high overlap thresholds.

Only one paper previously attempted a zero-shot action localization experiment, namely Jain et al. [7]. Their notion was to embed spatio-temporal proposals with object scores and map these scores to action names. Implicitly, their assumption is that objects should be on the actor, while we provide a generalized view through spatial awareness on or around the actor. The results are also considerably in our favor, with an improvement by a factor of three to four at high overlap thresholds.

Compared to supervised approaches, we observe that the best supervised approach still performs better. We are however competitive with several supervised approaches, especially at high overlap thresholds. This result indicates the effectiveness of our zero-shot approach.

For all other experimental results, including the influence of global objects and a comparative evaluation on UCF-101, J-HMDB, and Hollywood2Tubes, we refer to our paper [1].

Zero-shot action localization retrieval

Lastly, we want to show that our approach opens the door to new retrieval tasks. Since we only require the name of the action of interest, it becomes possible to perform on-the-fly localized action retrieval. Below, we show the results of several queries on J-HMDB.

Figure 5. Top retrieved results on J-HMDB given specified queries. Our retrieved localizations (red) reflect the prescribed object (blue), spatial relation, and object size.

The examples above show that when directly specifying the object and spatial awareness, we can yield relevant localized results. What is more, we can additionally specify the expected size of the object relative to the actor in the query, to further specify the desired result. The middle and right examples show that a difference in size for a sports ball results in localizations with different ball sizes, respectively a baseball and a basketball. We hope that this opens the door to more research into retrieval beyond retrieving whole videos from a collection.


Objects provide a clear cue about which actions are occurring when and where in a video. This premise is intuitive, but was not yet investigated thoroughly. In this work, we provide an approach to localize, classify, and retrieve actions purely based on knowledge about objects, actors, and their spatial relations. This results in state-of-the-art zero-shot action localization and classification, comfortably beating other zero-shot approaches.

While the results are encouraging, they are far from practical and major improvements are required. We have already outlined limitations with respect to the semantic matches between objects and actions, but a more honest answer is that the approach in this work is but a first step in zero-shot action localization.

Actions are much more than objects and their spatial relations to actors. Taking the next step in zero-shot action localization requires incorporating temporal knowledge, most notably a spatio-temporal awareness between objects and actors. Think about the object basketball, for example. An actor dribbling a ball and an actor holding a ball perform different actions, but employ the same object. What differentiates them is their temporal footprint. Beyond spatio-temporal object-actor relations, this line of work inherently cannot deal with actions that do not involve objects, such as Running. This requires spatio-temporal knowledge of the actors themselves.


[1] Mettes, Pascal, and Cees G. M. Snoek. "Spatial-aware object embeddings for zero-shot localization and classification of actions." ICCV (2017).
[2] Baillargeon, Renée, and Su-hua Wang. "Event categorization in infancy." Trends in cognitive sciences 6.2 (2002): 85-93.
[3] Shipley, Thomas F., and Jeffrey M. Zacks. Understanding events: From perception to action. Oxford University Press (2008).
[4] Ren, Shaoqing, et al. "Faster r-cnn: Towards real-time object detection with region proposal networks." NIPS (2015).
[5] Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS (2013).
[6] Mettes, Pascal, et al. "The imagenet shuffle: Reorganized pre-training for video event detection." ICMR (2016).
[7] Jain, Mihir, et al. "Objects2action: Classifying and localizing actions without any video example." ICCV (2015).