This blog post aims to provide an intuitive overview of our ICCV 2017 paper "Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions". Similar to our earlier blog post on pointly-supervised action localization, we aim to find the spatio-temporal location of actions in videos. This work attacks the problem from a different angle: its main focus is to investigate whether it is possible to localize actions purely using knowledge about objects. In other words, the aim of the paper is to localize actions without training examples, by way of transfer learning from objects. This paper is the first to fully investigate zero-shot action localization.
But why would we research whether it is possible to localize actions purely based on objects? Intuitively, objects place strict limitations on which actions are possible: try surfing on a bicycle. Furthermore, actions are inherently what an actor does to act upon the environment, and objects are the tools that enable us to change the environment. In the human visual system, we quickly learn to look beyond motion when learning to recognize and perform activities. As conceptual skills grow in children, they learn to categorize activities based on a complex interplay of spatial relations among objects. Similarly, we are eventually able to quickly decompose and distinguish activities by examining the objects in the scene. So within the human visual system, static objects serve as building blocks towards learning to recognize dynamic actions.
Given the role of objects for actions in the human visual system, and empowered by leaps in progress in object detection, we feel the time is ripe to investigate to what extent object knowledge enables automatic action localization.
Scoring and linking boxes without training examples
In our zero-shot formulation, we are given a set of test videos V and a set of action class names Z. We aim to classify each video to its correct class and to discover the spatio-temporal tubes encapsulating each action in all videos. To that end, we propose a spatial-aware embedding: action tubes scored from the interactions between actors and local objects. We present our embeddings in three steps: (i) gathering prior knowledge on actions, actors, objects, and their interactions, (ii) computing spatial-aware embedding scores for bounding boxes, and (iii) linking boxes into action tubes.
Prior knowledge. We aim to transfer knowledge about objects to actions. Therefore, we first need detectors to detect objects and actors locally in videos. Here, we use Faster R-CNN, pre-trained on MS-COCO. This network provides 80 detectors: 79 object detectors and an actor detector. Second, we need to match object names with action names. For this, we employ word2vec embeddings. This results in a 500-dimensional representation per word, where we use the cosine similarity between the average word vectors for objects and actions as the similarity measure.
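The word-based matching above can be sketched as follows. This is a minimal illustration, assuming a hypothetical `embeddings` dictionary that maps each word to its word2vec vector; the actual embeddings in the paper are 500-dimensional.

```python
import numpy as np

def word_vector(phrase, embeddings):
    # Average the word2vec vectors of all words in a (possibly multi-word) name.
    # `embeddings` is a hypothetical dict: word -> numpy vector.
    vecs = [embeddings[w] for w in phrase.lower().split() if w in embeddings]
    return np.mean(vecs, axis=0)

def semantic_similarity(object_name, action_name, embeddings):
    # Cosine similarity between the averaged object and action vectors.
    o = word_vector(object_name, embeddings)
    a = word_vector(action_name, embeddings)
    return float(np.dot(o, a) / (np.linalg.norm(o) * np.linalg.norm(a)))
```

In practice, this similarity is what links an unseen action name such as "kicking ball" to detectable objects such as "sports ball", without any action training examples.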
Third, we aim to incorporate the notion of spatial-awareness. Intuitively, this notion states that objects and actors have a preferred spatial relation when performing an action. E.g. a ball is near the feet of a person when kicking a ball and an umbrella is above a person walking in the rain. Here, we divide the spatial relation between object and actor into a 9-dimensional grid, representing the preposition in front of and the eight basic prepositions around the actor, i.e. left of, right of, above, below, and the four corners (e.g. above left). We gather these prior statistics from the same source as used to train the detectors themselves, namely MS-COCO. We gather and relate the box annotations in the training images for each object and average the relations. Spatial-aware examples for skateboard, bicycle, and traffic light are shown below.
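A simplified sketch of this 9-cell relation, assuming axis-aligned boxes in (x1, y1, x2, y2) format; the prior for an object is then simply the average of these relation vectors over all MS-COCO training annotations.

```python
import numpy as np

def spatial_relation(actor_box, object_box):
    # One-hot 3x3 grid encoding where the object's centre lies relative
    # to the actor box. The centre cell corresponds to 'in front of'
    # (object centre inside the actor box); the eight outer cells are
    # the basic prepositions around the actor (left of, above left, ...).
    ox = (object_box[0] + object_box[2]) / 2.0
    oy = (object_box[1] + object_box[3]) / 2.0
    ax1, ay1, ax2, ay2 = actor_box
    col = 0 if ox < ax1 else (2 if ox > ax2 else 1)   # left / inside / right
    row = 0 if oy < ay1 else (2 if oy > ay2 else 1)   # above / inside / below
    grid = np.zeros((3, 3))
    grid[row, col] = 1.0
    return grid.flatten()  # 9-dimensional relation vector
```

For a ball at the feet of a kicking person, for example, the mass of the relation vector falls in the cell directly below the actor.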
Scoring action boxes. We exploit our sources of prior knowledge to compute a score for the detected bounding boxes in all frames of each test video V. Given a bounding box b in frame F of video V, we define a score function that incorporates the presence of (i) actors, (ii) relevant local objects, and (iii) the preferred spatial relation between actors and objects. Below we show how all elements are combined into a single score.
The first part of the score function states that the score is proportional to the actor likelihood in the box; naturally, an action requires an actor. The second part constitutes a sum over all objects. The score for an object is first proportional to its semantic similarity to action Z. Second, the score is proportional to its presence, which is the maximum score over all neighbouring boxes Fn, with that score proportional to both the object likelihood and the spatial-awareness match. The spatial-awareness match is computed using the Jensen-Shannon divergence between the object-actor relations in the frame and the pre-computed prior spatial relations between object and actor.
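The combination above can be sketched in code. This is a hypothetical simplification (the exact weighting in the paper differs): `object_dets` maps each object name to the (probability, relation vector) pairs of its detections near the actor box, `action_sims` holds the word2vec similarities, and `spatial_priors` holds the 9-d prior relation histograms.

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    # Jensen-Shannon divergence between two discrete distributions (in nats).
    p = np.asarray(p, float) + eps
    q = np.asarray(q, float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def box_score(actor_prob, object_dets, action_sims, spatial_priors):
    # Sketch of the spatial-aware embedding score for one actor box:
    # actor likelihood plus, per object, similarity to the action times
    # the best (likelihood x spatial match) over neighbouring detections.
    score = actor_prob
    for obj, dets in object_dets.items():
        best = max(
            (p * (1.0 - jsd(rel, spatial_priors[obj])) for p, rel in dets),
            default=0.0,
        )
        score += action_sims[obj] * best
    return score
```

A low divergence between the observed relation and the prior (e.g. a ball below the actor for kicking) yields a spatial match close to one, boosting the box score.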
Linking action boxes. Given a spatial-aware embedding score for each bounding box in each frame of a video, we apply the score function to the boxes of all actor detections in each frame. We form tubes from the individual box scores by linking them over time. We link those boxes over time that by themselves have a high score from our spatial-aware embedding and have a high overlap amongst each other. This maximization problem is solved using dynamic programming with the Viterbi algorithm. Once we have a tube from the optimization, we remove all boxes from that tube and compute the next tube from the remaining boxes. We use the average box score as the final score for each tube.
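The linking step can be sketched as a standard Viterbi-style dynamic program; this is a minimal single-tube version, assuming one box score list per frame and a user-supplied IoU function, with `lam` as a hypothetical weight on the overlap term.

```python
def link_tubes(frame_boxes, frame_scores, iou, lam=1.0):
    # Find the highest-scoring path of boxes through the video, trading
    # per-box spatial-aware embedding scores against overlap between
    # boxes in consecutive frames.
    T = len(frame_boxes)
    best = [list(frame_scores[0])]      # best path score ending in each box
    back = [[-1] * len(frame_boxes[0])] # backpointers for path recovery
    for t in range(1, T):
        cur, ptr = [], []
        for box, score in zip(frame_boxes[t], frame_scores[t]):
            cands = [best[t - 1][j] + lam * iou(prev, box)
                     for j, prev in enumerate(frame_boxes[t - 1])]
            j = max(range(len(cands)), key=cands.__getitem__)
            cur.append(score + cands[j])
            ptr.append(j)
        best.append(cur)
        back.append(ptr)
    # Backtrack from the best final box to recover the tube.
    i = max(range(len(best[-1])), key=best[-1].__getitem__)
    tube = [i]
    for t in range(T - 1, 0, -1):
        i = back[t][i]
        tube.append(i)
    tube.reverse()
    return tube  # index of the chosen box in each frame
```

Repeatedly running this after removing the boxes of the found tube yields the multiple tubes per video described above.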
The result of the box scoring and linking is a set of tubes for an action in a set of test videos. This set can be ranked using the tube scores, which provides the final zero-shot action localization.
On global versus local objects
The zero-shot action tube algorithm outlined is based on the notion of localized objects that are relevant to the action. There is however a class of objects that are not relevant for action localization, but are relevant for action discrimination. To better understand this, let's examine a video frame for the action Kicking.
For the location of Kicking within the video, the ball object is intuitively critical. There are however many more objects in the scene, e.g. a goal, a commercial sign, and a corner post. While these objects do not directly pinpoint which actor is Kicking, they do provide contextual information. Contextual information from global objects does not help to localize actions within videos, but does help to discriminate action tubes from different videos. We therefore seek to also incorporate global object information in our approach. In short, we incorporate this information by computing object scores over whole videos using a network pre-trained on 13k ImageNet objects. These objects are matched with actions, again using word2vec. For an action, this results in a global score per video, which we add to the score of the computed tubes within the video.
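A minimal sketch of such a global score, under simplifying assumptions: frame-level object probabilities are averaged over the video, weighted by word2vec similarity to the action, and summed over the most relevant objects (the `top_k` cutoff is a hypothetical choice, not the paper's exact setting).

```python
import numpy as np

def global_action_score(frame_object_probs, action_sims, top_k=5):
    # frame_object_probs: (num_frames, num_objects) classifier probabilities.
    # action_sims: (num_objects,) word2vec similarity of each object name
    # to the action name. Returns one contextual score for the whole video.
    video_probs = np.asarray(frame_object_probs).mean(axis=0)
    weighted = video_probs * np.asarray(action_sims)
    return float(np.sort(weighted)[-top_k:].sum())
```

This score is constant per video, so it re-ranks tubes across videos without shifting their locations within a video.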
Main experimental results
The influence of local objects and spatial awareness
To evaluate our approach, we first perform several ablation studies on the core components: (i) the influence of local objects, (ii) the influence of spatial awareness, and (iii) how many objects are required. Here we show the results on the UCF Sports dataset for both zero-shot action localization and classification. An overview of the results is shown below.
The results above show that when only actor information is employed, random scores are achieved for both localization and classification. This is as expected, since all actions are actor-centric in current action localization datasets. When local object information is added, scores improve, and they improve further when incorporating spatial awareness. This shows that both elements positively impact the zero-shot action localization performance. Interestingly, we find that one object is preferred for localization, while five objects are preferred for classification. We found that when more objects are used, the discrimination between actions increases slightly, at the cost of the precise action location. Since a loss in location accuracy does not affect classification performance, more objects are preferred for classification.
The results above provide an indication that spatial-aware object embeddings allow for zero-shot action localization, but when does the localization go right and when does it go wrong? Below we show four localization results for different actions in UCF Sports.
For the two leftmost examples, we see that the correct action localization is provided for Skateboarding and Riding a horse, due to both the detection of relevant objects and the expected spatial awareness. For the two rightmost examples, the localization fails. For the action Kicking, the top object is tie according to the word2vec embeddings, while we expected the object ball to be most relevant. This is due to semantic ambiguity: the word tie represents the visual object bowtie among our detectors, but can also mean a draw in a football match. As a result, the referee is chosen as the localization for that video over the person kicking the ball; we found that this was because the actor probability was higher for the referee, which triggers the first part of our box scoring. For the action Swinging on a bar, we observe similar problems. The person swinging is missed here too, because the person detector cannot fully deal with the large variety of poses displayed by the person swinging in circles. We conclude that a more desirable semantic mapping could boost the zero-shot action localization results of our approach further.
Here, we also show some results of our approach compared to the state-of-the-art in action localization. We show again the results on UCF Sports, where we include global information as described earlier. The localization results for various overlap thresholds are shown below.
Only one paper previously attempted a zero-shot action localization experiment, namely Jain et al. Their notion was to embed spatio-temporal proposals with object scores and map these scores to action names. Implicitly, their assumption is that objects should be on the actor, while we provide a generalized view through spatial awareness on or around the actor. The results are also considerably in our favor, with an improvement by a factor of three to four at high overlap thresholds.
Compared to supervised approaches, we observe that the best supervised approach performs better than ours. We are however competitive with several supervised approaches, especially at high overlap thresholds. This result indicates the effectiveness of our zero-shot approach.
For all other experimental results, including the influence of global objects and a comparative evaluation on UCF-101, J-HMDB, and Hollywood2Tubes, we refer to our paper.
Zero-shot action localization retrieval
Lastly, we want to show that our approach opens the door to new retrieval tasks. Since we only require the name of the action of interest, it becomes possible to perform on-the-fly localized action retrieval. Below, we show the results of several queries on J-HMDB.
The examples above show that when directly specifying the object and spatial awareness, we can yield relevant localized results. What is more, we can additionally specify the expected size of the object relative to the actor in the query, to further specify the desired result. The middle and right examples show that a difference in size for a sports ball yields localization results with different ball sizes, respectively a baseball and a basketball. We hope that this opens the door to more research into retrieval beyond retrieving whole videos from a collection.
Objects provide a clear cue about which actions are occurring when and where in a video. This premise is intuitive, but was not yet investigated thoroughly. In this work, we provide an approach to localize, classify, and retrieve actions purely based on knowledge about objects, actors, and their spatial relations. This results in state-of-the-art zero-shot action localization and classification performance.
While the results are encouraging, they are far from practical and major improvements are required. We have already outlined limitations with respect to the semantic matches between objects and actions, but a more honest answer is that the approach of this work is limited and is but a first step in zero-shot action localization.
Actions are much more than objects and their spatial relation to actors. Taking the next step in zero-shot action localization requires incorporating temporal knowledge, most notably a spatio-temporal awareness between objects and actors. Think about the object basketball, for example. An actor dribbling a ball and an actor holding a ball perform different actions, but employ the same objects. What differentiates them is their temporal footprint. Beyond spatio-temporal object-actor relations, this line of work inherently cannot deal with actions that do not require objects, such as Running. Zero-shot localization of actions without objects requires spatio-temporal knowledge of the actors themselves.
Mettes, Pascal, and Cees G. M. Snoek. "Spatial-aware object embeddings for zero-shot localization and classification of actions." ICCV (2017).
Baillargeon, Renée, and Su-hua Wang. "Event categorization in infancy." Trends in Cognitive Sciences 6.2 (2002): 85-93.
Shipley, Thomas F., and Jeffrey M. Zacks. Understanding Events: From Perception to Action. Oxford University Press (2008).
Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS (2015).
Mikolov, Tomas, et al. "Distributed representations of words and phrases and their compositionality." NIPS (2013).
Mettes, Pascal, et al. "The ImageNet shuffle: Reorganized pre-training for video event detection." ICMR (2016).