Blog post - June 15, 2018

Pointly-supervised Action Localization


Action localization in videos entails detecting the spatio-temporal locations of action instances in a set of videos. By a spatio-temporal location we mean a set of consecutive bounding boxes in video frames that form a tube. This task is intuitively quite difficult. For instance, the search space of all possible tubes in a single video is already intractable to enumerate. Therefore, works in action localization typically propose ways to reduce the search space of actions in videos, e.g. by first searching in space and then in time, or by generating a limited set of spatio-temporal proposals, with the hope that there is at least one matching proposal for each action in each video.
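
To get a feeling for the size of this search space, the following is a minimal back-of-the-envelope sketch. It assumes boxes with integer pixel corners and a tube that picks one box independently in each of a fixed number of frames; the function names and the 320x240 frame size are illustrative choices, not taken from any dataset. Because it fixes the temporal extent, it undercounts the full space.

```python
import math


def boxes_per_frame(width: int, height: int) -> int:
    """Number of axis-aligned boxes with integer corners in one frame.

    A box is defined by choosing two distinct vertical grid lines out of
    width + 1 and two distinct horizontal grid lines out of height + 1.
    """
    return (width * (width + 1) // 2) * (height * (height + 1) // 2)


def log10_tubes(width: int, height: int, num_frames: int) -> float:
    """log10 of the number of tubes spanning exactly num_frames frames,
    assuming (hypothetically) one freely chosen box per frame."""
    return num_frames * math.log10(boxes_per_frame(width, height))


if __name__ == "__main__":
    print(boxes_per_frame(320, 240))             # boxes in a single frame
    print(round(log10_tubes(320, 240, 100), 1))  # order of magnitude for 100 frames
```

Even a single small frame already contains over a billion candidate boxes, which is why enumeration over tubes is not an option.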

As is common in supervised learning in general, most works in action localization assume that the annotations during training mimic the output required during testing. In this context, it means that for each action instance, a bounding box needs to be annotated in each video frame to form ground truth tubes. It goes without saying that this level of annotation creates a heavy burden for annotators. Annotating bounding boxes is slow and cumbersome. As such, current datasets in action localization are typically small and short, which limits the extent to which we can research action localization.

To overcome the current annotation burden, we have investigated a new annotation paradigm for action localization, namely point supervision. The premise is simple: replace each box annotation with a point on the center of the action. With points we make one major concession: we discard the spatial extent of actions and only retain the spatial location. The temporal extent of actions is preserved with point annotations. The primary advantage of points is naturally that the annotation burden is vastly reduced. Points can be easily obtained by clicking on the action and tracing this click through the video.
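
As a small illustration of what is kept and what is discarded, the sketch below converts a box tube into a point annotation by taking the center of each box. Using box centers as a stand-in for human clicks is an assumption made for this example, and the `Tube` and `PointTrack` types are hypothetical names.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]              # (x, y)
Tube = Dict[int, Box]                    # frame index -> box
PointTrack = Dict[int, Point]            # frame index -> point


def box_center(box: Box) -> Point:
    """Center of an axis-aligned box."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def tube_to_points(tube: Tube) -> PointTrack:
    """Keep one point per annotated frame; the box size is thrown away,
    while the set of annotated frames (the temporal extent) is kept."""
    return {frame: box_center(box) for frame, box in tube.items()}


if __name__ == "__main__":
    tube = {10: (0.0, 0.0, 10.0, 20.0), 11: (2.0, 0.0, 12.0, 20.0)}
    print(tube_to_points(tube))
```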

This blog post outlines how we have tackled the main challenge of point supervision, which is the lack of knowledge about the spatial extent of actions in videos. The blog post first discusses spatio-temporal proposals, a basic component of our approach. Then, we outline how to train action localization models with points and proposals. Using these models, we ask the question: can points replace boxes for action localization with spatio-temporal proposals? Based on the results, we outline the open research items for future work. A more formal discussion on pointly-supervised action localization can be found in [1,2].

A short introduction to spatio-temporal action proposals

Our approach employs unsupervised spatio-temporal proposals to provide a set of candidate action locations in videos. But what are spatio-temporal proposals actually and how are they generated? Spatio-temporal proposals are an extension of object proposals to the video domain. Object proposal methods aim to reduce the set of possible bounding boxes to evaluate in an image. Well-known examples include Selective Search [3], EdgeBoxes [4], and BING [5].

In our work, we rely on the spatio-temporal action proposals from APT [6]. The approach roughly works as follows: (1) Compute dense trajectories, which are local spatio-temporal tracks with corresponding spatial and temporal features. (2) Perform hierarchical clustering on the trajectories for each feature and each combination of features. (3) Retain only clusters that merge non-trivial prior clusters. The retained clusters form the set of spatio-temporal proposals of a video. APT typically outputs in the order of 1e3 to 1e4 proposals, with the hope that at least one proposal matches well with the action. The code is available here. Below, there is an example of a spatio-temporal proposal (blue) that has a high overlap with the ground truth location (red).
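
The clustering steps can be illustrated with a toy sketch. This is not the APT implementation: it uses a single one-dimensional feature per trajectory, plain single-linkage clustering, and it interprets "non-trivial prior clusters" as clusters that each contain more than one trajectory, which is an assumption. In a real pipeline, every retained cluster would then be turned into a tube by drawing a box around its trajectories in each frame.

```python
from itertools import combinations
from typing import FrozenSet, List, Sequence


def single_linkage(a: FrozenSet[int], b: FrozenSet[int],
                   features: Sequence[float]) -> float:
    """Distance between two clusters: the closest pair of members."""
    return min(abs(features[i] - features[j]) for i in a for j in b)


def cluster_proposals(features: Sequence[float]) -> List[FrozenSet[int]]:
    """Agglomerative clustering over trajectory indices.

    Returns the merged clusters for which both parents were non-trivial
    (more than one trajectory); this reading is an assumption.
    """
    clusters = [frozenset([i]) for i in range(len(features))]
    proposals = []
    while len(clusters) > 1:
        # Pick the closest pair of current clusters and merge it.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: single_linkage(pair[0], pair[1], features))
        merged = a | b
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        if len(a) > 1 and len(b) > 1:
            proposals.append(merged)
    return proposals


if __name__ == "__main__":
    # Six toy trajectories described by one feature value each.
    for proposal in cluster_proposals([0, 1, 10, 11, 30, 32]):
        print(sorted(proposal))
```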

Figure 1. Visual example of a spatio-temporal proposal (blue) generated with APT. Figure courtesy of [6].

Action localization with point supervision

Equipped with point supervision and spatio-temporal proposals, the main question becomes how to effectively perform action localization. The traditional approach with spatio-temporal proposals is to first train a model using features from the ground truth box tubes. This model is then used to score and select the best spatio-temporal proposal(s) in test videos. In our case, however, we no longer have ground truth box tubes.

Our solution starts with an observation: spatio-temporal proposals do not need to be restricted to testing. Since the proposals are typically unsupervised, nothing is stopping us from using them during training. The proposals are generated with the goal that at least one proposal in a video has a high match with the action. As such, our goal is to find the top proposals in training videos, where points will eventually be our guide.

Multiple Instance Learning is not enough. With spatio-temporal proposals computed for training videos, the action localization problem can be cast as another well-known problem: Multiple Instance Learning. Multiple Instance Learning is a weakly-supervised learning setting, where labels are only provided for collections of examples (bags), rather than individual examples (instances). In the context of action localization, the bags are the videos with corresponding action labels, while the spatio-temporal proposals are the instances. As a result, we can apply existing Multiple Instance Learning algorithms to our problem to yield our models, akin to the work of Cinbis et al. [7], who performed a similar analysis in the context of object detection in images with object proposals. When we performed a similar analysis, we found that standard Multiple Instance Learning for action localization yields models that can discriminate proposals from different actions well, but localize the actions poorly within the videos. This is not a big surprise, as Multiple Instance Learning is designed to differentiate bags (i.e. videos) only. In action localization, we additionally care about the precise spatio-temporal extent of the actions at the instance level. As such, we have investigated how to incorporate minimal additional supervision in the form of point supervision to extend the Multiple Instance Learning framework.
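
The bag and instance view can be made concrete with a minimal sketch. It assumes a linear scoring model and the common Multiple Instance Learning convention that a bag is scored by its best instance; the function names and toy feature vectors are illustrative only.

```python
from typing import List, Sequence, Tuple


def dot(weights: Sequence[float], features: Sequence[float]) -> float:
    """Inner product between a weight vector and a feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def score_video(weights: Sequence[float],
                proposals: List[Sequence[float]]) -> Tuple[float, int]:
    """Score a video (bag) by its best proposal (instance).

    Returns the bag score and the index of the selected proposal.
    """
    scores = [dot(weights, p) for p in proposals]
    best = max(range(len(scores)), key=scores.__getitem__)
    return scores[best], best


if __name__ == "__main__":
    weights = [1.0, -1.0]
    proposals = [[0.2, 0.8], [0.9, 0.1], [0.5, 0.5]]
    print(score_video(weights, proposals))
```

Note that the video label is satisfied as soon as any proposal scores high, which is one way to see why video labels alone put little pressure on selecting the proposal with the best spatial fit.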

An overlap measure between points and proposals. With points as additional supervision, we propose to use them as a prior on the spatio-temporal proposals within the Multiple Instance Learning framework. We therefore need a rough quality measure of spatio-temporal proposals given a set of point annotations. Given that points are annotated on the center of the action, such a measure should promote proposals whose centers are located close to the point annotations. Here, we define our overlap function as the average match between the point annotations and a spatio-temporal proposal. The match is defined as one minus the ratio between the distance of the point to the proposal center and the distance between the center and the nearest edge of the proposal. When a point and center are at the same location, the match is one. This match decreases linearly as the distance to the center increases and becomes zero once points are outside proposals.
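
The match described above can be sketched as follows. This follows the verbal description in this post rather than the formal definition in [1,2]; clamping the match at zero, and giving a zero match to annotated frames that the proposal does not cover, are assumptions of this sketch.

```python
import math
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)
Point = Tuple[float, float]              # (x, y)


def point_match(point: Point, box: Box) -> float:
    """One minus (distance point-to-center / distance center-to-nearest-edge),
    clamped to [0, 1]; zero for points outside the box."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    if not (x_min <= x <= x_max and y_min <= y <= y_max):
        return 0.0
    cx, cy = (x_min + x_max) / 2.0, (y_min + y_max) / 2.0
    nearest_edge = min(cx - x_min, cy - y_min)  # half of the shorter side
    if nearest_edge <= 0:
        return 0.0
    distance = math.hypot(x - cx, y - cy)
    return max(0.0, 1.0 - distance / nearest_edge)


def point_overlap(points: Dict[int, Point], tube: Dict[int, Box]) -> float:
    """Average match over all annotated frames."""
    if not points:
        return 0.0
    total = sum(point_match(p, tube[f]) if f in tube else 0.0
                for f, p in points.items())
    return total / len(points)


if __name__ == "__main__":
    box = (0.0, 0.0, 10.0, 10.0)
    points = {0: (5.0, 5.0), 1: (7.5, 5.0)}
    print(point_overlap(points, {0: box, 1: box}))
```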

The overlap measure requires one additional regularization component. Since actions are typically near the center of the video and since the center of large proposals is also near the center of the video, the measure has a bias towards selecting large proposals, leading to poor localization. As such, we found that we have to subtract the squared size of the proposal relative to the video from the match to yield good priors. More formal definitions and equations of the overlap measure can be found in [1,2].
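
A minimal sketch of this size regularization is given below. How the relative size is measured is an assumption here: the summed box area of the proposal is divided by the total pixel volume of the video. The exact definition is in [1,2]; the function names are hypothetical.

```python
from typing import Dict, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def relative_size(tube: Dict[int, Box], frame_width: float,
                  frame_height: float, num_frames: int) -> float:
    """Summed box area of the tube divided by the video volume (assumption)."""
    volume = sum((x1 - x0) * (y1 - y0) for x0, y0, x1, y1 in tube.values())
    return volume / (frame_width * frame_height * num_frames)


def regularized_overlap(match: float, tube: Dict[int, Box], frame_width: float,
                        frame_height: float, num_frames: int) -> float:
    """Average point match minus the squared relative proposal size."""
    size = relative_size(tube, frame_width, frame_height, num_frames)
    return match - size ** 2


if __name__ == "__main__":
    full_frame = {f: (0.0, 0.0, 100.0, 100.0) for f in range(4)}
    quarter = {f: (25.0, 25.0, 75.0, 75.0) for f in range(4)}
    # A proposal covering the whole video loses its entire match score.
    print(regularized_overlap(1.0, full_frame, 100.0, 100.0, 4))
    # A tighter proposal is penalized far less.
    print(regularized_overlap(0.8, quarter, 100.0, 100.0, 4))
```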

With prior scores for spatio-temporal proposals given points, we extend the max-margin Multiple Instance Learning objective of Andrews et al. [8]. This objective is optimized in an EM-style approach: first the selected instances are fixed while the model parameters are updated, and then the model parameters are fixed and used to select the new best instances. In our case, the overlap priors are used in both the initialization and the Maximum Likelihood step. This results in an optimization that jointly aims to discriminate proposals from different actions and follow the spatio-temporal extent of the actions within the videos. Again, more technical details can be found in [1,2].
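
The alternating scheme can be sketched on toy data as follows. This is a simplified stand-in, not the objective of [8] or the method of [1,2]: the max-margin model update is replaced by a difference-of-means classifier to keep the example short, and adding the prior to the classifier score during selection is an assumption about how the two are combined.

```python
from typing import List, Sequence, Tuple

Instance = Tuple[Sequence[float], float]  # (feature vector, point prior)


def dot(w: Sequence[float], x: Sequence[float]) -> float:
    return sum(wi * xi for wi, xi in zip(w, x))


def mean(vectors: List[Sequence[float]]) -> List[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def fit_mean_difference(positives, negatives) -> List[float]:
    """Stand-in for the max-margin update: positive mean minus negative mean."""
    return [p - n for p, n in zip(mean(positives), mean(negatives))]


def train_with_priors(pos_videos: List[List[Instance]],
                      negatives: List[Sequence[float]],
                      iterations: int = 5):
    """Alternate between updating the model and re-selecting one proposal
    per positive video, guided by the point priors."""
    # Initialization: per video, pick the proposal with the highest prior.
    selected = [max(range(len(v)), key=lambda i: v[i][1]) for v in pos_videos]
    weights: List[float] = []
    for _ in range(iterations):
        # Step 1: selected instances fixed, update the model.
        positives = [v[i][0] for v, i in zip(pos_videos, selected)]
        weights = fit_mean_difference(positives, negatives)
        # Step 2: model fixed, re-select using classifier score plus prior.
        reselected = [max(range(len(v)),
                          key=lambda i: dot(weights, v[i][0]) + v[i][1])
                      for v in pos_videos]
        if reselected == selected:
            break
        selected = reselected
    return weights, selected


if __name__ == "__main__":
    videos = [[([1.0, 0.0], 0.9), ([0.0, 1.0], 0.1)],
              [([0.9, 0.1], 0.8), ([0.1, 0.9], 0.2)]]
    background = [[0.0, 1.0], [0.1, 0.9]]
    print(train_with_priors(videos, background))
```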

Can points replace boxes?

For our pointly-supervised action localization, we have performed various experiments and analyses. In this blog, we focus on our main experiment: How effective is pointly-supervised action localization with spatio-temporal proposals? Implementation details, proposal representations, and parameter settings are skipped in the blog, but are provided in [1,2]. Without further ado, the figure below provides the results of this main experiment on two common action localization datasets: UCF Sports and UCF-101.

Figure 2. Localization results on UCF Sports and UCF-101 for the mAP and AUC measures, using full box supervision, point supervision, and using video-label supervision. Across both datasets and all overlap thresholds, point-supervision is as effective as box-supervision, while outperforming video-label supervision. We conclude that spatial annotations are vital and that points provide sufficient support for effective localization.

In Figure 2, our approach is denoted with the dark orange bars. We compare our approach to three baselines with identical settings and optimization. They only differ in level of supervision. Our main findings are as follows: (1) Our pointly-supervised approach yields similar results to full box supervision, confirming our main hypothesis. (2) Our pointly-supervised approach yields similar results to training on the best possible proposal, highlighting the effectiveness of points as priors in our extended Multiple Instance Learning framework. (3) Using standard Multiple Instance Learning (i.e. video labels only) is not nearly as effective, reiterating the importance of including point supervision for action localization.

Below, we highlight a few qualitative examples of discovered actions in UCF Sports videos using points (red) compared to the ground truth (blue). These results show that the fit is high for simple actions such as walking. For actions such as lifting and horse riding, we find that our pointly-supervised approach learns the spatial extent of actions discriminatively and therefore focuses on the most invariant spatial component of the action, such as the barbell and the horse.

Figure 3. Qualitative examples on UCF Sports for our approach (red) compared to the ground truth (blue). For non-trivial actions, our pointly-supervised approach learns the spatial extent of actions discriminatively, focusing e.g. on the barbell (lifting) and the horse (horse riding), while the ground truth annotations focus on the person.

For more analyses, error diagnoses, upper bounds on the speed-up for point supervision, robustness to noisy and sparse annotations, and a comparison to the state-of-the-art, we refer the reader to [1].

What is holding us back?

While the results above are encouraging, there is a caveat. If we compare the achieved results to the current state-of-the-art, we come up short. A primary difference is that the state-of-the-art has recently deviated from the norm of spatio-temporal proposals for action localization. Here, we evaluate the influence of the spatio-temporal proposals upon which our approach is built. Spatio-temporal proposals optimize recall, i.e., for a video, at least one proposal should have a high overlap with the ground truth action location. An inconvenient side effect of this requirement is that each video yields many proposals that have a low overlap, making the selection of the best proposal during testing a needle-in-a-haystack problem. We quantify this side effect by investigating the influence of the high ratio of low-overlap proposals during training and testing. During both training and testing, we add the oracle ground truth tube to the proposals. We furthermore add a parameter ε, which controls the fraction of proposals with an overlap below 0.5 that are removed. We train models with varying values for ε. We evaluate this oracle experiment on UCF Sports.
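
The proposal filtering in this oracle experiment can be sketched as below. Which low-overlap proposals are removed is not specified in this post, so the sketch removes a random subset of them; that choice, the fixed seed, and the function name are assumptions. The ground truth tube can be represented by appending an overlap of 1.0 to the list before filtering.

```python
import random
from typing import List, Sequence


def filter_low_quality(overlaps: Sequence[float], epsilon: float,
                       threshold: float = 0.5, seed: int = 0) -> List[int]:
    """Return the indices of proposals that are kept after removing a
    fraction epsilon of the proposals with overlap <= threshold."""
    low = [i for i, o in enumerate(overlaps) if o <= threshold]
    high = [i for i, o in enumerate(overlaps) if o > threshold]
    num_removed = round(epsilon * len(low))
    # Random choice of which low-quality proposals to drop (assumption).
    removed = set(random.Random(seed).sample(low, num_removed))
    return sorted(high + [i for i in low if i not in removed])


if __name__ == "__main__":
    # Overlap of each proposal with the ground truth; the trailing 1.0
    # stands for the oracle ground truth tube added to the proposals.
    overlaps = [0.1, 0.2, 0.3, 0.4, 0.6, 0.9, 1.0]
    for epsilon in (0.0, 0.5, 1.0):
        print(epsilon, filter_low_quality(overlaps, epsilon))
```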

Figure 4. Influence of spatio-temporal proposal quality on UCF Sports. The baseline corresponds to the result from the first experiment. For the others, the ground truth location is added as one of the proposals during testing. Here, ε states the fraction of low quality (overlap ≤ 0.5) proposals that are removed. Action localization performance increases when large amounts of low quality proposals are removed. We conclude that better quality action proposals will have a positive impact on pointly-supervised action localization.

The figure above paints a clear picture: the big mountain of low quality spatio-temporal proposals greatly hampers the performance. For example, on UCF Sports, only 7% of the proposals have an overlap of at least 0.5 (!). Only when removing most of the low quality proposals can we achieve competitive results.

The results above show why spatio-temporal proposals are going out of fashion. Why search for a needle in a haystack? Still, spatio-temporal proposals do maintain several benefits. Most notably, spatio-temporal proposals are unsupervised, which means that the intractable search space of actions can be reduced to a few hundred to a few thousand tubes without the need for manual annotations. But to make them viable in current literature again, care needs to be taken to (i) generate better fitting proposals, (ii) filter low quality proposals, and (iii) guide the selection towards high-quality proposals.

Conclusions and open research items

This blog has detailed our investigation into action localization with point supervision. We start from the observation that spatio-temporal proposals are not restricted to testing and can be used during training. As a result, action localization can be cast as Multiple Instance Learning. Since standard Multiple Instance Learning is not sufficient, we propose point supervision as additional information. We find that points are as effective as boxes for action localization. Moreover, points are 10 to 100 times faster to annotate than boxes [1]. We therefore recommend point supervision for action localization with spatio-temporal proposals.

Concurrent with spatio-temporal proposals, another recent line of work focuses on action localization by separating the spatial and temporal localization, see e.g. [9,10]. A model is first learned at the box level, after which boxes are linked over time. While effective, these methods are more reliant on expensive box supervision. Investigating how point supervision can replace boxes in this framework constitutes viable future work for action localization.


[1] Pascal Mettes and Cees G. M. Snoek. "Pointly-Supervised Action Localization." arXiv:1805.11333 (2018).
[2] Pascal Mettes, Jan C. van Gemert, and Cees G. M. Snoek. "Spot on: Action localization from pointly-supervised proposals." ECCV (2016).
[3] Jasper R. R. Uijlings, Koen E. A. Van De Sande, Theo Gevers, and Arnold W. M. Smeulders. "Selective search for object recognition." IJCV (2013).
[4] C. Lawrence Zitnick and Piotr Dollár. "Edge boxes: Locating object proposals from edges." ECCV (2014).
[5] M. M. Cheng, Z. Zhang, W. Y. Lin, and P. Torr. "BING: Binarized normed gradients for objectness estimation at 300fps." CVPR (2014).
[6] Jan C. van Gemert, Mihir Jain, Ella Gati, and Cees G. M. Snoek. "APT: Action localization proposals from dense trajectories." BMVC (2015).
[7] Ramazan G. Cinbis, Jakob Verbeek, and Cordelia Schmid. "Weakly supervised object localization with multi-fold multiple instance learning." TPAMI (2017).
[8] Stuart Andrews, Ioannis Tsochantaridis, and Thomas Hofmann. "Support vector machines for multiple-instance learning." NIPS (2003).
[9] Georgia Gkioxari and Jitendra Malik. "Finding action tubes." CVPR (2015).
[10] G. Singh, S. Saha, M. Sapienza, P. Torr, and F. Cuzzolin. "Online real-time multiple spatiotemporal action localisation and prediction." ICCV (2017).