Objects2action: Classifying and localizing actions without any video example

ICCV 2015
Mihir Jain, Jan van Gemert, Thomas Mensink and Cees Snoek
University of Amsterdam, Qualcomm Research Netherlands


In our paper, we present objects2action, a semantic embedding to classify actions in videos without using any video data or action annotations as prior knowledge. Instead, it relies on commonly available object annotations, images, and textual descriptions. Our semantic embedding has three main characteristics to accommodate the specifics of actions:
  • It exploits multiple-word descriptions of actions and ImageNet objects.

  • It incorporates the automated selection of the most responsive objects per action.

  • It is readily adaptable to zero-shot action classification and to spatio-temporal localization of actions.
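
The core idea above can be sketched in a few lines: a video is represented by object classifier scores, object and action labels live in a shared word-embedding space, and an unseen action is scored by transferring the object scores through embedding similarity, keeping only the most responsive objects. The sketch below is a toy illustration under our own simplifying assumptions (random placeholder embeddings, a plain cosine-similarity top-k weighting, and the hypothetical function name objects2action_score); it is not the paper's released code or data:

```python
import numpy as np

def objects2action_score(video_obj_scores, obj_embeddings, action_embedding, top_k=3):
    """Zero-shot action score for one video (toy sketch of the objects2action idea).

    video_obj_scores: (n,) object classifier responses on the video.
    obj_embeddings:   (n, d) word embeddings of the object labels.
    action_embedding: (d,) word embedding of the (unseen) action label.
    top_k:            number of most responsive objects kept per action.
    """
    # Cosine similarity between each object label and the action label.
    obj_norm = obj_embeddings / np.linalg.norm(obj_embeddings, axis=1, keepdims=True)
    act_norm = action_embedding / np.linalg.norm(action_embedding)
    sims = obj_norm @ act_norm
    # Automated object selection: keep only the top_k most action-relevant objects.
    top = np.argsort(sims)[-top_k:]
    # Transfer the video's object scores through the semantic similarities.
    return float(video_obj_scores[top] @ sims[top])

# Toy usage: the action embedding coincides with object 0's embedding, so a
# video that fires strongly on object 0 should outscore one that does not.
rng = np.random.default_rng(0)
obj_emb = rng.normal(size=(5, 8))        # 5 placeholder object embeddings
action_emb = obj_emb[0].copy()           # hypothetical action ~ object 0
video_a = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
video_b = np.array([0.0, 1.0, 0.0, 0.0, 0.0])
score_a = objects2action_score(video_a, obj_emb, action_emb, top_k=2)
score_b = objects2action_score(video_b, obj_emb, action_emb, top_k=2)
```

Because no video of the action is ever used, the only supervision comes from the object classifiers and the word embeddings, which is what makes the approach zero-shot.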


We make our word2vec embeddings, the proposed embeddings of object and action class labels, and the object representations of videos available for download:

If you have any questions, please send an e-mail to mihir.jain18@gmail.com.