This project focuses on the temporal aspect of recognizing human activities in videos, an important visual cue that has long been either disregarded or ill-used. We revisit the conventional definition of an activity and restrict it to a Complex Action: a set of one-actions with a weak temporal pattern that serves a specific purpose.


Related works use spatiotemporal 3D convolutions with a fixed kernel size, which is too rigid to capture the variety in temporal extents of complex actions and too short for long-range temporal modeling. In contrast, we use multi-scale temporal convolutions, and we reduce the complexity of 3D convolutions. The outcome is Timeception convolution layers, which reason about minute-long temporal patterns, a factor of 8 longer than the best related works.
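The saving from replacing full 3D convolutions with temporal-only (depthwise) convolutions can be shown with a back-of-the-envelope parameter count. The channel count and kernel sizes below are illustrative assumptions, not the paper's exact configuration:

```python
# Hypothetical sizes: C channels, temporal kernel k_t, spatial kernel k_s.
C, k_t, k_s = 1024, 3, 3

# A full 3D convolution mixes channels, space, and time in one kernel.
full_3d = C * C * k_t * k_s * k_s        # 28,311,552 parameters

# A depthwise temporal-only convolution learns one k_t-tap filter per channel.
temporal_only = C * k_t                  # 3,072 parameters

ratio = full_3d // temporal_only         # 9216x fewer parameters
```

Under these toy numbers, the temporal-only convolution is three to four orders of magnitude cheaper, which is what makes stacking several Timeception layers for long-range modeling affordable.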


Timeception achieves impressive accuracy in recognizing the human activities of Charades. Further, we conduct analysis to demonstrate that Timeception learns long-range temporal dependencies and tolerates the varied temporal extents of complex actions.


A complex action, e.g. Cooking a Meal, exhibits a few properties:
- composition: consists of several one-actions (Cook, ...)
- order: weak temporal order of one-actions (Get → Wash)
- extent: one-actions vary in their temporal extents
- holistic: it emerges throughout the entire video

The core component of our method is the Timeception layer (left). It takes as input the features X, corresponding to T timesteps from the previous layer in the network. It then splits them into N groups and temporally convolves each group using the temporal convolution module (right), a novel building block comprising multi-scale temporal-only convolutions to tolerate a variety of temporal extents in a complex action. Timeception makes use of grouped convolutions and channel shuffling to learn cross-channel correlations more efficiently than 1×1 spatial convolutions.
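As a concrete illustration, the channel-shuffle step between grouped convolutions can be sketched in a few lines of NumPy. Shapes are simplified to (T, C) and the function name is ours; this is a toy sketch, not the paper's implementation:

```python
import numpy as np

def channel_shuffle(x, n_groups):
    # x: features of shape (T, C) — T timesteps, C channels (toy layout).
    t, c = x.shape
    assert c % n_groups == 0
    # Reshape to (T, groups, channels_per_group), swap the last two axes,
    # then flatten back so channels from different groups are interleaved.
    return x.reshape(t, n_groups, c // n_groups).transpose(0, 2, 1).reshape(t, c)

# One timestep with 6 channels, split into 2 groups of 3:
x = np.arange(6).reshape(1, 6)
channel_shuffle(x, 2)  # channel order becomes [0, 3, 1, 4, 2, 5]
```

The interleaving guarantees that the next grouped convolution sees channels from every previous group, so cross-channel correlations are learned without a dense 1×1 mixing layer.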

In order to tolerate temporal extents, Timeception uses multi-scale temporal kernels, with two options:
i. different kernel sizes and fixed dilation rate
ii. different dilation rates and fixed kernel size
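The two options cover the same set of temporal receptive fields. A quick sketch makes this explicit; the kernel sizes k ∈ {3, 5, 7} follow the values discussed below, while the dilation rates are chosen here to match them:

```python
def receptive_field(kernel_size, dilation):
    # Temporal receptive field of a dilated 1D convolution.
    return (kernel_size - 1) * dilation + 1

# Option i: different kernel sizes, fixed dilation rate.
option_i = [receptive_field(k, 1) for k in (3, 5, 7)]

# Option ii: fixed kernel size, different dilation rates.
option_ii = [receptive_field(3, d) for d in (1, 2, 3)]

assert option_i == option_ii == [3, 5, 7]
```

The dilated variant reaches the same temporal extents while every filter keeps only 3 taps, so it needs fewer weights than the variant with growing kernel sizes.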


The weights learned by the temporal convolutions of three Timeception layers. Each layer uses multi-scale convolutions with varying kernel sizes k ∈ {3, 5, 7}. In the bottom layer (1), we notice that long kernels (k = 7) capture fine-grained temporal dependencies, but at the top layer (3) the long kernels tend to focus on coarse-grained temporal correlations. The same behavior prevails for the short (k = 3) and medium (k = 5) kernels.

| Method | Modality | mAP (%) |
|---|---|---|
| ResNet-152 | RGB | 22.8 |
| ResNet-152 + Timeception | RGB | 31.6 |
| I3D | RGB | 32.9 |
| I3D + Timeception | RGB | 37.2 |
| 3D ResNet-101 | RGB | 35.5 |
| 3D ResNet-101 + NL | RGB | 37.5 |
| 3D ResNet-50 + GCN | RGB | 37.5 |
| 3D ResNet-101 + GCN | RGB | 39.7 |
| 3D ResNet-101 + Timeception | RGB | 41.1 |

Timeception outperforms related works using the same backbone CNN, achieving absolute gains of 8.8% and 4.3% over ResNet-152 and I3D, respectively. Moreover, using the full capacity of Timeception improves 1.4% over the best related work.



Timeception for Complex Action Recognition
@article{hussein2018timeception,
  title={Timeception for Complex Action Recognition},
  author={Hussein, Noureldien and Gavves, Efstratios and Smeulders, Arnold WM},
  journal={arXiv preprint arXiv:1812.01289},
  year={2018}
}