Siamese Instance Search for Tracking (SINT)

Ran Tao, Efstratios Gavves, Arnold Smeulders
QUVA Lab, University of Amsterdam


In this paper we present a tracker, which is radically different from state-of-the-art trackers: we apply no model updating, no occlusion detection, no combination of trackers, no geometric matching, and still deliver state-of-the-art tracking performance, as demonstrated on the popular online tracking benchmark (OTB) and six very challenging YouTube videos. The presented tracker simply matches the initial patch of the target in the first frame with candidates in a new frame and returns the most similar patch by a learned matching function. The strength of the matching function comes from being extensively trained generically, i.e., without any data of the target, using a Siamese deep neural network, which we design for tracking. Once learned, the matching function is used as is, without any adapting, to track previously unseen targets. It turns out that the learned matching function is so powerful that a simple tracker built upon it, coined Siamese INstance search Tracker, SINT, which only uses the original observation of the target from the first frame, suffices to reach state-of-the-art performance. Further, we show the proposed tracker even allows for target re-identification after the target was absent for a complete video shot.
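The online step described above (compare the first-frame target patch against candidate patches in a new frame and return the most similar one) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each patch has already been mapped to a feature vector by the learned Siamese network, and it assumes the matching score is the inner product of L2-normalized features. The function names `l2_normalize` and `select_best_candidate` are hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def select_best_candidate(query_feat, cand_feats, cand_boxes):
    """Return the candidate box whose feature best matches the query.

    query_feat: (D,) feature of the target patch from the first frame.
    cand_feats: (N, D) features of the candidate patches in the new frame.
    cand_boxes: sequence of N boxes, one per candidate.

    Assumption: similarity is the inner product of L2-normalized
    features; the features themselves would come from the trained
    network, which is not modeled here.
    """
    q = l2_normalize(np.asarray(query_feat, dtype=np.float64))
    c = l2_normalize(np.asarray(cand_feats, dtype=np.float64))
    scores = c @ q                    # one similarity score per candidate
    best = int(np.argmax(scores))     # no model update, no re-ranking
    return cand_boxes[best], float(scores[best])
```

Note that the query feature is computed once from the first frame and reused unchanged for every later frame, which mirrors the "no model updating" design stated in the abstract.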



CVPR 2016 Paper

@inproceedings{tao2016sint,
        title={Siamese Instance Search for Tracking},
        author={Tao, Ran and Gavves, Efstratios and Smeulders, Arnold W M},
        booktitle={Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition},
        year={2016}
}

Performance on OTB (51 sequences)

Although the online part of SINT merely selects the candidate patch that best matches the target in the first frame, SINT is on par with state-of-the-art trackers. SINT+, which uses better candidate sampling than SINT and adds optical flow as an additional component, achieves the best performance.


Performance on six additional challenging sequences

Comparison with MEEM and MUSTer in AUC score

Sequence      MEEM (Zhang et al., ECCV14)   MUSTer (Hong et al., CVPR15)   SINT (ours)
Fishing        4.3%                         11.2%                          53.7%
Rally         20.4%                         27.5%                          53.4%
BirdAttack    40.7%                         50.2%                          66.7%
Soccer        36.9%                         48.0%                          72.5%
GD            13.8%                         34.9%                          35.8%
Dancing       60.3%                         54.7%                          66.8%
mean          29.4%                         37.8%                          58.1%
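For readers unfamiliar with the AUC score reported above, the sketch below shows one common way to compute it for a single sequence, following the usual OTB success-plot convention. It is an illustration under stated assumptions, not the evaluation code used for these numbers: boxes are assumed to be (x, y, width, height), and the success rate is assumed to be averaged over 21 evenly spaced overlap thresholds from 0 to 1. The names `iou` and `success_auc` are hypothetical.

```python
import numpy as np


def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred_boxes, gt_boxes, num_thresholds=21):
    """Area under the success plot for one sequence.

    For each overlap threshold t, the success rate is the fraction of
    frames whose IoU with the ground truth exceeds t. The AUC is taken
    here as the mean success rate over the thresholds (an assumption
    matching the common OTB convention).
    """
    overlaps = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    success = np.array([(overlaps > t).mean() for t in thresholds])
    return float(success.mean())
```

Under this convention, a tracker that loses the target early and never recovers accumulates many zero-overlap frames, which is why long absences weigh heavily on the score.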

Target Re-identification

Because SINT does not drift, it allows for target re-identification after the target has been absent for a long period of time, provided that candidates are sampled over the whole image.
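Sampling over the whole image could, for example, be done with a multi-scale sliding window. The sketch below is a hypothetical illustration of that idea only; the page does not specify how candidates are generated for re-identification, so the function name `sample_whole_image`, the stride, and the scale set are all assumptions.

```python
def sample_whole_image(img_w, img_h, box_w, box_h,
                       stride_frac=0.5, scales=(0.8, 1.0, 1.25)):
    """Generate (x, y, w, h) candidate boxes covering the whole image.

    box_w, box_h: size of the target box from the first frame.
    stride_frac:  window step as a fraction of the window size
                  (hypothetical choice).
    scales:       scale factors applied to the target size
                  (hypothetical choice).
    """
    boxes = []
    for s in scales:
        w, h = box_w * s, box_h * s
        if w > img_w or h > img_h:
            continue  # a window at this scale does not fit in the image
        step_x = max(1.0, w * stride_frac)
        step_y = max(1.0, h * stride_frac)
        y = 0.0
        while y + h <= img_h:
            x = 0.0
            while x + w <= img_w:
                boxes.append((x, y, w, h))
                x += step_x
            y += step_y
    return boxes
```

Each generated box would then be scored against the first-frame target by the learned matching function, and the best-scoring box taken as the re-identified target.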


Code, model and result files



r.tao at


Copyright Notice

The documents contained in these directories are included by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a non-commercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright.