Abstract:
Traditional visual tracking methods based on the Siamese network extract pairs of frames from a large number of videos and train on them offline and independently during the training stage. They do not update the model features and they neglect background information, so tracking accuracy drops in complex environments such as background clutter. To address these problems, this paper proposes a dual-path Siamese network visual tracking method with an attention mechanism. The method consists mainly of a feature extraction part and a feature fusion part. In the feature extraction part, the residual network is improved and a dual-path network model is designed: the residual network's reuse of features from earlier layers is combined with the dense network's extraction of new features, and the two networks are spliced together for feature extraction. In addition, dilated convolution replaces traditional convolution, which improves the feature resolution while maintaining a given receptive field. This dual-path feature extraction implicitly updates the model features and thereby yields more accurate image feature information. In the feature fusion part, an attention mechanism is introduced to assign different weights to different parts of the feature maps. In the channel domain, the method selects the valuable target image information and strengthens the interdependence between channels. In the spatial domain, it pays more attention to locally important information and learns richer contextual relationships, which effectively improves the accuracy of object tracking. To confirm the effectiveness of the method, experiments are conducted on the OTB100 and VOT2016 datasets.
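The statement that dilated convolution raises resolution while keeping the receptive field can be checked with standard convolution arithmetic. The sketch below is illustrative only: the function names and the two-layer stacks are assumptions chosen for the example and are not the layer configuration of the proposed network.

```python
def effective_kernel(kernel, dilation):
    """Span covered by a dilated kernel along one axis."""
    return kernel + (kernel - 1) * (dilation - 1)


def output_size(size, kernel, stride=1, dilation=1, padding=0):
    """Output length of a convolution along one axis."""
    return (size + 2 * padding - effective_kernel(kernel, dilation)) // stride + 1


def receptive_field(layers):
    """Receptive field of a stack of (kernel, stride, dilation) layers."""
    rf, jump = 1, 1
    for kernel, stride, dilation in layers:
        rf += (effective_kernel(kernel, dilation) - 1) * jump
        jump *= stride  # spacing between neighbouring outputs, in input pixels
    return rf


# Hypothetical stacks: a strided 3x3 layer followed by a plain 3x3 layer,
# versus an unstrided 3x3 layer followed by a 3x3 layer with dilation 2.
strided_stack = [(3, 2, 1), (3, 1, 1)]
dilated_stack = [(3, 1, 1), (3, 1, 2)]

# Both stacks see the same 7-pixel receptive field.
rf_strided = receptive_field(strided_stack)
rf_dilated = receptive_field(dilated_stack)

# For a 32-pixel input, the strided stack halves the resolution (16),
# while the dilated stack keeps it at 32.
res_strided = output_size(output_size(32, 3, stride=2, padding=1), 3, padding=1)
res_dilated = output_size(output_size(32, 3, padding=1), 3, dilation=2, padding=2)
```

Under these assumptions the two stacks cover the same input region per output position, but only the dilated stack preserves the spatial size of the feature map.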
We use precision, success rate, and expected average overlap as the evaluation criteria; the method reaches 0.868, 0.641, and 0.350, respectively, on the two datasets, improvements of 5.1%, 2.0%, and 0.9% over the benchmark model. The experimental results show that the proposed method makes full use of the advantages of the different networks and, while preserving model accuracy, adapts well to target deformation, reduces interference from similar objects, and achieves more stable tracking.
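For readers unfamiliar with the first two criteria, the following is a minimal sketch of how precision and success rate are commonly computed in OTB-style evaluation. It assumes boxes in (x, y, width, height) format, a 20-pixel center-error threshold for precision, and 21 evenly spaced overlap thresholds for the success score; these are conventional choices, not settings confirmed by this paper, and the function names are hypothetical. Expected average overlap follows the VOT protocol and is not reproduced here.

```python
import math


def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def center_error(a, b):
    """Euclidean distance between the centers of two boxes."""
    return math.hypot((a[0] + a[2] / 2) - (b[0] + b[2] / 2),
                      (a[1] + a[3] / 2) - (b[1] + b[3] / 2))


def precision(preds, gts, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    errors = [center_error(p, g) for p, g in zip(preds, gts)]
    return sum(e <= threshold for e in errors) / len(errors)


def success_auc(preds, gts, steps=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1]."""
    overlaps = [iou(p, g) for p, g in zip(preds, gts)]
    thresholds = [i / (steps - 1) for i in range(steps)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)
```

Both functions take per-frame lists of predicted and ground-truth boxes for one sequence; benchmark scores are then typically averaged over all sequences.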