Dual-Path Siamese Network Visual Tracking Method with Attention Mechanism
Author: Xie Jiang, Zhu Yan, Shen Tao, Zeng Kai, Liu Yingli
Affiliation:

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China; 2. Yunnan Key Laboratory of Computer Technologies Application, Kunming University of Science and Technology, Kunming 650500, China

Fund Project:

National Natural Science Foundation of China (61971208, 61671225, 52061020, 61702128); Key Project of the Applied Basic Research Program of Yunnan Province (2018FA034); Reserve Talent Program for Young and Middle-aged Academic and Technical Leaders of Yunnan Province (Shen Tao, 2018); Young Top-Notch Talent Program of the Yunnan Ten Thousand Talents Plan (Shen Tao, Zhu Yan; Yunnan Provincial Department of Human Resources and Social Security No.2018 73); Talent Training Program of Kunming University of Science and Technology (KKSY201703016).

    Abstract:

    Traditional visual tracking methods based on Siamese networks are trained offline and independently on pairs of frames extracted from a large number of videos. They do not update the model features during tracking and neglect background information, so their tracking accuracy is low in complex environments such as background clutter. To address these problems, this paper proposes a dual-path Siamese network visual tracking method with an attention mechanism. The method consists of two parts: a feature extractor and a feature fusion module. In the feature extractor, the residual network is improved into a dual-path network model. It combines the residual network's reuse of features from earlier layers with the densely connected network's extraction of new features, and concatenates the two networks for feature extraction. In addition, dilated convolution replaces conventional convolution, which increases the resolution while preserving the receptive field. This dual-path feature extraction implicitly updates the model features and yields more accurate image features. In the feature fusion module, an attention mechanism assigns different weights to different parts of the feature maps. In the channel domain, it selects valuable target information and strengthens the interdependence between channels; in the spatial domain, it focuses on important local information and learns richer contextual relationships, which effectively improves tracking accuracy. To verify the effectiveness of the method, experiments are conducted on the OTB100 and VOT2016 datasets.
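The dual-path extractor with dilated convolution can be illustrated with a minimal NumPy sketch. It is not the authors' network: the channel counts, the 3×3 kernels, the dilation rate of 2, the random weights, and the helper names `dilated_conv2d` and `dual_path_block` are hypothetical. The residual path adds its output back onto the input (reusing earlier features), the dense path produces new channels, and the two are concatenated. A 3×3 kernel with dilation 2 covers a 5×5 window, so the receptive field grows while "same" padding keeps the feature map at full resolution.

```python
import numpy as np


def dilated_conv2d(x, w, dilation=1):
    """Stride-1, zero-padded ("same") 2-D cross-correlation with dilation.

    x: feature map of shape (C_in, H, W)
    w: kernel of shape (C_out, C_in, k, k) with odd k
    A k x k kernel with dilation d spans d * (k - 1) + 1 pixels, so the
    receptive field grows while the output keeps the input's H x W.
    """
    c_out, _, k, _ = w.shape
    span = dilation * (k - 1) + 1          # effective kernel extent
    pad = span // 2                        # symmetric padding keeps H and W
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    _, h, wd = x.shape
    out = np.zeros((c_out, h, wd))
    for i in range(k):
        for j in range(k):
            # Input pixels seen by kernel tap (i, j), spaced `dilation` apart.
            patch = xp[:, i * dilation:i * dilation + h, j * dilation:j * dilation + wd]
            # Sum over input channels: (C_out, C_in) . (C_in, H, W) -> (C_out, H, W)
            out += np.tensordot(w[:, :, i, j], patch, axes=([1], [0]))
    return out


def dual_path_block(x, w_res, w_dense, dilation=2):
    """Hypothetical dual-path block built from dilated convolutions.

    w_res:   (C, C, k, k)  residual path, same channel count as x
    w_dense: (G, C, k, k)  dense path, G newly extracted channels
    Returns a map with C + G channels and the same H x W as x.
    """
    # Residual path: add the ReLU-activated output back onto x (feature reuse).
    res = x + np.maximum(dilated_conv2d(x, w_res, dilation), 0)
    # Dense path: compute G new channels (new-feature extraction).
    new = np.maximum(dilated_conv2d(x, w_dense, dilation), 0)
    # Splice the two paths together along the channel axis.
    return np.concatenate([res, new], axis=0)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16, 16))                # 8-channel, 16 x 16 map
w_res = 0.1 * rng.standard_normal((8, 8, 3, 3))     # residual path keeps 8 channels
w_dense = 0.1 * rng.standard_normal((4, 8, 3, 3))   # dense path adds 4 channels
y = dual_path_block(x, w_res, w_dense, dilation=2)
print(y.shape)  # (12, 16, 16): 8 + 4 channels, resolution unchanged
```

Concatenating rather than summing the two paths is what keeps the newly extracted dense-path channels separate from the reused residual features.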
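The channel-domain and spatial-domain weighting in the feature fusion module can likewise be sketched in NumPy. The paper's figures name an SE block and a spatial attention structure, but the exact design is not given here, so this sketch assumes a standard squeeze-and-excitation block for the channel domain and a CBAM-style spatial map (channel-wise mean and max, a 7×7 kernel, sigmoid) for the spatial domain. The reduction ratio of 4, the channel-then-spatial order, the random weights, and the function names `se_channel_attention` and `spatial_attention` are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def se_channel_attention(x, w1, w2):
    """Squeeze-and-excitation channel attention on x of shape (C, H, W).

    w1: (C // r, C) weights of the reduction FC layer
    w2: (C, C // r) weights of the expansion FC layer
    """
    s = x.mean(axis=(1, 2))          # squeeze: global average pooling -> (C,)
    z = np.maximum(w1 @ s, 0)        # FC + ReLU -> (C // r,)
    a = sigmoid(w2 @ z)              # FC + sigmoid -> one weight per channel
    return x * a[:, None, None], a   # rescale every channel by its weight


def spatial_attention(x, w):
    """CBAM-style spatial attention (an assumed variant).

    w: (2, k, k) kernel, odd k, applied to the channel-wise mean and max maps.
    """
    pooled = np.stack([x.mean(axis=0), x.max(axis=0)])   # (2, H, W)
    k = w.shape[-1]
    pad = k // 2
    pp = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))
    h, wd = x.shape[1:]
    logits = np.zeros((h, wd))
    for i in range(k):
        for j in range(k):
            # Weighted sum of the two pooled maps at kernel offset (i, j).
            logits += np.tensordot(w[:, i, j], pp[:, i:i + h, j:j + wd], axes=1)
    m = sigmoid(logits)              # one weight per spatial position
    return x * m[None, :, :], m


rng = np.random.default_rng(0)
feat = rng.standard_normal((16, 8, 8))           # 16-channel, 8 x 8 feature map
w1 = rng.standard_normal((4, 16))                # reduction ratio r = 4
w2 = rng.standard_normal((16, 4))
out, a = se_channel_attention(feat, w1, w2)      # channel domain first
out, m = spatial_attention(out, 0.1 * rng.standard_normal((2, 7, 7)))  # then spatial
print(a.shape, m.shape, out.shape)  # (16,) (8, 8) (16, 8, 8)
```

The channel weights `a` decide which feature channels carry target information, while the spatial map `m` decides which positions in the map matter; both are multiplicative gates in (0, 1).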
    Precision, success rate and expected average overlap (EAO) are used as the evaluation criteria. The proposed method achieves a precision of 0.868 and a success rate of 0.641 on OTB100 and an EAO of 0.350 on VOT2016, which are 5.1%, 2.0% and 0.9% higher, respectively, than those of the baseline model. The experimental results show that the proposed method makes full use of the advantages of the different networks: while maintaining model accuracy, it adapts well to changes in target appearance such as deformation, reduces interference from similar objects, and achieves more stable tracking.
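Precision and success rate are the standard OTB one-pass evaluation measures. The minimal sketch below assumes the usual OTB definitions: precision is the fraction of frames whose predicted box centre lies within 20 pixels of the ground-truth centre, and success rate is reported as the area under the success plot, i.e. the mean fraction of frames whose IoU exceeds each of 21 thresholds from 0 to 1. Boxes are assumed to be (x, y, w, h), and the function names are illustrative. The VOT2016 EAO measure depends on the VOT toolkit's reset-based protocol and is not covered.

```python
def center(box):
    """Centre (cx, cy) of a box given as (x, y, w, h)."""
    x, y, w, h = box
    return x + w / 2.0, y + h / 2.0


def iou(a, b):
    """Intersection over union of two (x, y, w, h) boxes."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def precision(pred, gt, threshold=20.0):
    """Fraction of frames whose centre location error is <= threshold pixels."""
    hits = 0
    for p, g in zip(pred, gt):
        (px, py), (gx, gy) = center(p), center(g)
        if ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5 <= threshold:
            hits += 1
    return hits / len(gt)


def success_auc(pred, gt, num_thresholds=21):
    """Area under the success plot: mean success rate over IoU thresholds 0..1."""
    overlaps = [iou(p, g) for p, g in zip(pred, gt)]
    thresholds = [i / (num_thresholds - 1) for i in range(num_thresholds)]
    rates = [sum(o > t for o in overlaps) / len(overlaps) for t in thresholds]
    return sum(rates) / len(rates)


gt = [(0, 0, 10, 10)] * 3
pred = [(0, 0, 10, 10), (5, 0, 10, 10), (30, 30, 10, 10)]
print(round(precision(pred, gt), 3))    # 0.667: 2 of 3 centres within 20 px
print(round(success_auc(pred, gt), 3))  # 0.429: IoUs are 1, 1/3 and 0
```

Averaging the success rate over all thresholds, rather than reading it at a single IoU such as 0.5, is what makes the reported success rate an area under the curve.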

    Figures and tables:
    Fig. 1 Structure diagram of AlexNet
    Fig. 2 Schematic of the shortcut connection
    Fig. 3 Structure diagram of DenseNet
    Fig. 4 Structure diagram of spatial attention
    Fig. 5 Framework of the whole network
    Fig. 6 Structure diagram of DualpathNet
    Fig. 7 Structure diagram of dilated convolution
    Fig. 8 Structure diagram of the SE block
    Fig. 9 Visualized CAM heat map
    Fig. 10 Feature fusion block
    Fig. 11 Target classification and regression block
    Fig. 12 Sample images
    Fig. 13 Precision results under low resolution and out of view
    Fig. 14 Success rate results under out of view and background clutter
    Fig. 15 Precision results under background clutter and deformation
    Fig. 16 Success rate results under low resolution and deformation
    Fig. 17 Overall performance indices on the OTB100 dataset
    Table 1 Experimental results of different backbones on the OTB100 dataset
    Table 2 Experimental results with and without attention on the OTB100 dataset
    Table 3 Complete ablation results on the OTB100 dataset
    Table 4 Comparison of results of different algorithms on the OTB100 and VOT2016 datasets
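Cite this article

Xie Jiang, Zhu Yan, Shen Tao, Zeng Kai, Liu Yingli. Dual-path Siamese network visual tracking method with attention mechanism[J]. Journal of Data Acquisition and Processing, 2022, 37(1): 94-107.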

History
  • Received: 2021-03-26
  • Last revised: 2021-06-16
  • Published online: 2022-01-25