Image Caption Generation Model Based on Graph Neural Network and Guidance Vector

Author:
Affiliation:

College of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

Fund Project:

National Key Research and Development Program of China (2018YFB1700902).




    Abstract:

    In recent years, deep learning has shown its advantages in image captioning research. In deep learning models, the relationships between objects in an image play an important role in image representation. To better detect visual relationships in images, this paper constructs an image caption generation model (YOLOv4-GCN-GRU, YGG) based on a graph neural network and a guidance vector. The model builds a graph from the spatial and semantic information of the objects detected in the image, and uses a graph convolutional network (GCN) as an encoder to represent each region of the graph. During caption generation, an additional guidance neural network is trained to produce a guidance vector that assists the decoder in generating sentences automatically. Comparative experiments on the MSCOCO image dataset show that the YGG model performs better, improving CIDEr-D from 138.9% to 142.1%.
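    As a rough illustration of the encoder stage described above, the following is a minimal sketch of a single graph-convolution step over detected-object features, assuming the standard symmetric-normalized GCN formulation; the function and variable names are hypothetical and not taken from the paper:

```python
import numpy as np

def gcn_layer(A, X, W):
    """One graph-convolution step: H = ReLU(D^-1/2 (A + I) D^-1/2 X W).

    A: (n, n) adjacency matrix over n detected objects (spatial/semantic links)
    X: (n, d) per-region feature vectors from the detector
    W: (d, d_out) learned weight matrix
    """
    A_hat = A + np.eye(A.shape[0])              # add self-loops
    deg = A_hat.sum(axis=1)                     # node degrees
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))    # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W, 0.0)  # ReLU

# toy example: 3 detected objects in a chain graph, 4-dim features
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.ones((3, 4))
W = np.eye(4)
H = gcn_layer(A, X, W)   # (3, 4) relation-aware region representations
```

    Each row of `H` is a region representation that mixes in information from neighboring objects, which is what the decoder (here, a GRU guided by the guidance vector) would attend over.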

Cite this article:

Tong Guoxiang, Li Leyang. Image caption generation model based on graph neural network and guidance vector[J]. Journal of Data Acquisition and Processing, 2023, 38(1): 209-219.
History
  • Received: 2022-01-03
  • Revised: 2022-04-18
  • Accepted:
  • Published online: 2023-01-25