Image Captioning Method for Fusing Multi-temporal Dimensional Visual and Semantic Information
Author: CHEN Shanxue, WANG Cheng

Affiliation: School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China

CLC Number: TP391
Abstract:

Traditional image captioning methods generate each predicted word using only the visual and semantic information of the current time step, without considering the visual and semantic information of past time steps. As a result, the model's output is relatively homogeneous along the temporal dimension and the generated captions lack accuracy. To address this problem, an image captioning method that fuses multi-temporal visual and semantic information is proposed: it effectively fuses the visual and semantic information of past time steps and designs a gating mechanism to dynamically select between the two kinds of information. Experiments on the MSCOCO dataset show that the method generates captions more accurately and improves considerably on all evaluation metrics compared with current state-of-the-art image captioning methods.
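The abstract does not give the exact architecture, but the described gating mechanism can be illustrated with a minimal sketch: a learned sigmoid gate that, per feature dimension, decides how much current versus past visual/semantic information to keep. All names, shapes, and the concatenation layout below are hypothetical assumptions, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(v_t, s_t, v_past, s_past, W_g, b_g):
    """Hypothetical sketch: fuse current and past visual/semantic features.

    A gate g in (0, 1), computed from all four feature vectors, selects
    per dimension between the current features and the past features.
    """
    concat = np.concatenate([v_t, s_t, v_past, s_past])
    g = sigmoid(W_g @ concat + b_g)           # gate values in (0, 1)
    current = np.concatenate([v_t, s_t])      # current-step information
    past = np.concatenate([v_past, s_past])   # past-step information
    return g * current + (1.0 - g) * past     # convex combination per dim

d = 4  # toy feature size (assumption)
v_t, s_t, v_past, s_past = (rng.standard_normal(d) for _ in range(4))
W_g = rng.standard_normal((2 * d, 4 * d)) * 0.1  # toy gate weights
b_g = np.zeros(2 * d)
fused = gated_fusion(v_t, s_t, v_past, s_past, W_g, b_g)
print(fused.shape)  # (8,)
```

Because the fusion is a convex combination, each fused dimension stays between the corresponding current and past values, so the gate can only interpolate, never extrapolate, between the two information sources.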

Get Citation

CHEN Shanxue, WANG Cheng. Image Captioning Method for Fusing Multi-temporal Dimensional Visual and Semantic Information[J]., 2024, 39(4): 922-932.

History
  • Received: May 24, 2023
  • Revised: June 26, 2023
  • Online: July 25, 2024