Recent Advances in Visual Question Answering and Reasoning (跨模态视觉问答与推理研究进展)

doi:10.16337/j.1004-9037.2023.01.001

Authors: 张飞飞, 张建庆, 屈思佳, 周琬婷

Affiliations: 1. School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China; 2. School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China

Funding: National Key Research and Development Program of China (2018AAA0102200); National Natural Science Foundation of China (62036012, 62002355, 61832002, 62072455, 62102415, 62106262, 62006227); Beijing Natural Science Foundation (L201001).

Abstract:

With the rapid development of social media and human-computer interaction, the volume of multimodal data on the Internet, such as video, images, and text, has grown explosively, drawing increasing attention to multimodal intelligence research. Visual question answering and reasoning is an important component of cross-modal intelligence research and an important foundation for realizing artificial intelligence; it has been successfully applied in human-computer interaction, intelligent medical care, and autonomous driving. This paper gives a comprehensive overview and categorized analysis of algorithms for visual question answering and reasoning. First, we define the task and briefly describe the challenges it currently faces. Second, we summarize existing methods from five perspectives: attention mechanisms, graph networks, pre-training, external knowledge bases, and explainable reasoning mechanisms. Third, we introduce the commonly used public datasets for the task and analyze in detail the existing algorithms evaluated on them. Finally, we discuss future directions for visual question answering and reasoning.

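Cite this article

张飞飞, 张建庆, 屈思佳, 周琬婷. Recent Advances in Visual Question Answering and Reasoning [J]. Journal of Data Acquisition and Processing (数据采集与处理), 2023, 38(1): 1-20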

History

Received: 2022-10-28
Last revised: 2022-12-09
Published online: 2023-01-25
