Joint Inference of Visual Attention and Semantic Perception for Scene Text Recognition

Authors: 佟国香, 董田荣, 胡珩彰

Affiliation: College of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China

Fund Project: National Key R&D Program of China (2018YFB1700902)


    Abstract:

    Irregular text recognition in natural scenes remains a challenging problem. For arbitrarily shaped and low-quality scene text, this paper proposes a multimodal network that combines a visual attention module with a semantic perception module. The visual attention module extracts visual features from the image using a parallel attention mechanism combined with position-aware encoding. The semantic perception module, based on weakly supervised learning, learns linguistic information to compensate for deficiencies in the visual features; it adopts a Transformer-based variant that is trained by randomly masking one character in each word, which strengthens the model's contextual semantic reasoning. A visual-semantic fusion module exchanges information between the two modalities through a gating mechanism to produce robust features for character prediction. Extensive experiments demonstrate that the proposed method effectively recognizes arbitrarily shaped and low-quality scene text and achieves competitive results on several benchmark datasets. In particular, on SVT and SVTP, two datasets containing low-quality text, it reaches recognition accuracies of 93.6% and 86.2%, respectively, improvements of 3.5% and 3.9% over the model that uses only the visual module, fully demonstrating the importance of semantic information for text recognition.
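    The gating mechanism in the fusion module can be pictured with a short sketch. The PyTorch snippet below is a minimal illustration under our own assumptions, not the authors' released code; the class name, feature dimension, and per-character alignment of the two feature sequences are all hypothetical. It shows one common way a sigmoid gate can mix per-character visual and semantic features before classification:

        import torch
        import torch.nn as nn

        class GatedFusion(nn.Module):
            """Per-channel gate that mixes visual and semantic features (sketch)."""
            def __init__(self, d_model: int, num_classes: int):
                super().__init__()
                # The gate is computed from both modalities jointly (an assumption).
                self.gate = nn.Linear(2 * d_model, d_model)
                self.classifier = nn.Linear(d_model, num_classes)

            def forward(self, visual, semantic):
                # visual, semantic: (batch, seq_len, d_model), aligned per character slot
                g = torch.sigmoid(self.gate(torch.cat([visual, semantic], dim=-1)))
                fused = g * visual + (1.0 - g) * semantic  # convex mixture per channel
                return self.classifier(fused)              # (batch, seq_len, num_classes)

    For a 37-class lowercase-alphanumeric character set, logits = GatedFusion(512, 37)(vis, sem) yields per-position character scores; the gate decides, channel by channel, whether the visual or the semantic evidence should dominate each prediction.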

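    The masked-character training of the semantic perception module can be sketched in the same hedged way. The helper below is hypothetical (the token id, tensor shapes, and function name are assumptions); it replaces one randomly chosen character per word with a [MASK] token so that a Transformer-based language module must recover it from the surrounding context:

        import torch

        MASK_ID = 1  # assumed index of the [MASK] token in the character vocabulary

        def mask_one_char(labels: torch.Tensor, lengths: torch.Tensor) -> torch.Tensor:
            # labels:  (batch, max_len) ground-truth character ids
            # lengths: (batch,) number of valid characters in each word
            masked = labels.clone()
            # Draw one random valid position per word and overwrite it with [MASK].
            pos = (torch.rand(lengths.shape, device=labels.device) * lengths).long()
            masked[torch.arange(labels.size(0)), pos] = MASK_ID
            return masked

    Training the semantic module to predict the original character at the masked position forces it to rely on bidirectional context, which is what lets it compensate for visually ambiguous or degraded characters at inference time.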
Cite this article:

佟国香, 董田荣, 胡珩彰. 视觉注意与语义感知联合推理实现场景文本识别[J]. 数据采集与处理, 2023, 38(3): 665-675.

History
  • Received: 2022-11-23
  • Revised: 2023-03-21
  • Published online: 2024-04-22