Emotional video captioning, as a cross-modal task that integrates visual semantic parsing with emotional perception, faces the core challenge of accurately capturing the emotional cues embedded in visual content. Existing methods have two notable limitations: first, they insufficiently explore the fine-grained semantic correlations between video subjects (such as humans and objects) and their appearance and motion features, providing weak support for refined understanding of visual content; second, they neglect the auxiliary value of the audio modality for emotional discrimination and semantic alignment, which restricts the comprehensive use of cross-modal information. To address these issues, this paper proposes a framework based on fine-grained visual fusion and audio-visual dual-branch fusion. Specifically, the fine-grained visual feature fusion module models the fine-grained semantic associations between video entities and their visual context through pairwise interaction and deep integration of visual, object, and motion features, thereby achieving refined parsing of video content. The audio-visual dual-branch global fusion module constructs a cross-modal interaction channel that deeply fuses the integrated visual features with audio features, fully exploiting the supplementary role of audio in conveying emotional cues and constraining semantics. Experiments on public benchmark datasets show that the proposed method outperforms comparative methods such as CANet and EPAN across evaluation metrics: relative to EPAN, it achieves an average improvement of 4% on emotional metrics, an average gain of 0.5 on semantic metrics, and an average gain of 0.7 on comprehensive metrics. These results demonstrate that the proposed method effectively improves the quality of emotional video captioning.
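
The sketch below illustrates one plausible way to organize the two fusion stages described above; it is not the authors' implementation. The module names (PairwiseFusion, DualBranchFusion), the feature dimensions, and the use of standard cross-attention via torch.nn.MultiheadAttention are assumptions introduced purely for illustration.

```python
# Minimal sketch (assumed design, not the paper's code): pairwise cross-attention
# for fine-grained visual fusion, followed by a global audio-visual fusion step.
import torch
import torch.nn as nn


class PairwiseFusion(nn.Module):
    """Fuses a query stream with a context stream (e.g. object vs. appearance,
    object vs. motion, visual vs. audio) using cross-attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats: torch.Tensor, context_feats: torch.Tensor) -> torch.Tensor:
        # query_feats: (batch, q_tokens, dim); context_feats: (batch, c_tokens, dim)
        fused, _ = self.attn(query_feats, context_feats, context_feats)
        return self.norm(query_feats + fused)  # residual connection + layer norm


class DualBranchFusion(nn.Module):
    """Fine-grained visual fusion followed by audio-visual global fusion."""

    def __init__(self, dim: int = 512):
        super().__init__()
        self.obj_app = PairwiseFusion(dim)   # objects attend to appearance features
        self.obj_mot = PairwiseFusion(dim)   # objects attend to motion features
        self.av_fuse = PairwiseFusion(dim)   # fused visual features attend to audio

    def forward(self, appearance, motion, objects, audio):
        # Pairwise interactions between video entities and their visual context.
        vis = self.obj_app(objects, appearance)
        vis = self.obj_mot(vis, motion)
        # Global audio-visual branch: audio supplements emotional cues.
        return self.av_fuse(vis, audio)


if __name__ == "__main__":
    b, t, d = 2, 16, 512
    model = DualBranchFusion(d)
    out = model(torch.randn(b, t, d), torch.randn(b, t, d),
                torch.randn(b, 8, d), torch.randn(b, t, d))
    print(out.shape)  # torch.Size([2, 8, 512])
```

In this reading, the fused representation would then feed a caption decoder; how the decoder consumes it, and how emotional supervision is applied, are left unspecified here because the abstract does not detail them.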