Most existing deep learning-based sound event detection methods adopt conventional 2D convolution. However, its inherent translation invariance is poorly matched to audio spectrograms, limiting the model's ability to detect complex sound events. To address this issue, a hybrid convolutional neural network based on feature fusion is proposed. Specifically, the proposed model estimates the distribution of the audio spectrogram and adaptively generates convolutional kernels, dynamically extracting local features that remain physically consistent with the audio signal. In parallel, a self-attention mechanism captures long-range feature dependencies across the spectrogram. To bridge the semantic gap between local and global features, a feature fusion module is designed to integrate these two distinct representations effectively. Furthermore, to enhance detection performance, an improved bidirectional gated recurrent unit based on a multi-scale attention mechanism refines the fused features, emphasizing event-related frames and suppressing background frames. Experimental results on the DCASE2020 dataset show that the proposed model achieves an F1-score of 52.57%, outperforming existing methods.
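For concreteness, the sketch below illustrates one possible reading of the architecture described above in PyTorch: a dynamic-convolution branch whose kernels are mixed from input statistics, a parallel self-attention branch, a fusion layer, and a bidirectional GRU with frame-level attention. All module names, layer sizes, the per-sample kernel-mixing scheme, and the simplified single-scale frame attention are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the hybrid model outlined in the abstract (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicConvBranch(nn.Module):
    """Local branch: a bank of kernels mixed adaptively from input statistics."""
    def __init__(self, in_ch, out_ch, n_kernels=4, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_kernels, out_ch, in_ch, k, k) * 0.02)
        self.router = nn.Sequential(              # predicts mixing weights per input
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, n_kernels), nn.Softmax(dim=-1))
        self.k = k

    def forward(self, x):                         # x: (B, C, T, F)
        alpha = self.router(x)                    # (B, n_kernels)
        w = torch.einsum("bn,noikl->boikl", alpha, self.weight)   # per-sample kernels
        b, c, t, f = x.shape
        # Grouped-conv trick: fold the batch into groups to apply per-sample kernels.
        out = F.conv2d(x.reshape(1, b * c, t, f),
                       w.reshape(-1, c, self.k, self.k),
                       padding=self.k // 2, groups=b)
        return out.reshape(b, -1, t, f)


class GlobalAttentionBranch(nn.Module):
    """Global branch: self-attention over time-frequency positions."""
    def __init__(self, in_ch, dim=64, heads=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, 1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                         # x: (B, C, T, F)
        b, _, t, f = x.shape
        z = self.proj(x).flatten(2).transpose(1, 2)   # (B, T*F, dim)
        z, _ = self.attn(z, z, z)
        return z.transpose(1, 2).reshape(b, -1, t, f)


class HybridSED(nn.Module):
    def __init__(self, n_classes=10, in_ch=1, local_ch=64, global_ch=64):
        super().__init__()
        self.local_branch = DynamicConvBranch(in_ch, local_ch)
        self.global_branch = GlobalAttentionBranch(in_ch, dim=global_ch)
        # Fusion: concatenate both representations and learn a joint mapping.
        self.fuse = nn.Conv2d(local_ch + global_ch, 128, 1)
        self.rnn = nn.GRU(128, 64, bidirectional=True, batch_first=True)
        # Simplified single-scale stand-in for the multi-scale frame attention.
        self.frame_attn = nn.Linear(128, 1)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                         # x: (B, 1, T, F) log-mel spectrogram
        h = torch.cat([self.local_branch(x), self.global_branch(x)], dim=1)
        h = self.fuse(h).mean(dim=-1)             # pool frequency -> (B, 128, T)
        h, _ = self.rnn(h.transpose(1, 2))        # (B, T, 128)
        a = torch.sigmoid(self.frame_attn(h))     # emphasize event-related frames
        return torch.sigmoid(self.head(h * a))    # frame-wise class probabilities


if __name__ == "__main__":
    model = HybridSED()
    dummy = torch.randn(2, 1, 64, 32)             # small batch of log-mel spectrograms
    print(model(dummy).shape)                     # torch.Size([2, 64, 10])
```

The per-sample kernels are applied with a grouped convolution, a common implementation trick for dynamic convolution; the actual model may generate kernels from spectrogram statistics in a different way, and its improved BiGRU and multi-scale attention are only approximated here.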