Speech Emotion Recognition with Multi-task Learning
Author:
Affiliation:

1. School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China; 2. Research & Development Group, iFLYTEK Co., Ltd., Hefei 230088, China

Author biography:

Corresponding author:

Fund projects:

Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0107300); Xuzhou Basic Research Program Project (KC22020).




    Abstract:

    In recent speech emotion recognition research, researchers have attempted to identify emotion from speech signals using deep learning models. However, traditional models based on single-task learning pay insufficient attention to the acoustic emotional information in speech, resulting in low emotion recognition accuracy. In view of this, this paper proposes an end-to-end speech emotion recognition network based on multi-task learning, which mines the acoustic emotion in speech to improve recognition accuracy. To avoid the information loss caused by frequency-domain features, the model adopts the self-supervised Wav2vec2.0 network, which operates directly on time-domain signals, as its backbone to extract the acoustic and semantic features of speech, and an attention mechanism fuses the two kinds of features into self-supervised features. To make full use of the acoustic emotional information in speech, emotion-related phoneme recognition is used as an auxiliary task, and multi-task learning mines the acoustic emotion in the self-supervised features. Experimental results on the public IEMOCAP dataset show that the proposed multi-task learning model achieves a weighted accuracy of 76.0% and an unweighted accuracy of 76.9%, a clear improvement over the traditional single-task learning model. Meanwhile, ablation experiments verify the effectiveness of the auxiliary task and of the self-supervised network fine-tuning strategy.
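    The pipeline the abstract describes — fusing acoustic and semantic feature streams with attention, then training the emotion classifier jointly with an auxiliary phoneme-recognition task — can be sketched as follows. This is a minimal illustration, not the authors' implementation: the softmax attention over exactly two feature streams, the loss-combination form, and the weight `alpha` are all assumptions made for the sketch.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(acoustic, semantic, scores):
    """Attention-style fusion (assumed form): weight the acoustic and
    semantic feature vectors by softmax-normalized scores, then sum
    them element-wise into a single fused feature vector."""
    w = softmax(scores)  # w[0] for acoustic, w[1] for semantic
    return [w[0] * a + w[1] * s for a, s in zip(acoustic, semantic)]

def multitask_loss(emotion_loss, phoneme_loss, alpha=0.1):
    """Joint objective: emotion classification is the main task, and
    emotion-related phoneme recognition contributes a weighted
    auxiliary term (alpha is a hypothetical hyperparameter)."""
    return emotion_loss + alpha * phoneme_loss
```

In a real system the two streams would come from different layers of the Wav2vec2.0 backbone and the losses from a cross-entropy head and a CTC head; here plain lists and scalars stand in for them.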

Cite this article:

Li Yunfeng, Yan Zulong, Gao Tian, Fang Xin, Zou Liang. Speech emotion recognition with multi-task learning[J]. Journal of Data Acquisition and Processing, 2024, (2): 424-432.

History
  • Received: 2022-12-11
  • Revised: 2023-04-25
  • Accepted:
  • Published online: 2024-04-22