Multimodal Emotion Recognition Based on Acoustic and Lexical Features
Author:
Affiliation:

1. School of Physics and Electronic Engineering, Jiangsu Normal University, Xuzhou 221116, China; 2. Kewen College, Jiangsu Normal University, Xuzhou 221116, China; 3. School of Linguistic Sciences and Arts, Jiangsu Normal University, Xuzhou 221116, China

Author Biography:

Corresponding Author:

Fund Project:

Young Scientists Fund of the National Natural Science Foundation of China (52005267); Natural Science Foundation of the Jiangsu Higher Education Institutions of China (18KJB510013, 17KJB510018); University Innovation Project (2021XKT1250).

Abstract:

In the speech modality, the OpenSMILE toolbox is used to extract shallow (low-level) acoustic features from the speech signal, a Transformer Encoder network mines deep features from these shallow features, and the deep and shallow features are fused to obtain a richer emotional representation. In the text modality, given the correlation between speaking pauses and emotion, the speech is aligned with its transcript to obtain pause information, which is added to the transcript through pause encoding; utterance-level lexical features are then extracted with the improved DC-BERT model. The acoustic and lexical features are fused, and an attention-based bi-directional long short-term memory (BiLSTM-ATT) neural network is used for emotion classification. Finally, three attention mechanisms integrated into the BiLSTM network are compared for emotion recognition, namely local attention, self-attention, and multi-head self-attention, and local attention is found to perform best. Experiments show that the proposed method achieves a weighted accuracy of 78.7% on four-class emotion classification on the IEMOCAP dataset, outperforming the baseline system.
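To make the acoustic branch more concrete, the following is a minimal PyTorch sketch rather than the paper's implementation: frame-level low-level descriptors such as those extracted by OpenSMILE are projected to a model dimension, passed through a Transformer encoder to obtain deep features, and concatenated with the shallow projection. The 130-dimensional input, layer sizes, and the `AcousticEncoder` name are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the authors' code): fusing deep and shallow
# acoustic features with a Transformer encoder over OpenSMILE low-level descriptors.
import torch
import torch.nn as nn

class AcousticEncoder(nn.Module):
    def __init__(self, lld_dim: int = 130, d_model: int = 256, num_layers: int = 2):
        super().__init__()
        self.proj = nn.Linear(lld_dim, d_model)              # shallow feature projection
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, lld: torch.Tensor) -> torch.Tensor:
        # lld: (batch, frames, lld_dim) low-level descriptors from OpenSMILE
        shallow = self.proj(lld)                              # projected shallow features
        deep = self.encoder(shallow)                          # deep features from the encoder
        return torch.cat([shallow, deep], dim=-1)             # fuse deep and shallow features

fused = AcousticEncoder()(torch.randn(8, 100, 130))          # -> shape (8, 100, 512)
```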
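The pause-encoding step on the text side can be illustrated as follows. This sketch assumes word-level timestamps from a forced aligner; the `[PAUSE]` token and the 0.2 s threshold are hypothetical choices, not values taken from the paper.

```python
# Minimal sketch of pause encoding: insert a pause marker into the transcript
# wherever the silence between consecutive aligned words exceeds a threshold.
from typing import List, Tuple

PAUSE_TOKEN = "[PAUSE]"      # hypothetical pause marker added to the transcript
PAUSE_THRESHOLD_S = 0.2      # assumed minimum silence length (seconds) to count as a pause

def encode_pauses(aligned_words: List[Tuple[str, float, float]]) -> str:
    """aligned_words: (word, start_time, end_time) triples from forced alignment."""
    tokens: List[str] = []
    prev_end = None
    for word, start, end in aligned_words:
        if prev_end is not None and start - prev_end >= PAUSE_THRESHOLD_S:
            tokens.append(PAUSE_TOKEN)   # a perceptible pause precedes this word
        tokens.append(word)
        prev_end = end
    return " ".join(tokens)

# Example: the 0.5 s gap before "alone" becomes an explicit pause marker.
words = [("I", 0.00, 0.12), ("feel", 0.15, 0.40), ("alone", 0.90, 1.30)]
print(encode_pauses(words))  # -> "I feel [PAUSE] alone"
```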
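For the classification stage, the sketch below pairs a BiLSTM with a simple soft-attention pooling layer over the fused acoustic and lexical features. The paper compares local, self-, and multi-head self-attention and finds local attention best; the generic global attention used here, as well as the feature dimension, hidden size, and the four IEMOCAP classes, are illustrative assumptions.

```python
# Minimal PyTorch sketch (not the paper's BiLSTM-ATT) of a BiLSTM classifier
# with soft-attention pooling over fused acoustic + lexical feature sequences.
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, input_dim: int = 1024, hidden_dim: int = 128, num_classes: int = 4):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.score = nn.Linear(2 * hidden_dim, 1)             # one attention score per time step
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, input_dim) fused acoustic and lexical features
        h, _ = self.bilstm(x)                                 # (batch, time, 2*hidden_dim)
        weights = torch.softmax(self.score(h), dim=1)         # attention weights over time
        context = (weights * h).sum(dim=1)                    # attention-weighted pooling
        return self.classifier(context)                       # logits for the emotion classes

logits = BiLSTMAttention()(torch.randn(8, 50, 1024))          # -> shape (8, 4)
```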

Cite this article

顾煜, 金赟, 马勇, 姜芳艽, 俞佳佳. Multimodal emotion recognition based on acoustic and lexical features[J]. 数据采集与处理, 2022, 37(6): 1353-1362.

History
  • Received: 2022-01-04
  • Revised: 2022-11-07
  • Accepted:
  • Published online: 2022-11-25