
Using Speech and Text Features Fusion to Improve Speech Emotion Recognition
Author:

Feng Yaqin, Shen Lingjie, Hu Tingting, Wang Wei
Affiliation:

School of Education Science, Nanjing Normal University, Nanjing, 210097, China

Fund Project:

Supported by the National Social Science Foundation of China (BCA150054).


    Abstract:

    Emotion recognition is of great significance in human-computer interaction. The purpose of this study was to improve the accuracy of emotion recognition by fusing speech and text features. The speech features were acoustic and prosodic features, and the text features were bag-of-words (BoW) features based on an emotion dictionary, together with N-gram features. We used these features for emotion recognition and compared their performance on the four-class IEMOCAP dataset. We also compared different fusion methods, namely feature-layer fusion and decision-layer fusion. Experimental results show that the fusion of speech and text features outperforms any single feature type, and that decision-layer fusion of speech and text features outperforms feature-layer fusion. Moreover, with a convolutional neural network (CNN) classifier, the unweighted average recall (UAR) of decision-layer fusion with three features reaches 68.98%, surpassing the previous best result on the IEMOCAP dataset.
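
    The text representations named in the abstract are standard, so a short illustration may help. Below is a minimal scikit-learn sketch, assuming a BoW vector whose vocabulary is restricted to an emotion dictionary plus word N-gram counts; the lexicon entries and parameter values are placeholders, not the authors' actual dictionary or settings.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical emotion lexicon; the paper's actual dictionary is not given here.
emotion_lexicon = ["happy", "glad", "sad", "angry", "mad", "afraid", "excited"]

# BoW restricted to the emotion dictionary: counts of lexicon words only.
bow_vectorizer = CountVectorizer(vocabulary=emotion_lexicon)

# Word N-gram features (here unigrams and bigrams) over the transcript.
ngram_vectorizer = CountVectorizer(ngram_range=(1, 2), max_features=5000)

transcripts = ["i am so happy today", "this makes me really angry"]
bow_feats = bow_vectorizer.fit_transform(transcripts).toarray()
ngram_feats = ngram_vectorizer.fit_transform(transcripts).toarray()
print(bow_feats.shape, ngram_feats.shape)
```

    Likewise, the two fusion strategies and the UAR metric can be sketched concretely. The NumPy sketch below assumes simple concatenation for feature-layer fusion and equal-weight averaging of class posteriors for decision-layer fusion; all function names and the toy data are illustrative, not the authors' implementation.

```python
import numpy as np

def feature_layer_fusion(speech_feats, text_feats):
    """Feature-layer fusion: concatenate the per-utterance speech and
    text feature vectors into one vector fed to a single classifier."""
    return np.concatenate([speech_feats, text_feats], axis=-1)

def decision_layer_fusion(posteriors_list, weights=None):
    """Decision-layer fusion: average the class posteriors produced by
    separately trained classifiers (one per feature type)."""
    stacked = np.stack(posteriors_list)        # (n_models, n_classes)
    return np.average(stacked, axis=0, weights=weights)

def unweighted_average_recall(y_true, y_pred, n_classes=4):
    """UAR: the mean of per-class recalls, so every emotion class counts
    equally regardless of how often it occurs in the test set."""
    recalls = []
    for c in range(n_classes):
        mask = y_true == c
        if mask.any():
            recalls.append(float((y_pred[mask] == c).mean()))
    return float(np.mean(recalls))

# Feature-layer fusion simply concatenates the two vectors:
print(feature_layer_fusion(np.zeros(8), np.zeros(5)).shape)   # (13,)

# Toy decision-layer fusion with random posteriors over the four classes.
rng = np.random.default_rng(0)
speech_post = rng.dirichlet(np.ones(4), size=10)   # speech-model outputs
text_post = rng.dirichlet(np.ones(4), size=10)     # text-model outputs
fused = np.array([decision_layer_fusion([s, t])
                  for s, t in zip(speech_post, text_post)])
y_pred = fused.argmax(axis=1)
y_true = rng.integers(0, 4, size=10)
print(f"UAR = {unweighted_average_recall(y_true, y_pred):.4f}")
```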

Cite this article:

Feng Yaqin, Shen Lingjie, Hu Tingting, Wang Wei. Using Speech and Text Features Fusion to Improve Speech Emotion Recognition[J]. Journal of Data Acquisition and Processing, 2019, 34(4): 625-631.

History
  • Received: 2018-01-21
  • Revised: 2018-04-04
  • Accepted:
  • Published online: 2019-09-01