Using speech and text feature fusion to improve speech emotion recognition
Author:
Affiliation:
School of Education Science, Nanjing Normal University
Fund Project:
Constructing a situational evaluation environment for Chinese language proficiency with intelligent interaction technology
Abstract:
Emotion recognition plays an important role in human-computer interaction. The purpose of this study was to improve the accuracy of emotion recognition by fusing speech and text features. The speech features were acoustic and prosodic features, and the text features were traditional bag-of-words (BoW) features based on an emotion dictionary together with an N-gram model. We applied these features to emotion recognition and compared their performance on the IEMOCAP dataset for four-class emotion classification. We also compared different feature fusion methods, namely feature-layer fusion and decision-layer fusion. Experimental results show that fusing speech and text features outperforms either feature type alone, and that decision-layer fusion outperforms feature-layer fusion. Moreover, with a convolutional neural network (CNN) classifier, the decision-layer fusion of speech and text features reached an unweighted average recall (UAR) of 68.98%, surpassing the previous best result on the IEMOCAP dataset.
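
To make the two fusion schemes and the UAR metric above concrete, here is a minimal Python sketch (not the authors' code): feature-layer fusion concatenates the per-modality feature vectors before training one classifier, while decision-layer fusion trains one classifier per modality and averages their posterior probabilities. The feature matrices and labels are hypothetical placeholders, logistic regression stands in for the paper's CNN classifier, and UAR is computed as macro-averaged per-class recall.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Hypothetical per-utterance inputs: speech_X mimics acoustic/prosodic
# features, text_X mimics BoW/N-gram features, y holds four emotion labels.
rng = np.random.default_rng(0)
speech_X = rng.normal(size=(200, 40))
text_X = rng.normal(size=(200, 300))
y = rng.integers(0, 4, size=200)  # e.g. angry/happy/sad/neutral, coded 0-3

# Feature-layer fusion: concatenate modality features, train one classifier.
fused_X = np.hstack([speech_X, text_X])
clf_feat = LogisticRegression(max_iter=1000).fit(fused_X, y)
pred_feat = clf_feat.predict(fused_X)

# Decision-layer fusion: one classifier per modality, average the posteriors.
clf_speech = LogisticRegression(max_iter=1000).fit(speech_X, y)
clf_text = LogisticRegression(max_iter=1000).fit(text_X, y)
avg_proba = (clf_speech.predict_proba(speech_X) + clf_text.predict_proba(text_X)) / 2
pred_dec = avg_proba.argmax(axis=1)

# UAR (unweighted average recall) equals macro-averaged per-class recall.
# Scores here are on the toy training data and are purely illustrative.
print("feature-layer UAR:", recall_score(y, pred_feat, average="macro"))
print("decision-layer UAR:", recall_score(y, pred_dec, average="macro"))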