Speech emotion recognition (SER) is essential for computers to understand human emotion and plays an important role in human-computer interaction. When emotional speech propagates through different media, traditional deep learning models suffer from low recognition accuracy and weak transfer ability. Here, an acoustic wave equation emotion recognition model, the image saliency gated recurrent acoustic wave equation emotion recognition (ISGR-AWEER) model, is designed. The model consists of two parts: an image saliency extraction module and a gated recurrent module. The first part simulates an attention mechanism and is used to extract the salient regions of the speech signal; the second part is an acoustic wave equation model that simulates a recurrent neural network, which effectively improves cross-media SER accuracy and enables rapid model transfer across media. The effectiveness of the model is verified by experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) corpus and a self-built multi-media emotional speech corpus. Compared with a recurrent neural network, the emotion recognition accuracy is improved by 25%, and the model shows strong cross-media transfer ability.
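To make the two-stage pipeline described above concrete, the following is a minimal, hypothetical PyTorch sketch: a learned saliency gate over the spectrogram image stands in for the image saliency / attention component, and a standard GRU stands in for the acoustic wave equation recurrent component. The class name, layer sizes, and the substitution of a GRU for the wave equation model are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class ISGRAWEERSketch(nn.Module):
    """Illustrative two-stage sketch (not the paper's implementation):
    (1) a per-frame saliency gate over the spectrogram image, and
    (2) a gated recurrent classifier in place of the acoustic
        wave equation component."""
    def __init__(self, n_mels=64, hidden=128, n_classes=4):
        super().__init__()
        # Saliency gate: one weight in (0, 1) per time frame.
        self.saliency = nn.Sequential(
            nn.Conv1d(n_mels, 1, kernel_size=5, padding=2),
            nn.Sigmoid(),
        )
        # Recurrent stage over the saliency-weighted frames.
        self.gru = nn.GRU(n_mels, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, spec):                # spec: (batch, n_mels, frames)
        gate = self.saliency(spec)          # (batch, 1, frames)
        gated = spec * gate                 # emphasize salient time regions
        out, _ = self.gru(gated.transpose(1, 2))   # (batch, frames, hidden)
        return self.classifier(out[:, -1])          # logits: (batch, n_classes)

# Example: 4-class logits for two 64-mel, 300-frame spectrograms.
logits = ISGRAWEERSketch()(torch.randn(2, 64, 300))
print(logits.shape)  # torch.Size([2, 4])

In this sketch the gate simply reweights time frames before the recurrent stage, which mirrors the role the abstract assigns to the saliency module: concentrating the recurrent model's capacity on emotionally salient regions of the signal.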
Table 1: Experimental results of speech emotion recognition on the two emotional speech corpora
Table 2: UA comparison of popular SER models
Table 3: Experimental results of speech emotion recognition on the multi-media emotional speech corpus
Fig. 1: Overall structure of the ISGR-AWEER model
Fig. 2: Image expression of the angry and happy classes
Fig. 3: Expression of salient signal regions for the angry and happy classes
Fig. 4: Structure of the acoustic wave equation model
Fig. 5: Emotion recognition confusion matrix on the self-built corpus
Fig. 6: Emotion recognition confusion matrix on IEMOCAP