Abstract: Emotion recognition plays an important role in human-computer interaction. The purpose of this study was to improve the accuracy of emotion recognition by fusing speech and text features. The speech features comprised acoustic features and phonological features, and the text features were traditional Bag-of-Words (BoW) features based on an emotion dictionary and an N-gram model. We applied these features to emotion recognition and compared their performance on the IEMOCAP dataset. We also compared different feature fusion methods, including feature-layer fusion and decision-layer fusion. Experimental results show that fusing speech and text features outperforms any single feature type, and that decision-layer fusion of speech and text features outperforms feature-layer fusion. Moreover, with a CNN classifier, the unweighted average recall (UAR) of decision-layer fusion with all three features reaches 68.98%, surpassing the previous best results on the IEMOCAP dataset.
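To make the two fusion strategies concrete, the following is a minimal sketch of the general idea, not the paper's actual pipeline: it assumes feature-layer fusion is implemented by concatenating the per-utterance feature vectors before a single classifier, and decision-layer fusion by averaging the class posteriors of one classifier per feature type. All dimensions, class labels, and the averaging rule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-utterance feature vectors; dimensions are illustrative
# placeholders, not values from the paper.
acoustic = rng.standard_normal(88)      # acoustic feature vector
phonological = rng.standard_normal(32)  # phonological feature vector
bow = rng.standard_normal(500)          # BoW text feature vector

# Feature-layer (early) fusion: concatenate the feature vectors and feed
# the joint vector to a single classifier.
fused_features = np.concatenate([acoustic, phonological, bow])

def softmax(z):
    z = z - z.max()          # stabilize before exponentiating
    e = np.exp(z)
    return e / e.sum()

# Decision-layer (late) fusion: run one classifier per feature type and
# combine their class posteriors, here by simple averaging (one common
# rule; the paper's exact combination rule may differ).
n_classes = 4  # e.g., angry / happy / neutral / sad on IEMOCAP
posteriors = [softmax(rng.standard_normal(n_classes)) for _ in range(3)]
fused_posterior = np.mean(posteriors, axis=0)
predicted_class = int(np.argmax(fused_posterior))
```

The design trade-off is standard: feature-layer fusion lets one classifier model cross-modal interactions but inflates the input dimensionality, while decision-layer fusion keeps each modality's classifier independent and combines only their outputs.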