Traditional spoken document classification systems usually operate on the text transcribed by a speech recognition system, and therefore suffer from recognition errors. Although fusing speech with the recognized text can reduce the impact of these errors to some extent, fusion performed at the level of the representation vector does not take full advantage of the complementarity between speech and text information. This paper proposes a neural network spoken document classification system based on the fusion of acoustic features and deep features. During training of the neural network, a trained acoustic model is first used to generate, for each document, deep features that contain semantic information. The acoustic features and deep features of each spoken document are then fused frame by frame through a gating mechanism. Finally, the fused features are used for spoken document classification. The proposed system is evaluated on a news broadcast speech corpus. Experimental results show that it clearly outperforms spoken document classification systems based on the fusion of speech and text, reaching a final accuracy of 97.27%.
Table 1 Results of different models
Fig. 1 Architecture of spoken document classification system based on fusion of speech and recognized text
Fig. 2 Architecture of spoken document classification system based on fusion of acoustic features and deep features
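The frame-by-frame gated fusion of acoustic and deep features can be illustrated with a minimal sketch. The abstract does not give the exact form of the gate, so everything below is an assumption made for illustration: the two feature streams are taken to have the same dimension and the same number of frames, the gate is an element-wise sigmoid with per-dimension weights, and the names `gated_fuse`, `w_a`, `w_d`, and `bias` are hypothetical. In a real system the gate parameters would be learned jointly with the classifier.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fuse_frame(acoustic, deep, w_a, w_d, bias):
    """Fuse one frame of acoustic and deep features.

    Assumed gate (not taken from the paper):
        g = sigmoid(w_a * a + w_d * d + bias)   (element-wise)
        fused = g * a + (1 - g) * d
    """
    fused = []
    for a, d, wa, wd, b in zip(acoustic, deep, w_a, w_d, bias):
        g = sigmoid(wa * a + wd * d + b)      # gate value in (0, 1)
        fused.append(g * a + (1.0 - g) * d)   # convex mix of the two features
    return fused


def gated_fuse(acoustic_frames, deep_frames, w_a, w_d, bias):
    """Apply the gate independently to every frame of a document."""
    if len(acoustic_frames) != len(deep_frames):
        raise ValueError("both feature streams must have the same number of frames")
    return [
        gated_fuse_frame(a, d, w_a, w_d, bias)
        for a, d in zip(acoustic_frames, deep_frames)
    ]


if __name__ == "__main__":
    # Two frames of 2-dimensional toy features (made-up values).
    acoustic = [[1.0, 2.0], [0.5, -1.0]]
    deep = [[3.0, 4.0], [1.5, 1.0]]
    zeros = [0.0, 0.0]
    # With all-zero gate parameters the gate is 0.5, so the output is the mean.
    print(gated_fuse(acoustic, deep, zeros, zeros, zeros))
```

Because the fused value is a convex combination, each output dimension always lies between the corresponding acoustic and deep feature values; the gate decides, per frame and per dimension, which source dominates.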