Spoken Document Classification Based on Fusion of Acoustic Features and Deep Features

doi:10.16337/j.1004-9037.2021.05.008

2021(5): 932-938

Affiliation: National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei 230027, China

Abstract:

Traditional spoken document classification systems usually operate on text transcribed by a speech recognition system, so recognition errors can severely degrade their performance. Fusing speech with the recognized text reduces the impact of recognition errors to some extent, but most such fusion is performed at the level of representation vectors and does not fully exploit the complementarity between acoustic and semantic information. This paper proposes a neural network spoken document classification system that fuses acoustic features and deep features. During network training, a pre-trained acoustic model first extracts deep features containing semantic information for each spoken document. The acoustic features and deep features of each document are then fused frame by frame through a gating mechanism, and the fused features are used for classification. In experiments on a broadcast speech news corpus, the proposed system clearly outperforms spoken document classification systems based on the fusion of speech and text, reaching a final classification accuracy of 97.27%.
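
The abstract does not give the gate equations, so the following is only a minimal sketch of what frame-by-frame gated fusion could look like, not the authors' implementation. It assumes that the acoustic and deep features are frame-aligned and already projected to the same dimension, and that the gate is an element-wise sigmoid over a linear function of both inputs. The names `gated_fusion`, `mean_pool`, `w_a`, `w_d` and `bias` are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function used to squash gate activations into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def gated_fusion(acoustic, deep, w_a, w_d, bias):
    """Fuse two frame-aligned feature sequences with an element-wise gate.

    acoustic, deep: lists of frames, each frame a list of `dim` floats.
    w_a, w_d: dim x dim gate weight matrices (hypothetical parameters).
    bias: gate bias of length dim (hypothetical parameter).
    Assumed form: g_t = sigmoid(w_a a_t + w_d d_t + bias),
                  h_t = g_t * a_t + (1 - g_t) * d_t.
    """
    if len(acoustic) != len(deep):
        raise ValueError("feature sequences must have the same number of frames")
    fused = []
    for a_t, d_t in zip(acoustic, deep):
        pre = [x + y + b
               for x, y, b in zip(matvec(w_a, a_t), matvec(w_d, d_t), bias)]
        gate = [sigmoid(p) for p in pre]
        # Each dimension is a convex combination of the two feature streams.
        fused.append([g * a + (1.0 - g) * d
                      for g, a, d in zip(gate, a_t, d_t)])
    return fused


def mean_pool(frames):
    """Average fused frames into one document-level vector (an assumption;
    the paper may aggregate frames differently before classification)."""
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]
```

In a trained system the gate parameters would be learned jointly with the classifier, so the network can decide per frame and per dimension how much to rely on the raw acoustic stream versus the semantic deep features. With all-zero gate parameters the gate equals 0.5 everywhere and the fusion reduces to a plain average of the two streams.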