Content-Dependent x-vector for Text-Dependent Speaker Verification
Author:
Affiliation:
University of Science and Technology of China, National Engineering Laboratory for Speech and Language Information Processing, Hefei, 230027, China
Fund Project:
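Abstract (translated from the Chinese version):
The x-vector system uses a neural network to map a variable-length speech segment to a fixed-dimensional vector that represents speaker information, and it has achieved excellent performance on text-independent speaker verification (SV). This paper applies it to text-dependent SV. For the x-vector model, a residual neural network is adopted to obtain more discriminative x-vectors. For utterances that contain multiple words, one residual network is trained per word. At extraction time, an x-vector is extracted for each word separately and a separate speaker decision is made on it; the resulting decision scores are then fused to give the final verification result. Experiments on Part Ⅲ of the RSR2015 database show that the proposed method reduces the equal error rate by 15.34% and 19.7% on the male and female test sets, respectively.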
Abstract:
The x-vector system maps a variable-length utterance to a fixed-dimensional speaker embedding via a neural network, and it performs well in text-independent speaker verification. Here, it is applied to text-dependent speaker verification, and different x-vectors are extracted for the different contents of one sentence. For model selection, a deep residual network (DRN) is used to obtain more discriminative x-vectors. For a sentence with multiple words, word-dependent DRNs are trained to extract word-dependent x-vectors, each of which is fed to its own backend classifier. Finally, the resulting scores are fused to obtain the final verification result. Experiments on Part Ⅲ of the RSR2015 dataset show that the proposed method reduces the equal error rate (EER) by 15.34% and 19.7% on the male and female test sets, respectively.
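The per-word scoring and fusion step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the backend classifier or the fusion rule, so cosine similarity and a (weighted) average are assumed here purely for illustration, and the names `cosine_score`, `fuse_word_scores`, and `verify`, as well as the threshold value, are hypothetical. The x-vectors are taken as given; extracting them with word-dependent DRNs is outside this sketch.

```python
import math


def cosine_score(enroll_vec, test_vec):
    """Cosine similarity between two x-vectors (assumed stand-in backend)."""
    dot = sum(a * b for a, b in zip(enroll_vec, test_vec))
    norm_e = math.sqrt(sum(a * a for a in enroll_vec))
    norm_t = math.sqrt(sum(b * b for b in test_vec))
    return dot / (norm_e * norm_t)


def fuse_word_scores(enroll_by_word, test_by_word, weights=None):
    """Score each word separately, then fuse the scores by a weighted mean.

    enroll_by_word / test_by_word map a word label to its x-vector.
    weights optionally maps a word label to a non-negative weight;
    if omitted, all words contribute equally (an assumption).
    """
    words = sorted(enroll_by_word)
    if sorted(test_by_word) != words:
        raise ValueError("enrollment and test must cover the same words")
    if weights is None:
        weights = {w: 1.0 for w in words}
    total = sum(weights[w] for w in words)
    return sum(
        weights[w] / total * cosine_score(enroll_by_word[w], test_by_word[w])
        for w in words
    )


def verify(enroll_by_word, test_by_word, threshold=0.5):
    """Accept the claimed identity if the fused score reaches the threshold."""
    return fuse_word_scores(enroll_by_word, test_by_word) >= threshold
```

In practice the threshold would be tuned on a development set, for example at the operating point where false acceptance and false rejection rates are equal, which is the point the EER figures above refer to.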