Forged Speech Detection Algorithm Based on Time-Frequency Feature Fusion
Author:
Affiliation:

1.Nanjing University of Information Science and Technology;2.Guangzhou University;3.School of Foreign Languages, National University of Defense Technology

Fund Project:

National Natural Science Foundation of China (62102189; 62122032); National Social Science Fund of China (2022-SKJJ-C-082); Scientific Research Project of the National University of Defense Technology (JS21-4; ZK21-43)

    Abstract:

    Objective To address the low accuracy and weak generalization of forged speech detection, a detection algorithm based on time-frequency feature fusion is proposed. Methods First, to capture the uneven energy distribution and abnormal fundamental-frequency fluctuations of speech segments, as well as subtle differences in semantic coherence, a multi-branch feature fusion network is proposed. Its branches mine the traces that distinguish genuine from forged speech in terms of pitch, intensity, and energy distribution, so as to better represent differences in frequency variation, amplitude variation, and peaks, and thereby improve detection accuracy. Second, because the classical coordinate attention mechanism does not effectively mine fine-grained differences in the time-frequency domain of speech, a time-frequency coordinate attention mechanism is proposed. It jointly encodes energy-distribution and fundamental-frequency anomalies along the time and frequency directions, so as to better characterize the high-frequency energy anomalies shared by forged speech in the spectrogram and improve the generalization of the model. Finally, an adaptive joint loss function is designed that balances the weights of the different branch networks, further improving the model's ability to learn high-frequency energy anomalies and semantic incoherence in forged speech. Results The method was evaluated on the logical access (LA) dataset of ASVspoof 2019. Compared with existing work, it performs well on both the equal error rate (EER) and the minimum normalized tandem detection cost function (min t-DCF), reducing them by 0.34% and 0.014, respectively. On the unknown attack A17, which is extremely difficult to detect, it also generalizes well, with EER and min t-DCF reduced by 3.9522% and 0.1364, respectively. Conclusion The proposed method shows the best detection performance in forged speech detection, and it also generalizes well to unknown types of spoofing attacks.
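
    The EER reported above is the operating point at which the false acceptance rate on spoofed speech equals the false rejection rate on bona fide speech. The following is a minimal sketch of how such a value can be computed from detector scores, not the evaluation code used in the paper. The function name compute_eer and the convention that a higher score means "more likely bona fide" are assumptions made for illustration.

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Return (eer, threshold) for a detector's scores.

    Assumption: a higher score means "more likely bona fide", and a trial
    is accepted when its score is >= the threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus one value above all
    # scores so that "reject everything" is also considered.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)

    best = None  # (gap, eer, threshold)
    for t in thresholds:
        # False acceptance rate: spoofed trials that are accepted.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: bona fide trials that are rejected.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best is None or gap < best[0]:
            # With finite data the two rates rarely cross exactly, so the
            # EER is approximated by their mean at the closest point.
            best = (gap, (far + frr) / 2.0, t)
    return best[1], best[2]
```

    For example, compute_eer([0.9, 0.8, 0.7, 0.35], [0.1, 0.2, 0.3, 0.75]) yields an EER of 0.25 at threshold 0.7, because one of four spoofed trials is accepted and one of four bona fide trials is rejected there. The threshold sweep is quadratic in the number of trials, which is acceptable for a sketch but not for a full evaluation set.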

History
  • Received: 2024-10-13
  • Last revised: 2025-01-12
  • Accepted: 2025-01-13
  • Published online: