Forged Speech Detection Algorithm Based on Time-Frequency Feature Fusion
Affiliation:

1. School of Computer Science, School of Cyber Science and Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China
2. Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science and Technology, Nanjing 210044, China
3. School of Software, Nanjing University of Information Science and Technology, Nanjing 210044, China
4. Institute of Artificial Intelligence, Guangzhou University, Guangzhou 510006, China
5. School of Foreign Languages, National University of Defense Technology, Nanjing 210039, China

CLC Number: TN912

    Abstract:

    To address the low accuracy and weak generalization of forged speech detection, a new algorithm based on time-frequency feature fusion is proposed. First, to capture uneven energy distribution in speech segments, abnormal fundamental frequency fluctuations, and subtle differences in semantic coherence, a multi-branch feature fusion network is proposed. It mines the traces that distinguish genuine from forged speech from pitch, pitch intensity, and energy distribution, respectively, so as to better represent the differences in frequency variation, amplitude variation, and peaks between genuine and forged speech, thereby improving detection accuracy. Second, because the classical coordinate attention mechanism fails to effectively mine fine-grained differences in the time-frequency domain of speech, a time-frequency coordinate attention mechanism is proposed. It jointly encodes energy distribution and fundamental frequency fluctuation anomalies from the time domain and the frequency domain, respectively, so as to better characterize the high-frequency energy anomalies common in spectrograms and improve the generalization of the model. Finally, an adaptive joint loss function is designed to balance the importance of the different branch networks, further improving the model's ability to learn high-frequency energy anomalies and semantic incoherence in forged speech. Performance is evaluated on the logical access (LA) dataset of ASVspoof 2019. Experimental results show that, compared with current methods, the proposed method reduces the equal error rate (EER) and the minimum normalized tandem detection cost function (min t-DCF) by 0.34% and 0.014, respectively. In addition, on the unknown attack A17, which is extremely difficult to detect, the method also generalizes well, with EER and min t-DCF decreasing by 3.9522% and 0.1364, respectively.
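For readers unfamiliar with the EER metric reported in the abstract, the following is a minimal sketch of how an equal error rate can be approximated from detection scores. It is not the authors' evaluation code. The function name `equal_error_rate`, the convention that higher scores mean "more likely genuine", and the simple threshold sweep are all assumptions made for illustration; official ASVspoof tooling may interpolate differently.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping thresholds over the observed scores.

    Assumption: a higher score means the utterance looks more genuine,
    and an utterance is accepted when its score is >= the threshold.
    """
    if not bonafide_scores or not spoof_scores:
        raise ValueError("both score lists must be non-empty")

    # Candidate thresholds: every observed score, plus one above all of
    # them so that "reject everything" is also considered.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    thresholds.append(thresholds[-1] + 1.0)

    best_gap, eer = None, None
    for t in thresholds:
        # False acceptance rate: forged speech accepted as genuine.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        # False rejection rate: genuine speech rejected as forged.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # The EER is taken where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


if __name__ == "__main__":
    # Toy scores, invented purely for illustration.
    genuine = [0.9, 0.8, 0.4, 0.7]
    forged = [0.1, 0.2, 0.5, 0.3]
    print(equal_error_rate(genuine, forged))  # 0.25
```

In the toy example, one of four forged utterances is accepted and one of four genuine utterances is rejected at the threshold 0.5, so both error rates equal 25%. A lower EER indicates a better detector, which is why the abstract reports decreases.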

Get Citation

YUAN Chengsheng, ZHANG Xueyuan, ZHOU Zhili, LI Xinting, FU Zhangjie. Forged Speech Detection Algorithm Based on Time-Frequency Feature Fusion[J].,2025,40(6):1538-1555.

History
  • Received: October 13, 2024
  • Revised: December 26, 2024
  • Adopted:
  • Online: December 10, 2025
  • Published: