Forged Speech Detection Algorithm Based on Time-Frequency Feature Fusion

doi:10.16337/j.1004-9037.2025.06.013

Volume 40, Issue 6, 2025, pp. 1538-1555

Affiliation: 1. School of Computer Science, School of Cyber Science and Engineering, Nanjing University of Information Science and Technology, Nanjing 210044, China; 2. Engineering Research Center of Digital Forensics, Ministry of Education, Nanjing University of Information Science and Technology, Nanjing 210044, China; 3. School of Software, Nanjing University of Information Science and Technology, Nanjing 210044, China; 4. Institute of Artificial Intelligence, Guangzhou University, Guangzhou 510006, China; 5. School of Foreign Languages, National University of Defense Technology, Nanjing 210039, China
CLC Number: TN912
Abstract:

To address the low accuracy and weak generalization of existing forged speech detection methods, an algorithm based on time-frequency feature fusion is proposed. First, to capture uneven energy distribution and abnormal fundamental-frequency fluctuations in speech segments, and to extract subtle differences in semantic coherence, a multi-branch feature fusion network is proposed. It mines the traces that distinguish genuine from forged speech from pitch, pitch intensity, and energy distribution, respectively, so as to better represent the frequency changes, amplitude changes, and peak differences between genuine and forged speech and improve detection accuracy. Second, because the classical coordinate attention mechanism fails to effectively mine fine-grained differences in the time-frequency domain of speech, a time-frequency coordinate attention mechanism is proposed. It jointly encodes energy-distribution and fundamental-frequency fluctuation anomalies from the time domain and the frequency domain, so as to better characterize the high-frequency energy anomalies common in spectrograms and improve the generalization of the model. Finally, an adaptive joint loss function is designed to balance the importance of the different branch networks, further improving the model's ability to learn high-frequency energy anomalies and semantic incoherence in forged speech. Performance is evaluated on the logical access (LA) dataset of ASVspoof 2019. Experimental results show that, compared with current methods, the proposed method performs well on both the equal error rate (EER) and the minimum normalized tandem detection cost function (min t-DCF), which decrease by 0.34% and 0.014, respectively. In addition, on the unknown attack A17, which is extremely difficult to detect, the method also generalizes well: EER and min t-DCF decrease by 3.9522% and 0.1364, respectively.
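
For readers unfamiliar with the EER metric reported above, the following is a minimal sketch of how an equal error rate can be approximated from detector scores. It is not the authors' evaluation code: the function name `equal_error_rate`, the score convention (higher means more likely genuine), and the toy scores are all assumptions made for illustration.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate the EER by sweeping a threshold over all observed scores.

    Assumption of this sketch: a higher score means "more likely genuine".
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection: a genuine utterance scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: a forged utterance scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the operating point where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical toy scores: one genuine and one forged utterance overlap.
genuine = [0.9, 0.8, 0.7, 0.4]
forged = [0.1, 0.2, 0.3, 0.6]
print(equal_error_rate(genuine, forged))  # 0.25
```

A lower EER means the detector separates genuine and forged speech more cleanly. Evaluation toolkits typically interpolate between thresholds, so the values they report can differ slightly from this coarse sweep.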

Citation:

YUAN Chengsheng, ZHANG Xueyuan, ZHOU Zhili, LI Xinting, FU Zhangjie. Forged Speech Detection Algorithm Based on Time-Frequency Feature Fusion[J]. Journal of Data Acquisition and Processing, 2025, 40(6): 1538-1555.

History

Received: October 13, 2024
Revised: December 26, 2024
Online: December 10, 2025
