Abstract: To address the low accuracy and weak generalization of forged speech detection, a new algorithm based on time-frequency feature fusion is proposed. First, to capture uneven energy distribution and abnormal fundamental frequency fluctuation in speech segments, and to extract subtle differences in semantic coherence, a multi-branch feature fusion network is proposed. Its branches mine the traces that distinguish genuine from forged speech in pitch, pitch intensity, and energy distribution, respectively, so as to better represent the differences in frequency variation, amplitude variation, and peaks between genuine and forged speech and to improve detection accuracy. Second, because the classical coordinate attention mechanism fails to mine fine-grained differences in the time-frequency domain of speech effectively, a time-frequency coordinate attention mechanism is proposed. It jointly encodes energy distribution and fundamental frequency fluctuation anomalies from the time domain and the frequency domain, so as to better characterize the high-frequency energy anomalies common in spectrograms and to improve the generalization of the model. Finally, an adaptive joint loss function is designed to balance the importance of the different branch networks, further improving the model's ability to learn high-frequency energy anomalies and semantic incoherence in forged speech. Performance was evaluated on the logical access (LA) dataset of ASVspoof 2019. Experimental results show that, compared with existing methods, the proposed method reduces the equal error rate (EER) and the minimum normalized tandem detection cost function (min t-DCF) by 0.34% and 0.014, respectively.
In addition, on the unknown attack A17, which is extremely difficult to detect, the method also generalizes well, reducing EER and min t-DCF by 3.9522% and 0.1364, respectively. More generally, it shows better generalization to unknown types of spoofing attacks.
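
For readers unfamiliar with the EER metric reported above, the following is a minimal sketch of how an equal error rate can be estimated from detection scores. It is not the authors' evaluation code: the function name `compute_eer` and the toy scores are hypothetical, it assumes that higher scores mean "more likely genuine", and official ASVspoof scoring tools may differ in detail (for example, in how they interpolate between thresholds).

```python
def compute_eer(bonafide_scores, spoof_scores):
    """Estimate the equal error rate (EER) and the threshold where it occurs.

    Assumption: a higher score means the utterance is more likely genuine.
    """
    # Every observed score is a candidate decision threshold.
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best = None
    for t in thresholds:
        # False rejection rate: genuine utterances scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance rate: forged utterances scored at or above it.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        # Keep the threshold where the two error rates are closest.
        if best is None or gap < best[0]:
            best = (gap, (far + frr) / 2, t)
    return best[1], best[2]


# Toy scores (made up for illustration only).
genuine = [0.9, 0.8, 0.7, 0.4]
forged = [0.6, 0.3, 0.2, 0.1]
eer, threshold = compute_eer(genuine, forged)
print(f"EER = {eer:.2%} at threshold {threshold}")
```

A lower EER means the detector separates genuine and forged speech better. The min t-DCF metric additionally weights detection errors by their cost to a downstream speaker verification system, and is not covered by this sketch.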