Special issue

    1  An Acoustic Echo Cancellation System for Double-Talk Scenario
    ZHOU Wenjun XIA Xiuyu
    2022, 37(2):437-445. DOI: 10.16337/j.1004-9037.2022.02.016
    Abstract:
    The double-talk scenario deteriorates the performance of the echo canceller in acoustic echo cancellation, and traditional double-talk detection and other methods of controlling the adaptive step-size cannot deal with it effectively. To solve this problem, a method of adjusting the adaptive step-size according to the spectral signal-to-interference ratio (the ratio of the near-end speech's power spectrum to the echo's power spectrum) is proposed. To reduce computational complexity and processing delay, the partitioned frequency block least mean square (PFBLMS) algorithm is used as the adaptive filtering algorithm, so the adaptive step-size is adjusted in the frequency domain. First, the relationship between the spectral signal-to-interference ratio and the coherence function is established. Second, the step-size is obtained through the coherence function. Third, the adaptive step-size at each frequency point is adjusted in real time according to the calculated value. In addition, dual-filter and sparse-control algorithms are combined to further improve the robustness and convergence performance of the system. Computer simulations show that the system not only guarantees good echo suppression in the double-talk scenario, but also tracks changes of the echo channel promptly. Compared with the double-talk detection method based on the normalized cross-correlation function and the echo cancellation algorithm in the open-source project Speex, the proposed system achieves better echo return loss enhancement (ERLE) and perceptual evaluation of speech quality (PESQ) scores.
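    To illustrate the coherence-driven step-size control described above, here is a minimal numpy/scipy sketch; the mapping from coherence to step-size and all parameter values are illustrative assumptions, not the authors' exact rule.

```python
import numpy as np
from scipy.signal import coherence

def coherence_step_size(mic, echo_est, fs, mu_max=0.5, nperseg=256):
    """Per-frequency adaptive step-size from the magnitude-squared coherence
    between the microphone signal and the estimated echo (sketch).
    High coherence -> echo dominates -> large step-size.
    Low coherence  -> double-talk    -> small step-size."""
    f, gamma2 = coherence(mic, echo_est, fs=fs, nperseg=nperseg)
    # The spectral signal-to-interference ratio implied by the coherence
    # grows as (1 - gamma2) / gamma2, so shrinking mu with falling gamma2
    # freezes adaptation when near-end speech dominates.
    mu = mu_max * np.clip(gamma2, 0.0, 1.0)
    return f, mu
```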
    2  Logical Access Attack Audio Detection Based on LSTM-GRU
    YANG Haitao WANG Huapeng NIU Jinlin CHU Xianteng LIN Nuanhui
    2022, 37(2):396-404. DOI: 10.16337/j.1004-9037.2022.02.012
    Abstract:
    To improve the accuracy of speech spoofing detection, a detection method based on an LSTM-GRU network is proposed. The LSTM-GRU network is a hybrid network combining a long short-term memory (LSTM) layer, a gated recurrent unit (GRU) layer, a dropout layer, a batch normalization layer and a dense layer in series. The LSTM layer handles long-term dependencies in speech sequences, while the GRU layer reduces the number of model parameters. Experiments are conducted on the ASVspoof2019 LA dataset, and 20-dimensional Mel-frequency cepstral coefficient features are extracted for model training. In the test stage, the trained LSTM-GRU model is used for spoofing detection on the test set. Compared with separate GRU and LSTM networks, the results show that the LSTM-GRU network achieves the highest correct recognition rate among the three models; its equal error rate is 27.07% lower than that of the baseline system provided by the ASVspoof2019 challenge; its average accuracy of detecting logical access attack speech is 98.04%; and it offers short training time, over-fitting prevention and high stability. This proves that the proposed method can be effectively applied to the speech logical access attack detection task.
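    A compact PyTorch sketch of the serial LSTM-GRU topology described above; layer sizes and the dropout rate are assumptions, with 20-dimensional MFCC frames as input.

```python
import torch
import torch.nn as nn

class LSTMGRU(nn.Module):
    """Sketch of the serial LSTM-GRU hybrid: LSTM -> GRU -> dropout ->
    batch norm -> dense, classifying bona fide vs. spoofed speech."""

    def __init__(self, n_mfcc=20, hidden=64, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)  # long-term dependencies
        self.gru = nn.GRU(hidden, hidden, batch_first=True)    # fewer parameters than LSTM
        self.dropout = nn.Dropout(0.3)
        self.bn = nn.BatchNorm1d(hidden)
        self.dense = nn.Linear(hidden, n_classes)

    def forward(self, x):            # x: (batch, time, n_mfcc)
        x, _ = self.lstm(x)
        x, _ = self.gru(x)
        x = self.dropout(x[:, -1])   # keep the last time step
        x = self.bn(x)
        return self.dense(x)
```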
    3  Multi-feature Fusion Speech Emotion Recognition Based on Deep Residual Shrinkage Network
    LI Ruihang WU Honglan SUN Youchao WU Huacong
    2022, 37(3):542-554. DOI: 10.16337/j.1004-9037.2022.03.005
    Abstract:
    To address speaker differences in the speech emotion recognition task, the first-order and second-order differences of spectral features are calculated to form a three-channel feature set, which is input to a two-dimensional network. A convolutional neural network, a bidirectional long short-term memory network and an attention mechanism are combined to establish a baseline model, and a deep residual shrinkage network is introduced to allocate channel weights in the two-dimensional network to further improve the accuracy of speech emotion recognition. To improve the learning effect of the model, two different information fusion mechanisms are adopted: feature-layer fusion (Add and Concatenate) and decision-layer fusion (Average and Maximum). The results show that: (1) The Add strategy in feature-layer fusion is more effective; (2) The proposed model achieves unweighted average recall (UAR) of 84.93% and 86.83% on the CASIA and EMO-DB databases respectively. Compared with the baseline model, the unweighted average recall on CASIA and EMO-DB increases by 5.3% and 6.2% respectively after introducing the deep residual shrinkage network.
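    The three-channel input construction can be sketched in a few lines with librosa; the mel parameters below are illustrative, not the paper's settings.

```python
import numpy as np
import librosa

def three_channel_features(y, sr, n_mels=64):
    """Stack a spectral feature with its first- and second-order differences
    into a three-channel input for a 2D network (sketch)."""
    spec = librosa.power_to_db(
        librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels))
    d1 = librosa.feature.delta(spec, order=1)   # first-order difference
    d2 = librosa.feature.delta(spec, order=2)   # second-order difference
    return np.stack([spec, d1, d2], axis=0)     # (3, n_mels, frames)
```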
    4  Expressive Speech Synthesis Method Based on Tacotron Model and Prosodic Correction
    Zhang Xin Hu Hangye Cao Xinyi Wang Wei
    2022, 37(4):909-916. DOI: 10.16337/j.1004-9037.2022.04.018
    Abstract:
    Speech synthesis technology is becoming increasingly mature. To improve the quality of synthesized emotional speech, this study proposes a method combining end-to-end emotional speech synthesis with prosodic correction. Based on the Tacotron model, prosodic parameters are modified to improve the emotional expressiveness of the synthesis system. The Tacotron model is first trained with a large neutral corpus, and a small emotional corpus is then used for further training to synthesize emotional speech. The Praat acoustic analysis tool is then used to analyze the prosodic features of emotional speech in the corpus and summarize the parameters of different emotional states. Finally, guided by these rules, the fundamental frequency, duration and energy of the corresponding emotional speech synthesized by Tacotron are modified to make the emotional expression more accurate. The results of an objective emotion recognition experiment and subjective evaluation show that this method can synthesize more natural and expressive emotional speech.
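    A hedged sketch of the prosodic-correction step: librosa's pitch-shift and time-stretch stand in for the paper's Praat-derived rules, and the rule values below are placeholders rather than the summarized emotion parameters.

```python
import numpy as np
import librosa

def prosody_correct(y, sr, pitch_steps=2.0, rate=1.1, gain=1.2):
    """Rule-based prosodic correction sketch: shift the fundamental
    frequency, adjust duration, and scale energy of synthesized speech.
    The three rule values are hypothetical placeholders."""
    y = librosa.effects.pitch_shift(y, sr=sr, n_steps=pitch_steps)  # F0
    y = librosa.effects.time_stretch(y, rate=rate)                  # duration
    return np.clip(gain * y, -1.0, 1.0)                             # energy
```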
    5  Interactive Dual-Branch Monaural Speech Enhancement Model Based on Critical Frequency Band
    YE Zhongfu ZHAO Ziwei YU Runxiang
    2023, 38(2):262-273. DOI: 10.16337/j.1004-9037.2023.02.003
    Abstract:
    Aiming at the problem that current mainstream dual-branch single-channel speech enhancement methods attend only to full-band information while ignoring sub-band information, an interactive dual-branch model based on the critical frequency bands of the human ear is proposed. The complex-spectrum branch applies a division scheme simulating the ear's critical bands, processing the signal band by band to extract sub-band information, while the amplitude-compensation branch processes the whole frequency band directly to extract full-band information. The complex-spectrum branch is responsible for initially recovering the amplitude and phase of the clean speech signal; at the same time, the sub-band intermediate features learned by this branch are transferred to the amplitude-compensation branch through dedicated modules for compensation. The output of the amplitude-compensation branch further compensates the amplitude of the complex-spectrum branch's output, so as to recover the clean speech spectrum. Experimental results show that the proposed model is superior to other advanced models in restoring speech quality and intelligibility.
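    A numpy sketch of the critical-band division underlying the complex-spectrum branch, using the Zwicker Bark-scale approximation to group STFT bins; the band count and FFT size are assumptions.

```python
import numpy as np

def bark_band_indices(n_fft=512, sr=16000, n_bands=24):
    """Group STFT bins into critical bands on the Bark scale, a common
    approximation of the human ear's critical-band division (sketch)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    # Zwicker's Bark formula.
    bark = 13.0 * np.arctan(0.00076 * freqs) \
         + 3.5 * np.arctan((freqs / 7500.0) ** 2)
    edges = np.linspace(bark[0], bark[-1], n_bands + 1)
    edges[-1] += 1e-3   # include the Nyquist bin in the last band
    return [np.where((bark >= lo) & (bark < hi))[0]
            for lo, hi in zip(edges[:-1], edges[1:])]
```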
    6  A Light-Weight Full-Band Speech Enhancement Model
    HU Qinwen HOU Zhongshu LE Xiaohuai LU Jing
    2023, 38(2):274-282. DOI: 10.16337/j.1004-9037.2023.02.004
    Abstract:
    Deep neural network based full-band speech enhancement systems face the challenges of high computational resource demands and imbalanced frequency distribution. In this paper, a light-weight full-band model is proposed based on a dual-path convolutional recurrent network with two dedicated strategies: a learnable spectral compression mapping for more effective compression of high-band spectral information, and a multi-head attention mechanism for more effective modeling of the global spectral pattern. Experiments validate the efficacy of the proposed strategies and show that the proposed model achieves competitive performance with only 0.89×10⁶ parameters.
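    A minimal PyTorch sketch of a learnable spectral compression mapping as described: low-band bins pass through while high-band bins are compressed by a learned linear map. The split point and sizes are assumptions (481 bins would correspond to a 960-point FFT at 48 kHz).

```python
import torch
import torch.nn as nn

class SpectralCompression(nn.Module):
    """Learnable spectral compression sketch: keep the low band, compress
    the high band into a compact learned representation."""

    def __init__(self, n_bins=481, split=161, n_compressed=64):
        super().__init__()
        self.split = split
        self.compress = nn.Linear(n_bins - split, n_compressed, bias=False)

    def forward(self, spec):                    # spec: (batch, frames, n_bins)
        low, high = spec[..., :self.split], spec[..., self.split:]
        return torch.cat([low, self.compress(high)], dim=-1)
```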
    7  Multi-scale Expressive Chinese Speech Synthesis
    GAO Jie XIAO Dajun XU Xialing LIU Shaohan YANG Qun
    2023, 38(6):1458-1468. DOI: 10.16337/j.1004-9037.2023.06.019
    Abstract:
    Common methods for enhancing the expressiveness of synthesized speech typically encode the reference audio as a fixed-dimensional prosody embedding, which is fed into the decoder of the speech synthesis model along with the text embedding, thereby introducing prosody information into the synthesis process. However, this approach only captures prosody information at the global level, neglecting fine-grained prosody details at the word or phoneme level; consequently, the synthesized speech may still exhibit unnatural pronunciation and flat intonation on certain words. To tackle these issues, this paper introduces a multi-scale expressive Chinese speech synthesis method based on Tacotron2. First, two variational auto-encoders are employed to extract global-level prosody information and phoneme-level pitch information from the reference audio, and this multi-scale variational information is incorporated into the speech synthesis model. Additionally, during training we minimize the mutual information between the prosody embedding and the pitch embedding, eliminating the correlation between the different feature representations and disentangling them. Experimental results demonstrate that the proposed method improves the subjective mean opinion score by 2% and reduces the F0 frame error rate by 14% compared with the single-scale expressive speech synthesis method, indicating that it generates more natural and expressive speech.
    8  Speech Steganalysis Method for Echo Hiding Based on Image of Cepstrum
    Tang Junhao Du Qingzhi Long Hua Shao Yubin Li Yimin
    2023, 38(6):1469-1481. DOI: 10.16337/j.1004-9037.2023.06.020
    Abstract:
    After echo hiding, the cepstrum of a speech signal peaks at the echo delay. Traditional echo hiding steganalysis mainly uses statistical characteristics of the cepstrum coefficients as steganalysis features. However, the cepstral peak of the stego signal is not obvious when the echo amplitude is low, and the detection performance of methods based on statistical characteristics is unsatisfactory. This paper combines cepstrum analysis with image recognition technology and proposes a steganalysis method for speech echo hiding based on cepstrum images. The speech signal is divided into frames and windowed for cepstrum calculation. Then an image is generated with time as the horizontal axis, cepstrum sequence points as the vertical axis, and cepstrum coefficient amplitude as the gray level. The generated cepstrum image is used as the steganalysis input, and a residual neural network is used as the classifier for echo hiding steganalysis. Experimental results show that the detection accuracy for three classical echo hiding algorithms reaches 98.2%, 98.6% and 96.1% respectively at low echo amplitude. The detection accuracy at low echo amplitude is greatly improved compared with traditional echo hiding steganalysis methods, solving the problem of their unsatisfactory detection performance at low echo amplitude.
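    A numpy sketch of the cepstrum-image construction described above; the frame length, hop and image height are illustrative parameters.

```python
import numpy as np

def cepstrum_image(x, frame_len=512, hop=256, n_ceps=128):
    """Grayscale cepstrum image: time on the horizontal axis, cepstrum
    index on the vertical axis, amplitude as gray level (sketch)."""
    frames = []
    win = np.hanning(frame_len)
    for start in range(0, len(x) - frame_len + 1, hop):
        frame = x[start:start + frame_len] * win
        spec = np.fft.rfft(frame)
        # Real cepstrum: inverse FFT of the log magnitude spectrum.
        ceps = np.fft.irfft(np.log(np.abs(spec) + 1e-12))
        frames.append(np.abs(ceps[:n_ceps]))
    img = np.array(frames).T                       # (n_ceps, n_frames)
    img = 255.0 * (img - img.min()) / (img.max() - img.min() + 1e-12)
    return img.astype(np.uint8)                    # ready for a ResNet classifier
```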
    9  Zero Resource Korean ASR Based on Acoustic Model Sharing
    Wang Haoyu Jeon Eunah Zhang Weiqiang Li Ke Huang Yukai
    2023, 38(1):93-100. DOI: 10.16337/j.1004-9037.2023.01.007
    Abstract:
    A precise speech recognition system is usually built on a large amount of training data with handcrafted transcriptions, which sets a barrier to the recognition of many low-resource languages. Acoustic model sharing, based on the similarity between a rich-resource and a low-resource language pair, provides a new way to solve this problem and helps to build an automatic speech recognition (ASR) system without any training data in the given low-resource language. This paper extends the method to Korean speech recognition. Specifically, we train an acoustic model on Mandarin data and lay down a set of mapping rules between Mandarin and Korean phonemes. A character error rate (CER) of 27.33% is achieved on the Zeroth Korean test set without using any Korean speech data. Moreover, we also compare source-to-target and target-to-source phoneme mapping rules, and show that the latter is more appropriate for acoustic model sharing.
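    A toy sketch of target-to-source phoneme mapping; the table entries are hypothetical placeholders, not the paper's actual Mandarin-Korean rules.

```python
# Target-to-source mapping sketch: each Korean phoneme (target) is mapped
# onto a close Mandarin phoneme (source), so the Mandarin acoustic model
# can score Korean speech. All entries are illustrative placeholders.
KO_TO_ZH = {
    "k": "k", "t": "t", "p": "p",   # plosives with close articulation
    "a": "a", "o": "o", "u": "u",   # shared vowel qualities
}

def map_pronunciation(korean_phones):
    """Rewrite a Korean pronunciation with Mandarin acoustic-model units."""
    return [KO_TO_ZH.get(ph, "<unk>") for ph in korean_phones]

print(map_pronunciation(["k", "a", "t"]))  # ['k', 'a', 't']
```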
    10  Multi-channel Speech Enhancement Based on Joint Graph Learning
    ZHANG Pengcheng GUO Haiyan WANG Tingting YANG Zhen
    2023, 38(2):283-292. DOI: 10.16337/j.1004-9037.2023.02.005
    Abstract:
    Considering that the spatial relationship between channels affects noise reduction, graph signal processing can capture this latent relationship; however, if the physical spatial distribution map is used directly, its time-varying characteristics cannot be reflected in real time. Therefore, we propose a multi-channel speech enhancement method based on joint graph learning. Firstly, we propose a joint time-space graph learning method that jointly optimizes the array space graph and the intra-frame speech graph, minimizing the sum of the smoothness of the multi-channel noisy speech signal on the spatial graph, the smoothness of the noisy speech signal from the reference channel on the speech frame graph, the sparsity of the Laplacian matrix and the sparsity of the adjacency matrix. Based on the learned space graph and intra-frame graph, the joint time-space graph of the multi-channel speech signal is constructed. On this basis, the multi-channel speech graph signal is enhanced by applying the joint graph transform and the fixed beamforming (FBF) method. Experimental results show that the proposed joint-graph-learning-based FBF (JGL-FBF) method significantly improves the signal-to-noise ratio (SNR) and perceptual evaluation of speech quality (PESQ) of enhanced speech compared with the traditional FBF method. In addition, the results also show that the accuracy of delay compensation affects the speech enhancement performance of JGL-FBF.
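    The smoothness terms in the learning objective can be written compactly; a numpy sketch, where W is an assumed candidate adjacency matrix.

```python
import numpy as np

def laplacian(W):
    """Combinatorial graph Laplacian from an adjacency matrix."""
    return np.diag(W.sum(axis=1)) - W

def smoothness(X, W):
    """tr(X^T L X): small when the signals X (n_nodes, n_obs) vary little
    across strongly connected nodes -- the quantity the joint objective
    minimizes, together with Laplacian/adjacency sparsity terms."""
    return np.trace(X.T @ laplacian(W) @ X)

# Sketch of the joint objective over candidate space/frame graphs:
# J = smoothness(X_multichannel, W_space) + smoothness(x_ref_frames, W_frame)
#     + alpha * np.abs(laplacian(W_space)).sum() + beta * np.abs(W_frame).sum()
```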
    11  Vietnamese Speech Recognition Based on Pre-training and Phone-Based Byte-Pair Encoding
    SHEN Zhijie GUO Wu
    2023, 38(1):101-110. DOI: 10.16337/j.1004-9037.2023.01.008
    Abstract:
    Owing to the state-of-the-art performance of unsupervised pre-training in many low-resource languages, wav2vec 2.0 has become a research hotspot. In this paper, Vietnamese continuous speech recognition is carried out on the basis of the pre-trained model. Phonetic information is integrated into acoustic modeling based on the connectionist temporal classification (CTC) loss function, and phones and position-dependent phones are selected as the basic modeling units. To balance the number of modeling units and the granularity of the model, a byte-pair encoding (BPE) algorithm is used to generate phone-based subwords, integrating contextual information into the acoustic modeling process. Experiments are carried out on the low-resource Vietnamese development set of NIST's BABEL task, and the proposed algorithm significantly improves on the wav2vec 2.0 baseline system, reducing the word error rate from 37.3% to 29.4%.
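    The phone-based BPE step can be sketched with the classic merge loop; this conveys the idea, not the paper's exact recipe.

```python
from collections import Counter

def learn_phone_bpe(corpus, n_merges):
    """Minimal byte-pair encoding over phone sequences: repeatedly merge
    the most frequent adjacent phone pair into a new subword unit."""
    corpus = [list(seq) for seq in corpus]
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for seq in corpus:
            pairs.update(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        for seq in corpus:
            i = 0
            while i < len(seq) - 1:
                if seq[i] == a and seq[i + 1] == b:
                    seq[i:i + 2] = [a + b]   # merge into one phone subword
                else:
                    i += 1
    return merges

# e.g. learn_phone_bpe([["t", "a", "t", "a"], ["t", "a", "k"]], 2)
```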
    12  An Overview of Audio Steganography Methods: From Tradition to Deep Learning
    ZHANG Xiongwei GE Xiaoyi SUN Meng SONG GONG Kunkun LI Li
    2023, 38(5):995-1016. DOI: 10.16337/j.1004-9037.2023.05.001
    Abstract:
    As a widely used medium in cyberspace, digital audio serves as an excellent cover for carrying secret information and is often employed in covert communication systems that prioritize real-time performance, low complexity, and imperceptibility. Audio steganography, one of the key techniques for ensuring network information security and confidential communication, has attracted increasing attention from scholars. This paper presents a systematic review of the development of audio steganography methods. Firstly, we introduce the basics of audio steganography and summarize the problem description, evaluation indicators, common data formats, and tools. Secondly, according to the embedding domain, traditional audio steganography methods are classified into time-domain, transform-domain and compression-domain methods, and their advantages and disadvantages are analyzed. Furthermore, based on the steganographic cover, deep learning-based methods are categorized into embedding-cover-based, generating-cover-based, and coverless audio steganography, and the three are compared and analyzed. Finally, directions for further research in audio steganography are pointed out.
    13  Multi-channel Linear Prediction for Speech Dereverberation Using Cross-Band Filters and Sparse Priors
    KANG Yao KANG Fang YANG Feiran
    2024, 39(5):1135-1146. DOI: 10.16337/j.1004-9037.2024.05.007
    Abstract:
    Multi-channel linear prediction (MCLP) is one of the most popular speech dereverberation methods. Most existing studies adopt the band-to-band spectral subtraction model to obtain the desired speech signal in each frequency band, but this neglects the interaction between different frequencies. This paper proposes an MCLP-based speech dereverberation method using a cross-band spectral subtraction model instead of the widely adopted band-to-band model. The proposed model employs cross-band filters to account for the interactions between different frequencies. We model the desired signal with the complex generalized Gaussian (CGG) distribution; compared with the Gaussian distribution, the CGG distribution can capture the sparse nature of speech signals through a suitable shape parameter. Within the maximum likelihood estimation framework, the speech dereverberation problem is formulated as an optimization problem over the band-to-band and cross-band filters, and an optimization algorithm with guaranteed convergence is derived based on the majorization-minimization method. Speech dereverberation experiments under various reverberation times, channel numbers and source-to-microphone distances demonstrate that the proposed method significantly outperforms traditional methods in dereverberation performance.
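    For orientation, the band-to-band Gaussian baseline the paper extends is essentially WPE-style MCLP run independently per frequency bin; the cross-band model adds neighboring bins' delayed frames to the regression, and the CGG prior changes the variance weighting. A numpy sketch of that baseline follows.

```python
import numpy as np

def mclp_band(Y, delay=3, order=10, iters=3, eps=1e-8):
    """Band-to-band MCLP (WPE-style) for one frequency bin.
    Y: (channels, frames) complex STFT of one bin; returns the
    dereverberated reference channel (Gaussian-weighted sketch)."""
    C, T = Y.shape
    d = Y[0].copy()                                  # desired-signal estimate
    for _ in range(iters):
        lam = np.maximum(np.abs(d) ** 2, eps)        # time-varying variance
        # Stack delayed multi-channel frames as regressors.
        X = np.zeros((C * order, T), dtype=complex)
        for k in range(order):
            tau = delay + k
            X[k * C:(k + 1) * C, tau:] = Y[:, :T - tau]
        R = (X / lam) @ X.conj().T                   # weighted covariance
        p = (X / lam) @ Y[0].conj()
        g = np.linalg.solve(R + eps * np.eye(C * order), p)
        d = Y[0] - g.conj() @ X                      # subtract late reverberation
    return d
```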
    14  Speech Emotion Recognition with Multi-task Learning
    LI Yunfeng YAN Zulong GAO Tian FANG Xin ZOU Liang
    2024, 39(2):424-432. DOI: 10.16337/j.1004-9037.2024.02.015
    Abstract:
    In recent speech emotion recognition research, deep learning models are used to identify emotion from speech signals. However, traditional single-task models do not pay enough attention to the acoustic emotional information in speech, resulting in low recognition accuracy. In view of this, this paper proposes a multi-task, end-to-end speech emotion recognition network to mine the acoustic emotion in speech and improve recognition accuracy. To avoid the information loss caused by using frequency-domain features, this paper adopts wav2vec 2.0 as the backbone network to extract the acoustic and semantic features of speech, and an attention mechanism is used to integrate the two kinds of features into self-supervised features. To make full use of the acoustic emotional information in speech, emotion-related phoneme recognition is used as an auxiliary task, and a multi-task learning model mines the acoustic emotion in the self-supervised features. Experimental results on the public IEMOCAP dataset show that the proposed multi-task model achieves a weighted accuracy of 76.0% and an unweighted accuracy of 76.9%, significantly improving on the traditional single-task model. Meanwhile, ablation experiments verify the effectiveness of the auxiliary task and the self-supervised network fine-tuning strategy.
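    A structural PyTorch sketch of the two-head multi-task setup: a shared encoder (a stand-in here; the paper uses a pre-trained wav2vec 2.0 backbone) feeding an utterance-level emotion head and a frame-level auxiliary phoneme head.

```python
import torch
import torch.nn as nn

class MultiTaskSER(nn.Module):
    """Two-head multi-task sketch with a placeholder waveform encoder."""

    def __init__(self, feat_dim=768, n_emotions=4, n_phonemes=40):
        super().__init__()
        self.encoder = nn.Sequential(               # stand-in backbone
            nn.Conv1d(1, feat_dim, kernel_size=400, stride=320), nn.ReLU())
        self.emotion_head = nn.Linear(feat_dim, n_emotions)
        self.phoneme_head = nn.Linear(feat_dim, n_phonemes)  # auxiliary task

    def forward(self, wav):                          # wav: (batch, samples)
        h = self.encoder(wav.unsqueeze(1)).transpose(1, 2)  # (batch, T, feat)
        emo = self.emotion_head(h.mean(dim=1))       # utterance-level emotion
        pho = self.phoneme_head(h)                   # frame-level phonemes
        return emo, pho

# Joint objective (sketch): loss = ce(emo, y_emo) + alpha * aux(pho, y_pho)
```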
    15  Audio Adversarial Examples Generation Method Based on Self-attention Mechanism
    LI Zhuhai Guo Wu
    2024, 39(2):416-423. DOI: 10.16337/j.1004-9037.2024.02.014
    Abstract:
    With the widespread dissemination of personal speech and the development of automatic speaker recognition algorithms, personal privacy protection is in a high-risk situation. Audio adversarial examples can protect personal voiceprint features by disabling automatic speaker recognition algorithms while leaving the subjective hearing of the human ear unchanged. We improve the typical adversarial attack algorithm FoolHD with a multi-head self-attention mechanism, which we call FoolHD-MHSA. First, convolutional neural networks are introduced as the encoder to extract adversarial perturbation spectrograms. Second, the self-attention mechanism extracts correlations among different parts of the perturbation spectrogram from a global perspective, focusing the network on important information and suppressing useless information. Finally, the processed perturbation spectrogram is steganographically embedded into the input spectrogram with a decoder to obtain the adversarial example spectrogram. Experimental results show that FoolHD-MHSA generates adversarial examples with a higher attack success rate and a higher average PESQ score than FoolHD.
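    The global refinement step can be sketched with PyTorch's built-in multi-head self-attention; the dimensions are illustrative and this is not the full FoolHD-MHSA pipeline.

```python
import torch
import torch.nn as nn

class PerturbationAttention(nn.Module):
    """Multi-head self-attention over the frames of a perturbation
    spectrogram, so each frame attends to all others (sketch; 256
    frequency bins assumed so the embedding divides the head count)."""

    def __init__(self, n_freq=256, n_heads=4):
        super().__init__()
        self.mhsa = nn.MultiheadAttention(n_freq, n_heads, batch_first=True)

    def forward(self, pert):                 # pert: (batch, frames, n_freq)
        refined, _ = self.mhsa(pert, pert, pert)
        return pert + refined                # residual connection
```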
    16  An End-to-End Singing Voice Synthesis Method with Excitation and Vibrato Modeling
    ZHOU Xiao HU Yajun PAN Jia HU Guoping LING Zhenhua
    2024, 39(2):406-415. DOI: 10.16337/j.1004-9037.2024.02.013
    Abstract:
    In recent years, singing voice synthesis technology has developed rapidly, and end-to-end singing voice synthesis based on variational inference and normalizing flow (VISinger) has become mainstream. But there is still a gap between its output and the sound quality of real singers, mainly reflected in audibly discontinuous pitch, poorly synthesized vibrato, and unstable articulation in the synthesized singing voice. We propose three main improvements. Firstly, to address fundamental frequency stability, we add an excitation module in the decoder to explicitly provide the fundamental frequency information to the decoder in the form of an excitation signal. Secondly, to address unnatural vibrato synthesis, we add a vibrato prediction module that explicitly models the vibrato in the song using flow with variational data augmentation. Thirdly, we further add a ReZero strategy to the frame prior network. Experimental results show that the excitation signal improves the stability of the synthesized fundamental frequency, the vibrato modeling significantly enhances the recovery of vibrato, and the ReZero strategy somewhat improves training speed and articulation stability. Subjective evaluation demonstrates that the proposed model has a significant advantage over VISinger in the naturalness of singing voice synthesis, with the mean opinion score (MOS) reaching 3.95, and also has a significant advantage over the two-stage modeling method DiffSinger+HiFiGAN, proving the effectiveness of the proposed method.
    17  A Survey on Sound Acquisition Theories and Application Methods of Distributed Microphone Arrays
    ZHANG Jie HU De ZHANG Xiaolei Ling Zhenhua
    2024, 39(5):1085-1113. DOI: 10.16337/j.1004-9037.2024.05.004
    Abstract:
    Over the past few decades, microphone array technology has matured and been applied to various human-machine interaction systems, e.g., video-conferencing, intelligent television, mobile telephony, and hearing aids. However, in realistic noisy or distant interaction scenarios, the sound acquisition quality (SAQ) of conventional topology-constrained microphone arrays cannot be guaranteed. With the wide use of wireless intelligent terminal devices, the distributed microphone array (DMA), also called the wireless acoustic sensor network (WASN), offers new possibilities for improving the SAQ of speech interaction systems in complex and open domains, and shows superiority in array organization, application experience and scene coverage. Recently, DMA has exhibited good application potential in many speech interaction tasks, covering almost all tasks that conventional microphone arrays can handle. This survey summarizes existing important sound acquisition theories and application methods of DMA, including principles of array organization, utility evaluation of microphone nodes and application methods in combination with downstream speech tasks. Finally, we briefly discuss key challenges and development trends on the road to practical DMA deployments.
    18  Improved Degenerate Unmixing Estimation Technique Algorithm Based on Two-Step Single-Source Point Screening
    WU Lifu MA Sijia SUN Kang
    2024, 39(5):1114-1125. DOI: 10.16337/j.1004-9037.2024.05.005
    Abstract:
    The degenerate unmixing estimation technique (DUET) is a typical underdetermined blind source separation algorithm. However, as a binary time-frequency mask-based method, DUET erroneously retains some interference signals. This paper proposes an improved DUET algorithm based on two-step single-source point screening. The cosine-angle algorithm is used for the first screening step, and a similarity calculation method is then employed for the second. After more accurate target and interference signals are obtained through the two-step screening, a filter designed to cancel the interference signals replaces DUET's binary time-frequency mask, achieving interference suppression and target signal extraction. Simulation results show that the proposed method performs well in both determined and underdetermined blind source separation.
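    A numpy sketch of the first screening step, the cosine-angle criterion: a time-frequency point is retained as a single-source point when the real and imaginary parts of the mixture vector are nearly collinear; the threshold value is an assumption.

```python
import numpy as np

def single_source_points(X, cos_thresh=0.99):
    """Cosine-angle single-source point screening (sketch).
    X: (n_mics, n_freq, n_frames) complex STFT of the mixtures.
    Returns a boolean mask over (n_freq, n_frames)."""
    R, I = X.real, X.imag
    num = np.abs(np.sum(R * I, axis=0))
    den = np.linalg.norm(R, axis=0) * np.linalg.norm(I, axis=0) + 1e-12
    return (num / den) >= cos_thresh   # |cos angle(Re, Im)| close to 1
```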
    19  State of the Art and Prospects of Deep Learning-Based Speaker Verification
    LI Jianchen HAN Jiqing
    2024, 39(5):1062-1084. DOI: 10.16337/j.1004-9037.2024.05.003
    Abstract:
    With the development of deep learning, speaker verification has made great progress. Compared with other biometric identification technologies, it has the advantages of remote operation, low cost, and easy human-computer interaction, and thus shows a wide range of application prospects in public security, criminal investigation, and financial services. This paper provides a systematic overview of the development of deep learning-based speaker verification techniques. Firstly, the history and research status of deep learning-based speaker representation models are introduced in four aspects: model input and structure, pooling layers, supervised loss functions, and self-supervised learning and pre-training models. Then, the challenges faced by speaker verification are discussed, such as cross-domain mismatch problems including noise interference, channel mismatch and far-field speech, and the corresponding domain adaptation and domain generalization methods are outlined. Finally, further research directions are presented.
    20  Kalman-Filter-Based Acoustic Feedback Cancellation with State Detection Model for Fast Recovery from Abrupt Path Changes
    GUO Haocheng CHEN Kai LU Jing
    2024, 39(5):1126-1134. DOI: 10.16337/j.1004-9037.2024.05.006
    Abstract:
    The partitioned block frequency domain Kalman filter (PBFDKF) has been applied to acoustic feedback cancellation (AFC) for its fast convergence and low steady-state misalignment. However, the Kalman filter at steady state may encounter deadlock when the feedback path changes abruptly, exhibiting suboptimal tracking capability. In this paper, a Kalman-filter-based AFC with a state detection model (KFSD) is proposed to effectively improve robustness against abrupt path changes. The narrowband energies of the microphone signal, the residual signal and the update of the Kalman filter are used as inputs to the state detection model. The detection results are then merged into the state estimation error covariance matrix of the Kalman filter, achieving better re-convergence against abrupt path changes. Experimental results demonstrate the superior performance of the proposed KFSD algorithm, showcasing a high true positive rate, a low false alarm rate, and a short state detection latency. These advantages lead to faster re-convergence and enhanced acoustic feedback cancellation.
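    The covariance re-inflation idea can be sketched with a diagonal Kalman update on a vector path estimate; this is an illustrative simplification, not the paper's PBFDKF.

```python
import numpy as np

def kalman_afc_step(w, P, x, mic, change_detected,
                    q=1e-4, r=1e-2, p_reset=1.0):
    """One adaptation step for a feedback-path estimate w given the recent
    loudspeaker samples x and the microphone sample mic (diagonal P).
    When the state detector flags an abrupt change, the covariance is
    re-inflated so the filter escapes deadlock and re-converges."""
    if change_detected:
        P = P + p_reset                       # re-open the filter
    P = P + q                                 # process-noise prediction
    e = mic - np.dot(w, x)                    # residual signal
    k = P * x / (np.dot(x, P * x) + r)        # Kalman gain
    w = w + k * e
    P = (1.0 - k * x) * P                     # covariance update
    return w, P, e
```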
    21  Research Situation and Prospects of Multi-speaker Separation and Target Speaker Extraction
    BAO Changchun YANG Xue
    2024, 39(5):1044-1061. DOI: 10.16337/j.1004-9037.2024.05.002
    Abstract:
    As a cutting-edge technology in speech signal processing, speech separation has significant research value and broad application prospects. Typically, the signal captured by microphones contains speech from multiple speakers, noise and reverberation, so speech separation is necessary to improve the user experience and the performance of back-end devices. Speech separation originated from the well-known cocktail party problem and aims to separate the individual speech signals from the mixture. In recent years, researchers have proposed a large number of speech separation methods that have significantly improved separation performance. This paper systematically reviews and summarizes these methods. First, based on whether auxiliary information about the target speaker is leveraged, speech separation is divided into two categories: multi-speaker separation and target speaker extraction. Second, the methods in each category are introduced in detail, following the progression from conventional approaches to deep learning-based techniques. Finally, the existing challenges in speech separation are discussed and prospective future research directions are highlighted.
    22  Forged Speech Detection Algorithm Based on Time-Frequency Feature Fusion
    YUAN Chengsheng ZHANG Xueyuan ZHOU Zhili LI Xinting FU Zhangjie
    2025, 40(6):1538-1555. DOI: 10.16337/j.1004-9037.2025.06.013
    Abstract:
    To address the low accuracy and weak generalization of forged speech detection, a new algorithm based on time-frequency feature fusion is proposed. Firstly, in order to uncover the uneven energy distribution and abnormal fundamental frequency fluctuations of speech fragments, and to extract subtle differences in semantic coherence, a multi-branch feature fusion network is proposed to mine the traces distinguishing genuine and forged speech from pitch, pitch intensity and energy distribution respectively, so as to better represent the frequency changes, amplitude changes and peak differences between genuine and forged speech and improve detection accuracy. Secondly, since the classical coordinate attention mechanism fails to effectively mine fine-grained differences in the time-frequency domain of speech, a time-frequency coordinate attention mechanism is proposed that jointly encodes the energy distribution and fundamental frequency anomalies from the time and frequency domains respectively, so as to better characterize the high-frequency energy anomalies common in spectrograms and improve the model's generalization. Finally, an adaptive joint loss optimization function is designed to balance the importance of the different branch networks and further improve the model's ability to learn high-frequency energy anomalies and semantic incoherence in forged speech. Performance is evaluated on the logical access (LA) dataset of ASVspoof 2019. Experimental results show that, compared with current methods, the proposed method performs well on both the equal error rate (EER) and the minimum normalized tandem detection cost function (min t-DCF), which decrease by 0.34% and 0.014, respectively. In addition, when dealing with the unknown attack A17, which is extremely difficult to detect, it also shows good generalization, with EER and min t-DCF decreasing by 3.9522% and 0.1364, respectively.
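    One plausible realization of an adaptive joint loss is learnable uncertainty weighting over the branch losses; the paper's exact balancing scheme may differ, so treat this PyTorch sketch as an assumption.

```python
import torch
import torch.nn as nn

class AdaptiveJointLoss(nn.Module):
    """Adaptive joint loss sketch: each branch loss gets a learnable
    log-variance weight, down-weighting noisy branches while a penalty
    term discourages over-shrinking any weight."""

    def __init__(self, n_branches=3):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(n_branches))

    def forward(self, losses):       # losses: list of scalar branch losses
        total = 0.0
        for i, li in enumerate(losses):
            total = total + torch.exp(-self.log_vars[i]) * li + self.log_vars[i]
        return total
```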
    23  Polyphonic Sound Event Detection Based on Transfer Learning Convolutional Retentive Network
    CHEN Pengfei XIA Xiuyu
    2025, 40(3):730-740. DOI: 10.16337/j.1004-9037.2025.03.013
    Abstract:
    Aiming at the limited availability of strongly annotated datasets and the sharp degradation of detection performance in real-world scenarios for polyphonic sound event detection, a polyphonic sound event detection method based on a transfer-learning convolutional retentive network is proposed. Firstly, the method utilizes convolutional blocks with pre-trained weights to extract local features of the audio data. Subsequently, the local features, along with orientation features, are input into a residual feature enhancement module for feature fusion and channel dimension reduction. The fused features are then fed into a retentive network with regularization to further learn the temporal information in the audio data. Experimental results demonstrate that, compared with the champion system of the DCASE challenge, the method reduces error rates by 0.277 and 0.106 and increases F1 scores by 22.6% and 6.6% on the development and evaluation sets of the DCASE 2016 Task 3 dataset, respectively. On the development and evaluation sets of the DCASE 2017 Task 3 dataset, the error rates are reduced by 0.22 and 0.123, and the F1 scores increase by 17.2% and 14.4%, respectively.
    24  A Noise Reduction Method of Bird Songs Based on Improved Adaptive Kalman Filtering
    WANG Haoran ZHANG Chun ZHANG Guohui WANG Wenzhuo WANG Nana
    2025, 40(6):1568-1580. DOI: 10.16337/j.1004-9037.2025.06.015
    Abstract:
    In island wetlands, the acoustic environment is complex, with noise sources such as wind, rain, and ocean waves. To handle these interferences in bird song processing and improve the accuracy of species identification, a noise reduction method based on adaptive Kalman filtering with linear predictive coding (A-KF-LPC) is proposed for real-time bird song monitoring under such complex acoustic conditions. The A-KF-LPC filter enhances stability by applying weighted filtering to bird song signals, while suppressing noise and providing precise estimates of uncertain short segments within the acoustic signal, progressively approximating the true signal. Simulations verify the noise reduction performance of the A-KF-LPC filter. Experimental results show that under different signal-to-noise ratios (SNRs), the A-KF-LPC method denoises bird songs more effectively than traditional Kalman filtering and least mean squares (LMS) adaptive filtering. Even when the signal is fully masked by -10 dB noise, the method can still filter out part of the noise. The proposed A-KF-LPC method holds significant application value in acoustic signal processing, offering an efficient and feasible solution for research on bird species in wetland ecosystems, with potential for broader applications.
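    A minimal sketch of the underlying KF-LPC idea: frame-wise LPC coefficients drive an autoregressive Kalman filter (the A-KF-LPC method adds adaptive weighted filtering on top; the noise variances here are assumed).

```python
import numpy as np
import librosa

def kalman_lpc_denoise(noisy, order=12, frame=512, var_v=1e-2):
    """Frame-wise Kalman denoising with an LPC-driven AR state model
    (sketch). var_v is an assumed observation-noise variance."""
    out = np.zeros_like(noisy)
    x = np.zeros(order)                       # state: the last `order` samples
    P = np.eye(order)
    H = np.zeros(order); H[-1] = 1.0          # we observe the newest sample
    for s in range(0, len(noisy) - frame + 1, frame):
        seg = noisy[s:s + frame]
        a = librosa.lpc(seg, order=order)     # a[0] = 1, then AR coefficients
        F = np.zeros((order, order))          # companion (shift) matrix
        F[:-1, 1:] = np.eye(order - 1)
        F[-1] = -a[:0:-1]                     # AR prediction of the new sample
        var_w = np.var(seg)                   # crude driving-noise estimate
        for t, y in enumerate(seg):
            x = F @ x                         # time update
            P = F @ P @ F.T
            P[-1, -1] += var_w
            k = P @ H / (H @ P @ H + var_v)   # Kalman gain
            x = x + k * (y - H @ x)           # measurement update
            P = P - np.outer(k, H) @ P
            out[s + t] = x[-1]                # denoised sample estimate
    return out
```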
    25  Improved Few-Shot Sound Event Detection Algorithm Based on MAML
    CHEN Haojie YANG Rui PAN Shanliang
    2025, 40(3):741-753. DOI: 10.16337/j.1004-9037.2025.03.014
    Abstract:
    Sound event detection models based on deep learning typically require a substantial amount of labeled data to train from scratch, and access to task-specific data is costly due to restrictions such as data access rights, usage licenses, and the scarcity of rare samples. To address the few-shot challenge in sound event detection, this paper proposes a model-agnostic, gradient-balanced meta-learning algorithm based on model-agnostic meta-learning (MAML). The algorithm trains the model on a large number of N-way K-shot tasks, enabling it to learn rapidly and accurately discriminate unheard sound events in an N-way K-shot target task with minimal gradient updates. In the outer loop, the multiple gradient descent algorithm is used to estimate a dynamic loss-balance factor, encouraging the model to focus on few-shot training tasks that are harder to train and thereby enhancing the model's shared representation. Furthermore, data augmentation and label smoothing are incorporated to mitigate the risk of overfitting caused by the scarcity of training samples. Experimental results demonstrate that the algorithm achieves accuracies of 73.56%, 82.86% and 57.48% in the 5-way 1-shot setting on the ESC50, NSynth and DCASE2020 datasets, respectively, about a 10% relative improvement over the original MAML algorithm.
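    A first-order MAML sketch of the training loop; the paper's dynamic balance factors are estimated by multi-gradient descent, for which fixed task weights stand in here, and loss_fn(net, batch) is an assumed helper returning a scalar loss.

```python
import copy
import torch

def fomaml_step(model, meta_opt, tasks, loss_fn, inner_lr=0.01,
                inner_steps=5, task_weights=None):
    """One meta-update in first-order MAML style (sketch).
    tasks: list of (support_batch, query_batch) pairs."""
    meta_opt.zero_grad()
    if task_weights is None:
        task_weights = [1.0 / len(tasks)] * len(tasks)
    for w, (support, query) in zip(task_weights, tasks):
        learner = copy.deepcopy(model)                 # task-specific fast weights
        inner_opt = torch.optim.SGD(learner.parameters(), lr=inner_lr)
        for _ in range(inner_steps):                   # inner loop: adapt
            inner_opt.zero_grad()
            loss_fn(learner, support).backward()
            inner_opt.step()
        inner_opt.zero_grad()
        (w * loss_fn(learner, query)).backward()       # outer loop: evaluate
        for p, lp in zip(model.parameters(), learner.parameters()):
            # First-order approximation: treat the adapted gradients as
            # the meta-gradients of the original parameters.
            p.grad = lp.grad.clone() if p.grad is None else p.grad + lp.grad
    meta_opt.step()
```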
    26  Sound Event Detection Method Based on Feature Fusion
    ZHAO Ming CHEN Rui
    2025, 40(6):1556-1567. DOI: 10.16337/j.1004-9037.2025.06.014
    Abstract:
    Most existing deep learning-based sound event detection methods adopt conventional 2D convolution. However, its inherent translation invariance is ill-suited to audio signals, limiting the model's ability to detect complex sound events. To address this issue, a hybrid convolutional neural network based on feature fusion is proposed. Specifically, by calculating the distribution of the audio spectrogram and adaptively generating convolutional kernels, the proposed model dynamically extracts local features that remain physically consistent with the audio signal. Meanwhile, a self-attention mechanism is employed in parallel to capture long-distance feature dependencies in the spectrogram. To eliminate the semantic gap between local and global features, a feature fusion module is designed to effectively integrate the two distinct feature representations. Furthermore, to further enhance detection performance, an improved bidirectional gated recurrent unit based on a multi-scale attention mechanism is proposed to fully refine the fused features, emphasizing event-related frames and suppressing background frames. Experimental results on the DCASE2020 dataset indicate that the proposed model achieves an F1-score of 52.57%, outperforming other existing methods.
    27  Analysis on Comprehensive Impact of Contacting Force of Over-Ear Headphones on Noise Reduction and Comfort
    CHEN Zihan YU Guangzheng WANG Yewei LI Zhelin
    2025, 40(4):986-996. DOI: 10.16337/j.1004-9037.2025.04.012
    Abstract:
    Over-ear (around-ear) headphones are acoustic wearable devices that directly contact the surface of the human body. In addition to the shape and material of the earmuffs, the clamping force applied to the earmuffs directly affects the contacting force on the scalp and the noise attenuation performance, thereby influencing the user's wearing comfort and hearing comfort. To address the challenge of measuring and evaluating contact pressure in headphone products, a testing device is designed to apply an adjustable clamping force to a subject, while the contact pressure exerted on the scalp is measured with a pair of pressure-sensitive films. To analyze acoustic parameters during wearing, a pair of miniature microphones is positioned at the ear canal entrances to record and analyze the attenuation of the binaural noise exposure dose (i.e., the noise reduction amount) under different noise environments and clamping forces. Finally, by incorporating comfort rating scales, the study examines the relationship between the objective parameters, including clamping force, contacting force and noise attenuation, and subjective comfort perception, and suggests an appropriate range for clamping force design. The experimental methodology and conclusions of this study provide a reference for the design and evaluation of clamping force in over-ear headphones.