• Volume 29, Issue 2, 2014 Table of Contents
    • Study on Key Technologies in Practical Speech Emotion Recognition

      2014, 29(2):157-170.


      Abstract: This paper surveys recent and anticipated progress in speech emotion recognition, with an emphasis on practical research oriented toward real-world applications. We discuss the history and development of affective computing, its practical applications, and the general speech emotion recognition pipeline, including emotion models, emotion databases, feature extraction, and recognition algorithms. Considering the needs of real-world applications, we focus on the key methods in practical speech emotion research. We analyze the current challenges in practical speech emotion recognition, especially for the fidgetiness emotion, and review methods for database construction, feature analysis, and modeling. Finally, we outline future directions in speech emotion research and discuss the remaining challenges and possible solutions.

    • Deep Speech Signal and Information Processing: Research Progress and Prospect

      2014, 29(2):171-179.


      Abstract: This paper first gives a brief introduction to deep learning, then reviews research progress in deep speech signal and information processing along its main branches: speech recognition, speech synthesis, and speech enhancement. For speech recognition, acoustic modeling based on deep neural networks (DNNs), DNN training technologies for big speech data, and DNN speaker adaptation methods are introduced. For speech synthesis, several synthesis methods based on deep learning models are summarized. For speech enhancement, a couple of typical DNN-based enhancement frameworks are presented. Finally, possible future research directions in deep speech signal and information processing are discussed.

    • Measurement and Analysis of Structural Head-Related Transfer Functions

      2014, 29(2):180-185.


      Abstract: In this work, structural head-related transfer functions with high spatial resolution were measured using a spark-gap generator. The structural head-related transfer functions comprise the head transfer function, the head-and-pinna transfer function, and the head-and-torso transfer function. Based on the measurements, the effects of the head, pinnae, and torso on the total head-related transfer function were analyzed. Using the measured structural transfer functions and the total head-related transfer functions, the pinna transfer functions were calculated by two different methods. The results show that the two pinna transfer functions are highly correlated in the frequency domain, which means that, under some conditions, the head-related transfer function can be treated as the sum of the head, torso, and pinna transfer functions.

    • A Coordinate Transformation Parallel Soft Switching Blind Equalization Algorithm and Its DSP Implementation

      2014, 29(2):186-190.


      Abstract: To address the large mean square error of the super-exponential iteration (SEI) blind equalization algorithm and its failure on higher-order non-constant-modulus signals, a parallel soft-switching coordinate-transformation super-exponential iterative decision-directed blind equalization algorithm (CTSEI-DD) is proposed. A coordinate transformation is first introduced into the SEI algorithm to obtain the coordinate-transformation super-exponential iterative algorithm (CTSEI), which is then combined with the decision-directed (DD) algorithm through soft switching. The resulting algorithm converges quickly, attains a small mean square error, and performs well on higher-order QAM signals. After the algorithm's performance was tested and its parameters determined, the code was written in C, debugged in the Code Composer Studio (CCS) integrated development environment, and implemented on a digital signal processor (DSP).

    • A Least Square Approach to the Design of Frequency Invariant Beamformers with Sparse Tap Coefficients

      2014, 29(2):191-197.


      Abstract: The frequency-invariant beamformer (FIB) is of great practical interest for distortionless broadband audio signal acquisition and processing. One representative approach recently proposed for FIB design is the least-squares method based on the spatial response variation (LS-SRV). The performance of the LS-SRV method depends on the FIR tap length: increasing it improves the FIB, but at the cost of greater implementation complexity. To combat this problem, we propose an improved design scheme with sparse FIR tap coefficients, based on the iterative reweighted minimization used in sparse signal representation. The efficacy of the proposed method is demonstrated by design examples.
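      The iterative reweighted idea behind such a design can be illustrated with a toy sketch (not the paper's actual FIB formulation): each pass solves a weighted least-squares problem in closed form, then raises the penalty on small coefficients so that most taps are driven to exact zero. All sizes and regularization constants here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares design: recover sparse tap coefficients h from A h = b.
n_obs, n_taps = 40, 20
A = rng.standard_normal((n_obs, n_taps))
h_true = np.zeros(n_taps)
h_true[[2, 7, 13]] = [1.0, -0.5, 0.8]            # only three active taps
b = A @ h_true

# Iteratively reweighted l2 minimization: every pass re-penalizes small
# coefficients more heavily, pushing them toward exact zero.
lam, eps = 1e-4, 1e-3
w = np.ones(n_taps)
for _ in range(20):
    # Solve min_h ||A h - b||^2 + lam * sum_i w_i h_i^2 in closed form.
    h = np.linalg.solve(A.T @ A + lam * np.diag(w), A.T @ b)
    w = 1.0 / (h**2 + eps)                       # small taps -> large penalty

n_active = int(np.sum(np.abs(h) > 1e-2))         # significant taps remaining
```

      In a real FIB design, A and b would come from the array geometry and the desired spatial response rather than random data.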

    • Robust Speaker Recognition Based on Sparse Coding

      2014, 29(2):198-203.


      Abstract: Speaker recognition suffers severe performance degradation in noisy environments. To address this problem, we propose a novel method based on morphological component analysis. The method employs a universal background dictionary (UBD) to model the common variability of all speakers, a speaker dictionary to model the specific variability of each speaker, and a noise dictionary to model environmental noise. These three dictionaries are concatenated into one large dictionary, over which the test speech is sparsely represented and classified. To improve the discriminability of the speaker dictionaries, we prune speaker atoms that lie close to UBD atoms. To track varying noise, we design an algorithm that updates the noise dictionary from the noisy speech itself. Experiments under various noise conditions show that the proposed method markedly improves the robustness of speaker recognition in noisy environments.
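      The classify-by-residual step over a concatenated dictionary can be sketched as follows. This is a minimal illustration with random toy dictionaries, and an ordinary least-squares fit stands in for a true sparse solver; it is not the paper's pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: dictionaries for 2 speakers plus one noise dictionary,
# each with 5 atoms of dimension 16, concatenated into one matrix.
dim, atoms = 16, 5
D_spk = [rng.standard_normal((dim, atoms)) for _ in range(2)]
D_noise = rng.standard_normal((dim, atoms))
D = np.hstack(D_spk + [D_noise])

# Test vector built from speaker 1's atoms plus a little "noise" content.
x = D_spk[1] @ rng.standard_normal(atoms) + 0.1 * (D_noise @ rng.standard_normal(atoms))

# Represent x over the big dictionary (least squares stands in for a
# sparse solver), then classify by per-speaker reconstruction residual.
c, *_ = np.linalg.lstsq(D, x, rcond=None)
residuals = []
for s in range(2):
    keep = np.zeros_like(c)
    sl = slice(s * atoms, (s + 1) * atoms)
    keep[sl] = c[sl]                             # this speaker's coefficients
    keep[2 * atoms:] = c[2 * atoms:]             # noise atoms serve every hypothesis
    residuals.append(float(np.linalg.norm(x - D @ keep)))

predicted_speaker = int(np.argmin(residuals))    # smallest residual wins
```

      The noise sub-dictionary participates in every hypothesis, which is what lets the classifier discount environmental noise when comparing speakers.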

    • Research on HMM-based Articulatory Movement Prediction for Chinese

      2014, 29(2):204-210.


      Abstract: Articulatory features represent the quantitative positions and continuous movements of articulators such as the tongue, lips, jaw, and velum during speech production. This paper investigates articulatory feature prediction for Chinese given text and audio inputs. First, a method for recording and preprocessing articulatory features captured by electromagnetic articulography (EMA) is designed; normalization for head movement and the occlusal surface guarantees the reliability of the features. Then, unified acoustic-articulatory hidden Markov models (HMMs) are introduced to predict Chinese articulatory features and to achieve acoustic-to-articulatory inversion. Several aspects of the method are analyzed, including the effectiveness of context-dependent modeling, the differences among model clustering methods, and the influence of cross-stream dependency modeling. The results show that the best performance is achieved by unified acoustic-articulatory triphone HMMs with separate clustering of acoustic and articulatory model parameters and a dependent-feature model structure.

    • Clipping Restoration of Audio Signals Based on Kernel Fisher Discriminant and Weighted Codebook Mapping

      2014, 29(2):211-221.


      Abstract: This paper proposes a clipping restoration method for audio signals based on the kernel Fisher discriminant (KFD) and weighted codebook mapping (WCBM) in the modified discrete cosine transform (MDCT) domain. First, four clipping features are extracted from the MDCT coefficients of the audio signal. Second, these feature parameters are used to train an optimal kernel Fisher classifier, which detects clipping. Finally, WCBM of the sub-band envelope is applied to restore the clipped audio. Test results indicate that the proposed algorithm effectively removes clipping distortion and clearly outperforms existing clipping restoration methods.
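      The paper detects clipping from MDCT-domain features with a kernel Fisher classifier. As a much cruder illustration of what clipping looks like in the time domain, one can count samples pinned at full scale; everything below is a toy stand-in, not the paper's feature set.

```python
import numpy as np

fs = 8000
t = np.arange(fs) / fs                           # one second at 8 kHz
clean = 0.6 * np.sin(2 * np.pi * 440 * t)
clipped = np.clip(1.8 * clean, -1.0, 1.0)        # drive the tone into hard clipping

def clipping_ratio(x, level=0.999):
    """Fraction of samples pinned at (or beyond) the clipping level."""
    return float(np.mean(np.abs(x) >= level))

r_clean = clipping_ratio(clean)                  # 0.0 for the unclipped tone
r_clipped = clipping_ratio(clipped)              # a sizeable fraction when clipped
```

      A transform-domain detector such as the paper's KFD classifier generalizes this idea to softer, less obvious clipping.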

    • A Self-Learning Approach for Monaural Speech Enhancement Based on Sparse and Low-Rank Matrix Decomposition

      2014, 29(2):223-226.


      Abstract: To remove the dependence of existing dictionary-learning enhancement algorithms on prior training data, an unsupervised self-learning approach for single-channel speech enhancement is presented. First, the algorithm decomposes the magnitude spectrum of the noisy speech into a low-rank part, a sparse part, and a residual noise part. Then, a noise dictionary is acquired by learning on the low-rank part. Finally, the clean speech is separated using the acquired noise dictionary and multiplicative update rules. Because the approach is unsupervised, it is more convenient and practical than other enhancement methods based on dictionary learning. Experimental results show that the proposed approach outperforms methods such as robust principal component analysis and multiband spectral subtraction in terms of preserving harmonic structure and suppressing noise.
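      The multiplicative update rules mentioned above are standard in nonnegative matrix factorization; a minimal Euclidean-distance variant (the classic Lee-Seung form, run on a toy nonnegative matrix rather than a real spectrogram) looks like this:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy nonnegative "magnitude spectrogram" of exact rank 3.
F, T, K = 12, 30, 3
V = rng.random((F, K)) @ rng.random((K, T))

# Random positive initialization, then standard multiplicative updates
# for the Euclidean cost ||V - W H||^2; nonnegativity is preserved.
W = rng.random((F, K)) + 0.1
H = rng.random((K, T)) + 0.1
eps = 1e-9                                       # guards against division by zero
for _ in range(300):
    H *= (W.T @ V) / (W.T @ W @ H + eps)
    W *= (V @ H.T) / (W @ H @ H.T + eps)

rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

      In the paper's setting, W would hold the learned noise (and speech) spectral atoms and H their activations over time.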

    • Voice Conversion Based on a Mixed GMM and ANN Model

      2014, 29(2):227-231.


      Abstract: Since the mean vectors of the GMM parameters represent the basic shapes of the converted feature vectors, this paper proposes a novel spectral conversion method based on a mixed GMM and ANN model, which alleviates the over-smoothing problem by using an ANN to transform the GMM mean vectors. Both static and dynamic spectral features are used in approximating the converted spectrum sequence, yielding a continuous converted spectrum. Moreover, because pitch is very important to voice conversion, F0 is also analyzed and transformed on top of the spectral conversion. The proposed method was evaluated with subjective and objective tests, and the results show that it obtains better speech quality than the earlier voice conversion system based on the conventional GMM method.

    • Speech Enhancement Algorithm Based on an Adapted Super-Gaussian Mixture Model

      2014, 29(2):232-237.


      Abstract: Observation of speech spectral structure shows that the statistics of a speech signal cannot be well captured by a single simple probability density function (PDF). This paper therefore presents a speech enhancement algorithm based on a super-Gaussian mixture model. First, the super-Gaussian mixture model is employed to model the speech spectral amplitude, which captures the statistical behavior of speech more flexibly than conventional simple speech models. The PDF and weight of each mixture component are then adapted, overcoming the inability of traditional simple models to track the dynamic characteristics of speech. Simulation results show that the proposed algorithm achieves better noise suppression and lower speech distortion than conventional short-time spectral estimation algorithms.

    • Speech Emotion Recognition Based on KPCA and CCA

      2014, 29(2):238-242.


      Abstract: This paper proposes the use of kernel principal component analysis (KPCA) for speech emotion recognition, as well as a method combining KPCA with canonical correlation analysis (CCA). Compared with the traditional PCA method, the results show that emotion recognition based on KPCA, and on KPCA combined with CCA, achieves better performance.
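      A bare-bones kernel PCA, the first stage of such a front end, can be written directly in NumPy. The RBF kernel, its width, and the random toy data below are illustrative; the paper's features and kernel choice may differ.

```python
import numpy as np

rng = np.random.default_rng(3)

X = rng.standard_normal((50, 4))                 # 50 feature vectors of dim 4

def rbf_kernel(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Kernel PCA: eigendecompose the doubly centered Gram matrix; the leading
# eigenvectors give the nonlinear principal components.
K = rbf_kernel(X, X)
n = K.shape[0]
J = np.eye(n) - np.ones((n, n)) / n
Kc = J @ K @ J                                   # centering in feature space
vals, vecs = np.linalg.eigh(Kc)
order = np.argsort(vals)[::-1]                   # sort eigenvalues descending
vals, vecs = vals[order], vecs[:, order]

n_comp = 2
# Projection of each training point onto the leading components: sqrt(lambda) * v.
Z = vecs[:, :n_comp] * np.sqrt(np.maximum(vals[:n_comp], 0.0))
```

      The reduced representation Z would then feed the CCA step or a downstream emotion classifier.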

    • An Approach to Speaker Verification Based on Supervector Clustering with a Poor Corpus

      2014, 29(2):243-247.


      Abstract: Earlier feature mapping methods require a large corpus with channel labels, and recent unsupervised channel clustering still requires speech recorded over different channels. This paper presents a new speaker verification method based on supervector clustering that maintains performance while reducing the data requirements. Under a poor training corpus, the approach exploits the inter-speaker variability between male and female speakers: the mixed effects of speaker and channel information are clustered, and feature mapping is then applied after deciding the category of the unprocessed speech. Experiments show advantages over other methods on a poor corpus, from both the data-requirement and performance perspectives.

    • Nonlinear Analysis of Audio Signals Using the Surrogate Data Test

      2014, 29(2):248-253.


      Abstract: A nonlinear analysis method using the surrogate data test is proposed for noisy audio signals. First, several groups of surrogate data are generated according to the linear hypothesis on the audio signal. Then, the kurtosis of the original audio and of the surrogate data is calculated. Finally, nonlinear components in the original audio are detected via hypothesis testing. Experimental results confirm the nonlinear nature of audio signals and show that the method differentiates between audio signals and noise better than nonlinear analysis based on the largest Lyapunov exponent.
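      The surrogate-test procedure, with phase-randomized surrogates (which preserve the power spectrum, i.e. the linear properties) and kurtosis as the discriminating statistic, can be sketched as follows; the spike-train test signal is illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)

def phase_randomized_surrogate(x, rng):
    """Surrogate with the same power spectrum as x but random Fourier
    phases -- consistent with the linear (Gaussian) null hypothesis."""
    X = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2 * np.pi, X.size)
    phases[0] = 0.0                              # keep the DC bin real
    phases[-1] = 0.0                             # keep the Nyquist bin real (even length)
    return np.fft.irfft(np.abs(X) * np.exp(1j * phases), n=x.size)

def excess_kurtosis(x):
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2) ** 2 - 3.0

# Strongly non-Gaussian test signal: a sparse spike train (high kurtosis).
x = np.zeros(2048)
x[rng.integers(0, 2048, 40)] = 5.0 * rng.standard_normal(40)

k_orig = excess_kurtosis(x)
k_surr = [excess_kurtosis(phase_randomized_surrogate(x, rng)) for _ in range(19)]

# Rank test: if the original statistic lies outside the whole surrogate
# ensemble, reject the linear hypothesis -> nonlinear structure detected.
nonlinear = k_orig > max(k_surr)
```

      Phase randomization spreads the spikes into a near-Gaussian waveform, so the surrogates' kurtosis collapses toward zero while the original's stays large.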

    • The Splicing Feature Extraction and Analysis Based on Fractional Cepstrum Transform in Voice Forensics

      2014, 29(2):254-258.


      Abstract: Within voice forensics research, this article presents a preliminary study of splicing detection for recordings with the same sampling rate, focusing on how splicing affects the noise characteristics. It also presents an algorithm for spliced-frame detection based on the fractional cepstrum transform (FRCT) and proposes a model for joint identification of voice splicing. Experimental results show that, at the noise instants, the zero-crossing-ratio performance of the FRCT method is much better than that of MFCC when the fractional factor a is about 0.2; furthermore, when a is about 1.2, the high-frequency variance performance of the FRCT method is also better than that of MFCC. The scheme has high value and broad application prospects in the field of voice forensics.

    • A Neural Network Speech Watermarking Method Based on Short-term Energy and Least Relative Mean Square Error Criterion

      2014, 29(2):259-264.


      Abstract: To overcome the weaknesses of the least mean square (LMS) and recursive least squares (RLS) methods, a new neural network speech watermarking method based on short-term energy and a least relative mean square error (LRMS) criterion is proposed. First, a synchronization sequence is embedded into the first frame of the speech. Then, the short-term energy of each frame is calculated and the discrete wavelet transform (DWT) is applied to frames whose energy exceeds a threshold. Finally, the watermark is embedded and extracted via a trained LRMS-based neural network. Setting a reasonable short-term energy threshold balances watermarking capacity against robustness, and the Levenberg-Marquardt (LM) algorithm makes the network converge quickly. Theoretical analysis and experimental results show that, compared with [8], the improved neural network scheme converges faster and is more robust against attacks such as additive noise, low-pass filtering, resampling, and requantization, with performance improved by 5% on average.

    • Speech Enhancement Based on Convolutive Nonnegative Matrix Factorization with Sparseness Constraints

      2014, 29(2):265-273.


      Abstract: Improving the quality of enhanced speech under non-stationary noise and low SNR has long been a problem in speech enhancement research. In recent years, convolutive nonnegative matrix factorization has been used successfully for speech enhancement. Considering the sparsity of speech signals in the frequency domain, a speech enhancement method based on sparse convolutive nonnegative matrix factorization (SCNMF) is proposed. The method consists of a training stage and a denoising stage. In the training stage, prior information about the speech and noise spectra is modeled by the SCNMF algorithm, and speech and noise dictionaries are constructed. In the denoising stage, the spectrum of the noisy speech is analyzed by SCNMF; the speech and noise dictionaries are then used to estimate the speech coding matrix, and the enhanced speech is reconstructed. The impact of the sparsity factor on enhanced speech quality is analyzed through simulation. Experimental results show that, under non-stationary noise and low SNR, the proposed method outperforms traditional speech enhancement algorithms such as MSS, NMF, and CNMF.

    • Fast Query-by-Example Spoken Term Detection Using Segmental Dynamic Time Warping

      2014, 29(2):274-279.


      Abstract: This paper presents a method of query-by-example spoken term detection (QbE STD) using segmental dynamic time warping (SDTW) and a lower-bound estimate (LBE). The approach is designed for low-resource situations in which limited or no in-domain training material is available. Phone posterior probabilities are first computed for the query examples and the test material; candidate segments are then selected from the test material, and lower-bound estimates of the actual DTW scores between the query example and all candidate segments are computed quickly. A K-nearest-neighbor (KNN) search is used to find the segments with maximal similarity. Finally, the retrieval results are refined by pseudo-relevance feedback (PRF). Experimental results indicate that, although retrieval precision degrades slightly compared with running the full DTW procedure directly, the retrieval speed of the presented method is far higher, and the retrieval precision can be effectively enhanced by the PRF refinement.
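      The lower-bound pruning idea can be illustrated with a band-constrained DTW and an LB_Keogh-style envelope bound. This toy sketch on scalar sequences is not the paper's SDTW/posteriorgram pipeline, but it shows why a cheap lower bound lets most full DTW evaluations be skipped.

```python
import numpy as np

rng = np.random.default_rng(5)

def dtw(a, b, r=3):
    """DTW distance with squared local cost and a Sakoe-Chiba band of width r."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - r), min(m, i + r) + 1):
            cost = (a[i - 1] - b[j - 1]) ** 2
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def lb_keogh(q, c, r=3):
    """Cheap lower bound on the band-constrained dtw(q, c): distance from
    each c[j] to the [min, max] envelope of q over the band."""
    lb = 0.0
    for j in range(len(c)):
        lo, hi = max(0, j - r), min(len(q), j + r + 1)
        u, l = q[lo:hi].max(), q[lo:hi].min()
        if c[j] > u:
            lb += (c[j] - u) ** 2
        elif c[j] < l:
            lb += (c[j] - l) ** 2
    return lb

query = np.sin(np.linspace(0.0, 3.0, 40))
candidates = [rng.standard_normal(40) for _ in range(10)]

# Rank candidates by lower bound; run the full DTW only while the bound
# can still beat the best score found so far.
ranked = sorted(candidates, key=lambda c: lb_keogh(query, c))
best, full_evals = np.inf, 0
for c in ranked:
    if lb_keogh(query, c) >= best:
        break                                    # bound exceeds best: prune the rest
    best = min(best, dtw(query, c))
    full_evals += 1
```

      Because the envelope bound never exceeds the true banded DTW score, pruning on it can never discard the true best match.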

    • Multi-Stream Based Out-of-Vocabulary Term Detection

      2014, 29(2):280-285.


      Abstract: For out-of-vocabulary (OOV) term detection in spoken term detection (STD), we propose a multi-stream detection algorithm which makes use of three sub-word units: phone, syllable

    • Incorporating Query Expansion into Dynamic Match for Out-Of-Vocabulary Word Detection

      2014, 29(2):286-292.


      Abstract: One of the current challenges in keyword spotting is out-of-vocabulary (OOV) words. Detection performance for OOV words is considerably worse than for in-vocabulary (INV) words because of the high uncertainty in their pronunciation. This paper presents a method that improves OOV word detection by incorporating query expansion into dynamic match. We first compare joint-multigram-model-based query expansion with minimum-edit-distance (MED) based dynamic match for OOV words. Given their potential complementarity, we propose two fusion methods. The first is result fusion: OOV word detection is performed in parallel with query expansion and with dynamic match, and the search results of the two systems are merged. The second is confidence fusion: MED and the pronunciation score are combined into a hybrid confidence measure for OOV word detection and verification. Tests show that the second fusion method is more efficient, improving the figure of merit by 19.8% relative.
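      The confidence-fusion variant can be sketched as follows. The weighting scheme, score normalization, and the example phone sequences are all hypothetical; a standard Levenshtein distance supplies the MED term.

```python
def min_edit_distance(a, b):
    """Levenshtein distance between two phone sequences."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution / match
    return d[m][n]

def hybrid_confidence(acoustic_score, hyp_phones, ref_phones, alpha=0.5):
    """Hypothetical fusion: weighted sum of an acoustic confidence and an
    MED-based pronunciation-match score (both assumed to lie in [0, 1])."""
    med = min_edit_distance(hyp_phones, ref_phones)
    match = 1.0 - med / max(len(hyp_phones), len(ref_phones), 1)
    return alpha * acoustic_score + (1.0 - alpha) * match

score = hybrid_confidence(0.8, "b ei jing".split(), "b ei j ing".split())
```

      A detection is then accepted when the fused score exceeds a tuned threshold, which is where the verification step described in the abstract would operate.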

    • Endpoint Detection of Noise-Corrupted Speech Time-Frequency Characteristics Based on Wavelet Packet Decomposition

      2014, 29(2):293-297.


      Abstract: To overcome the mode-mixing problem of the Hilbert-Huang transform (HHT) in speech processing, a new time-frequency analysis method based on wavelet packet decomposition (WPD) is proposed. First, the noise-corrupted speech is decomposed by WPD, empirical mode decomposition (EMD) is applied to each component separately, and the intrinsic mode functions (IMFs) are selected using a correlation threshold criterion. The Hilbert spectrum and instantaneous energy spectrum of the speech signal are then obtained. Finally, the WPD-based instantaneous energy spectrum is applied to endpoint detection of noise-corrupted speech. Experimental results indicate that the proposed method is more accurate, robust, and adaptive than the original generalized dimension (OGD) and spectral entropy (SE) algorithms. The proposed method effectively describes the time-frequency characteristics of the nonlinear, non-stationary speech signal and provides a new idea for speech signal research.

    • A Zero Crossing Algorithm for Time Delay Estimation

      2014, 29(2):298-303.


      Abstract: Time delay estimation (TDE) is one of the critical technologies in radar and sonar systems. This paper proposes a new TDE algorithm based on zero crossings (ZC). The basic model of zero-crossing TDE is discussed, and the mean squared error under additive white Gaussian noise is derived analytically. The accuracy of the TDE is shown to be a function of the signal-to-noise ratio (SNR), the signal frequency, and the number of ZC points. At high SNR the ZC algorithm approaches the accuracy of the FFT method, yet with lower computational complexity and lower processing latency, making the proposed estimator suitable for real-time applications.
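      A minimal zero-crossing delay estimator for a noise-free tone can be written as below; all parameters (sample rate, frequency, delay) are illustrative, and the interpolation and wrapping details are one straightforward realization rather than the paper's exact formulation.

```python
import numpy as np

fs = 48000.0
f0 = 1000.0
true_delay = 2.5e-4                              # 250 us between the two channels
t = np.arange(4096) / fs
x = np.sin(2 * np.pi * f0 * t)
y = np.sin(2 * np.pi * f0 * (t - true_delay))    # delayed copy of x

def rising_zero_crossings(s, t, fs):
    """Times of rising zero crossings, refined by linear interpolation."""
    idx = np.where((s[:-1] < 0) & (s[1:] >= 0))[0]
    frac = -s[idx] / (s[idx + 1] - s[idx])       # sub-sample offset in [0, 1)
    return t[idx] + frac / fs

zx = rising_zero_crossings(x, t, fs)
zy = rising_zero_crossings(y, t, fs)
n = min(len(zx), len(zy))
diffs = zy[:n] - zx[:n]
period = 1.0 / f0
# Wrap each crossing-time difference into [-T/2, T/2) before averaging,
# which removes the ambiguity in pairing crossings between channels.
diffs = (diffs + period / 2) % period - period / 2
est_delay = float(np.mean(diffs))
```

      Averaging over many crossings is what gives the method its noise resilience at high SNR, at a fraction of the cost of a cross-correlation over the whole record.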

    • An Improved Algorithm for Pitch Period Detection

      2014, 29(2):304-308.


      Abstract: Pitch period extraction has a wide range of applications in speech signal processing. Inspired by the traditional autocorrelation algorithm and the pitch detection method used in the multi-band excitation (MBE) vocoder, we propose an improved algorithm for pitch period extraction. The algorithm has five parts: pre-processing, rough pitch estimation in the time domain, pitch smoothing, a search with a time-variable filter, and fractional pitch estimation. Experimental results show that the new algorithm achieves higher accuracy and better noise immunity than the traditional autocorrelation algorithm.
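      The rough time-domain estimation stage can be sketched with a plain autocorrelation pitch tracker on a synthetic voiced frame; the parameters are illustrative, and the paper's full algorithm adds smoothing, time-variable filtering, and fractional refinement on top of this.

```python
import numpy as np

fs = 16000
f0 = 200.0                                       # true pitch in Hz
t = np.arange(int(0.04 * fs)) / fs               # one 40 ms voiced frame
# Synthetic voiced frame: fundamental plus two decaying harmonics.
frame = sum(np.sin(2 * np.pi * f0 * k * t) / k for k in range(1, 4))

def pitch_autocorr(frame, fs, fmin=60.0, fmax=500.0):
    """Pitch from the autocorrelation peak restricted to [fmin, fmax]."""
    frame = frame - frame.mean()
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)      # candidate lag range
    lag = lo + int(np.argmax(r[lo:hi]))
    return fs / lag

f_est = pitch_autocorr(frame, fs)                # close to 200 Hz here
```

      Restricting the lag search to a plausible pitch range is what keeps the tracker from locking onto harmonics or sub-harmonics.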

    • Doppler Shift Correction for Wayside Acoustic Signals

      2014, 29(2):309-315.


      Abstract: Wayside acoustic signal analysis is an important direction for the online fault diagnosis of train bearings. However, the Doppler effect caused by the relative motion between the microphone and the acoustic source distorts the spectrum of the acquired signal, preventing it from accurately reflecting the equipment's health condition. To correct this distortion and restore the original spectral structure, this paper proposes a variable-sampling method based on the frequency-shift ratio. First, the frequency-shift curve is computed from the known measurement conditions; second, the frequency-shift ratio of every sampling point is obtained from this curve; finally, using variable-sampling technology, the resampled signal is obtained by interpolation. The method improves on the resampling method based on the frequency-shift curve previously proposed by the authors, and its effectiveness is verified by simulation and experiment.
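      The variable-sampling idea, reading the received signal back on the source's time axis using the known frequency-shift ratio, can be sketched as follows; the synthetic pass-by tone and the linear shift-ratio curve are illustrative stand-ins for the measured conditions.

```python
import numpy as np

fs = 8000.0
n = int(fs)                                      # one second of samples
f_src = 400.0
# Known per-sample frequency-shift ratio (received/source), as during a
# pass-by: about +5% while approaching, -5% while receding.
shift_ratio = 1.05 - 0.1 * np.arange(n) / n

# Source time elapsed at each received sample (running integral of the ratio).
warped_time = np.concatenate(([0.0], np.cumsum(shift_ratio[:-1]))) / fs
received = np.sin(2 * np.pi * f_src * warped_time)   # Doppler-shifted tone

# Variable resampling: interpolate the received signal back onto a uniform
# source-time grid, which undoes the frequency shift.
uniform_time = np.arange(int(warped_time[-1] * fs)) / fs
restored = np.interp(uniform_time, warped_time, received)
```

      After resampling, `restored` is again a constant-frequency tone at `f_src`, so its spectrum is no longer smeared by the pass-by.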

    • Comparison of Four Typical Clearness Methods for Beamforming Acoustic Source Identification

      2014, 29(2):316-326.


      Abstract: To apply the clearness methods for beamforming acoustic source identification correctly, imaging diagrams for a given single source, incoherent sources, and coherent sources, together with the corresponding performance curves, were simulated, and loudspeaker source identification experiments were conducted. The characteristics of DAMAS2, FFT-NNLS, CLEAN, and CLEAN-SC were demonstrated and compared. Three conclusions were drawn. First, for a single source or incoherent sources, all four methods suppress sidelobes effectively and improve resolution remarkably, with CLEAN-SC the most accurate. Second, DAMAS2 and FFT-NNLS achieve high accuracy for coherent sources, while CLEAN-SC cannot identify them. Third, DAMAS2 has the highest computational efficiency, followed by FFT-NNLS, with CLEAN and CLEAN-SC slightly slower. These conclusions offer guidance for the correct application of these methods in practical engineering.

    • Research on the Influence of Transfer Function and Plane Wave Incidence Angle on Synthesized Sound Field

      2014, 29(2):327-332.


      Abstract: The performance of a synthesized plane-wave sound field is affected by the evanescent contribution of the transfer function and by the incidence angle of the plane wave. In this paper, the transfer function is modified with a rectangular window in the wavenumber domain to eliminate the evanescent sound field. With a fixed plane-wave incidence angle and a fixed spatial sampling interval of the secondary sources, the influence of the windowed transfer function and of the incidence angle on the synthesized plane-wave sound field is analyzed under the anti-aliasing condition. Simulation results demonstrate that, for a given incidence angle, the modified transfer function improves the synthesized plane-wave sound field when the plane-wave frequency is below the anti-aliasing frequency.
