Abstract: Articulatory features represent the quantitative positions and continuous movements of articulators, such as the tongue, lips, jaw, and velum, during speech production. This paper investigates articulatory feature prediction for Chinese given text and audio inputs. First, a method for recording and preprocessing articulatory features captured by electromagnetic articulography (EMA) is designed; normalization for head movement and the occlusal surface ensures the reliability of the recorded features. Then, unified acoustic-articulatory hidden Markov models (HMMs) are introduced to predict Chinese articulatory features and to perform acoustic-to-articulatory inversion. Several aspects of this method are analyzed, including the effectiveness of context-dependent modeling, the differences among model clustering methods, and the influence of cross-stream dependency modeling. The results show that the best performance is achieved by unified acoustic-articulatory triphone HMMs with separate clustering of acoustic and articulatory model parameters and a dependent-feature model structure.