Abstract: A key problem in language identification (LID) is how to design effective representations that are specific to language information. Recent advances in deep neural networks (DNNs) have led to significant improvements in LID. Acoustic features extracted from a structured DNN trained to discriminate phonemes or tri-phone states can significantly improve performance, and end-to-end schemes have also shown strong modelling capability in recent years. A novel end-to-end convolutional neural network (CNN) LID system, called the language identification network (LID-net), is proposed. It takes advantage of the capability of neural networks (NNs) in both feature extraction and discriminative modelling. LID-net extracts units that are discriminative between languages, which we call LID-senones, and uses a pooling layer to derive an effective utterance-level representation from them. Evaluations on NIST LRE 2009 show improved performance over the current state-of-the-art deep bottleneck feature with total variability (DBF-TV) method: LID-net achieves relative equal error rate (EER) improvements of 1.35%, 12.79% and 29.84% on 30, 10 and 3 s utterances, respectively, and a relative gain of over 30% in Cavg for all durations.