Abstract:A study of acoustic segment models(ASMs) for unsupervised query-by-example spoken term detection is presented. Firsty, a Gaussian mixture model(GMM) is trained without any transcription information to label speech frames with Gaussian posteriorgram. Hierarchical agglomerative clustering is used to decompose the posterior features into acoustically exhibiting segments. A label is assigned to each result segment by k-means clustering, then posteriorgram is faciltitated to train ASMs. In query matching phase, Viterbi decode is proposed to represent query and test posteriorgrams as ASM sequences. Dynamic match lattice spotting based on minimum edit distance is used to locate possible occurrences of the query term. Experimental results show that the proposed method outperforms traditional GMM and ASMs tokenizers.