Vietnamese Word Segmentation with Conditional Random Fields and Ambiguity Model

doi:10.16337/j.1004-9037.2017.03.024



Abstract:

This paper discusses the lexical features of Vietnamese and integrates its essential characteristics into conditional random fields (CRFs), proposing a Vietnamese word segmentation method based on CRFs and an ambiguity model. A segmentation corpus of 25 981 Vietnamese entries, obtained through automatic annotation followed by manual proofreading, serves as the CRF training corpus. Crossing ambiguity is widespread in Vietnamese sentences. To reduce its effect, 5 377 ambiguity fragments are extracted from the training corpus using dictionary-based forward and reverse matching, and an ambiguity model is obtained by training a maximum entropy model. Both are then incorporated into the segmentation model. The training corpus is divided evenly into ten parts for cross-validation experiments, in which segmentation accuracy reaches 96.55%. Experimental results show that, compared with the Vietnamese segmentation tool VnTokenizer, the method markedly improves segmentation accuracy, recall, and F value.
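The fragment extraction step can be illustrated with a minimal sketch. It assumes that "forward and reverse matching" means greedy longest-match segmentation over syllables from each end of the sentence, and that a fragment is any stretch between two shared word boundaries where the two passes disagree. The abstract does not give these details, so the function names, the `max_len` parameter, the toy lexicon, and the example sentence are all hypothetical and are not the authors' implementation.

```python
from typing import List, Set

def forward_match(syllables: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy longest match from the left; unknown syllables become one-syllable words."""
    words, i = [], 0
    while i < len(syllables):
        for n in range(min(max_len, len(syllables) - i), 0, -1):
            cand = " ".join(syllables[i:i + n])
            if n == 1 or cand in lexicon:
                words.append(cand)
                i += n
                break
    return words

def backward_match(syllables: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Greedy longest match from the right; result is returned in sentence order."""
    words, j = [], len(syllables)
    while j > 0:
        for n in range(min(max_len, j), 0, -1):
            cand = " ".join(syllables[j - n:j])
            if n == 1 or cand in lexicon:
                words.append(cand)
                j -= n
                break
    return words[::-1]

def boundaries(words: List[str]) -> Set[int]:
    """Syllable offsets at which a segmentation places a word boundary."""
    cuts, pos = {0}, 0
    for w in words:
        pos += len(w.split())
        cuts.add(pos)
    return cuts

def ambiguity_fragments(syllables: List[str], lexicon: Set[str], max_len: int = 4) -> List[str]:
    """Stretches between shared boundaries where the two passes segment differently."""
    fwd = boundaries(forward_match(syllables, lexicon, max_len))
    bwd = boundaries(backward_match(syllables, lexicon, max_len))
    common = sorted(fwd & bwd)
    fragments = []
    for start, end in zip(common, common[1:]):
        inner = set(range(start + 1, end))
        if (fwd & inner) != (bwd & inner):
            fragments.append(" ".join(syllables[start:end]))
    return fragments

# Toy lexicon and sentence, chosen only for illustration (assumption, not from the paper).
LEXICON = {"học sinh", "sinh học"}
SENTENCE = "các học sinh học sinh học".split()

print(forward_match(SENTENCE, LEXICON))
print(backward_match(SENTENCE, LEXICON))
print(ambiguity_fragments(SENTENCE, LEXICON))
```

In this toy case the two passes agree only on the first syllable, so the remaining five syllables form a single candidate fragment. In a pipeline like the one the abstract describes, such fragments would then be passed to a separately trained disambiguation model.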


Get Citation

Xiong Mingming, Li Ying, Guo Jianyi, Mao Cunli, Yu Zhengtao. Vietnamese Word Segmentation with Conditional Random Fields and Ambiguity Model[J]. Journal of Data Acquisition and Processing, 2017, 32(3): 636-642.


History

Online: June 28, 2017