MSDL-IEW: Active Learning Algorithm for Text Classification Based on Density Perception

doi:10.16337/j.1004-9037.2021.02.005

2021, Issue 2, pp. 240-247

Affiliation:

1. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; 2. CETC Big Data Research Institute Co., Ltd., Guiyang 550022, China; 3. Big Data Application on Improving Government Governance Capabilities National Engineering Laboratory, Guiyang 550022, China; 4. Nanjing Power Supply Company, Nanjing 210000, China; 5. China Electronics Technology Cyber Security Co., Ltd., Chengdu 610041, China

Fund Project:

Supported by the National Natural Science Foundation of China (61941113); the Fundamental Research Funds for the Central Universities (30916011328, 30918015103); the Nanjing Science and Technology Plan (201805036); and the Open Fund of the Big Data Application on Improving Government Governance Capabilities National Engineering Laboratory.
Abstract:

To address the problem that unlabeled data in text classification tasks cannot be labeled immediately and that labeling is too costly, this paper proposes an uncertainty-based active learning method for text classification. The MSDL (Measure sample density by LDA) algorithm is proposed to compute the density of unlabeled samples, introducing a new density measure that describes how samples cluster. Initial training samples are selected from high-density regions, which makes the initial training set more representative. The more uncertain samples are then selected from the unlabeled pool and added to the training set, samples are weighted by information entropy during training, and the classifier model is updated iteratively until the expected termination condition is reached. Experimental results show that, on text classification tasks, the method outperforms other traditional active learning algorithms.
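The three ingredients named in the abstract (density-based choice of the initial training set, uncertainty-based querying, and entropy-based sample weighting) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration: the density score is taken as the mean cosine similarity to the k most similar samples over topic vectors (standing in for LDA output), uncertainty is the Shannon entropy of predicted class probabilities, and the weight is 1 plus normalised entropy. The function names `density_scores`, `top_indices`, and `entropy_weights` are hypothetical and are not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def density_scores(topic_vectors, k):
    """Assumed density: mean cosine similarity to the k most similar other samples."""
    scores = []
    for i, u in enumerate(topic_vectors):
        sims = sorted(
            (cosine(u, v) for j, v in enumerate(topic_vectors) if j != i),
            reverse=True,
        )
        top = sims[:k]
        scores.append(sum(top) / len(top) if top else 0.0)
    return scores


def top_indices(scores, n):
    """Indices of the n largest scores; ties go to the lower index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))[:n]


def entropy_weights(prob_rows):
    """Assumed weighting: 1 + entropy / log(num_classes), so uncertain samples count more."""
    weights = []
    for probs in prob_rows:
        max_h = math.log(len(probs)) if len(probs) > 1 else 1.0
        weights.append(1.0 + entropy(probs) / max_h)
    return weights


# Toy data: three similar documents and one outlier, as 2-topic vectors.
topic_vectors = [[0.9, 0.1], [0.8, 0.2], [0.85, 0.15], [0.1, 0.9]]
initial_set = top_indices(density_scores(topic_vectors, k=2), n=2)

# Toy classifier outputs for three unlabeled samples.
predicted = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
query = top_indices([entropy(p) for p in predicted], n=2)
weights = entropy_weights(predicted)
```

In a full loop, the samples indexed by `query` would be labeled, added to the training set with their weights, and the classifier retrained until the stopping condition is met; how the paper combines these steps in detail is not stated in the abstract.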