MSDL-IEW: Active Learning Algorithm for Text Classification Based on Density Perception
Affiliation:

1. School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China; 2. CETC Big Data Research Institute Co., Ltd., Guiyang 550022, China; 3. Big Data Application on Improving Government Governance Capabilities National Engineering Laboratory, Guiyang 550022, China; 4. Nanjing Power Supply Company, Nanjing 210000, China; 5. China Electronics Technology Cyber Security Co., Ltd., Chengdu 610041, China

Fund Project:

National Natural Science Foundation of China (61941113); Fundamental Research Funds for the Central Universities (30916011328, 30918015103); Nanjing Science and Technology Plan (201805036); Open Fund of the Big Data Application on Improving Government Governance Capabilities National Engineering Laboratory.


    Abstract:

    To address the problem that unlabeled data in text classification tasks cannot be labeled immediately and that labeling is costly, this paper proposes an uncertainty-based active learning method for text classification. The MSDL (Measure sample density by LDA) algorithm is proposed to compute the density of unlabeled samples, introducing a new density measure that describes how samples cluster. Initial training samples are selected from high-density regions, which makes the initial training set more representative. The more uncertain samples among the unlabeled samples are then added to the training set, the samples are weighted by information entropy during training, and the classifier model is updated iteratively until the expected termination condition is reached. Experimental results show that, on text classification tasks, this method outperforms other traditional active learning algorithms.
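    The three steps named in the abstract (density-based selection of the initial training set, uncertainty-based selection of new samples, and entropy-based sample weighting) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything specific here is an assumption made for illustration: the per-document topic vectors are assumed to come from an LDA model fitted elsewhere, density is taken as the mean cosine similarity to the k most similar samples, and the training weight is taken as 1 plus the normalized entropy. All function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def density(topics, k=2):
    """ASSUMED density measure (not the paper's formula): mean cosine
    similarity of each topic vector to its k most similar other vectors."""
    scores = []
    for i, t in enumerate(topics):
        sims = sorted(
            (cosine(t, u) for j, u in enumerate(topics) if j != i),
            reverse=True,
        )
        top = sims[:k]
        scores.append(sum(top) / len(top) if top else 0.0)
    return scores


def select_initial(topics, n, k=2):
    """Indices of the n samples lying in the densest regions."""
    d = density(topics, k)
    return sorted(range(len(topics)), key=lambda i: d[i], reverse=True)[:n]


def select_uncertain(class_probs, n):
    """Indices of the n samples whose predicted class distribution
    has the highest entropy (most uncertain predictions)."""
    order = sorted(
        range(len(class_probs)),
        key=lambda i: entropy(class_probs[i]),
        reverse=True,
    )
    return order[:n]


def entropy_weights(class_probs):
    """ASSUMED weighting scheme: 1 + entropy / log(num_classes),
    so each weight lies in [1, 2]. Requires at least two classes."""
    return [1.0 + entropy(p) / math.log(len(p)) for p in class_probs]
```

    In an active learning loop, `select_initial` would be called once on the topic vectors of the unlabeled pool. After each training round, `select_uncertain` would be applied to the classifier's predicted class probabilities for the remaining pool, and `entropy_weights` would supply per-sample weights to a classifier that accepts them.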

    Table 2 Comparison of experimental results of the algorithms when 70% of the samples are labeled
    Fig. 1 Framework of the MSDL-IEW active learning algorithm
    Fig. 2 Influence of the number of labeled samples on precision, recall, and F-value
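Cite this article

TRAN Baphan, 马菲菲, 明晶晶, 余秦勇, 杨辉, 李全兵, 王永利. MSDL-IEW: Active Learning Algorithm for Text Classification Based on Density Perception[J]. Journal of Data Acquisition and Processing (数据采集与处理), 2021, 36(2): 240-247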

History
  • Received: 2020-06-04
  • Last revised: 2020-11-29
  • Published online: 2021-03-25