Multi-Label Text Classification Based on the PLSA Topic Model

Multi-Label Text Categorization Algorithm Based on Topic Model PLSA

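    Abstract (translated from Chinese):

    To address the unclear label relationships of texts and the excessive feature dimensionality in multi-label text classification, a multi-label hypothesis-reuse text classification algorithm based on the probabilistic latent semantic analysis (PLSA) model is proposed. The method first maps the training samples into a latent semantic space with the PLSA model and represents each text by its topic distribution, which removes noise while greatly reducing the data dimensionality. On this basis, the multi-label algorithm of hypothesis reuse (MAHR) is used for classification. Because the features obtained after PLSA dimension reduction already carry semantic information, the algorithm can accurately discover the relationships among labels and use them to train the base classifiers, avoiding the drawback of having to supply label relationships manually. Experiments verify that the method makes full use of the semantic information obtained through PLSA dimension reduction to improve the performance of multi-label text classification.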

    Abstract:

    In multi-label text classification, the relationships among labels are often unclear and the feature dimensionality is too high. To solve these problems, a multi-label text categorization algorithm, the multi-label algorithm of hypothesis reuse based on probabilistic latent semantic analysis (PLSA), is proposed. First, the training samples are mapped into a latent semantic space by the PLSA model, and each text is represented by its topic distribution, which removes noise and significantly reduces the data dimensionality. Then, the multi-label algorithm of hypothesis reuse (MAHR) is used to classify the samples. Because the features obtained from PLSA dimension reduction carry semantic information, the relationships among labels can be discovered accurately and used to train the base classifiers, which avoids the drawback of specifying label relationships manually. Experimental results demonstrate that the proposed method makes full use of the semantic information obtained through PLSA dimension reduction and improves the performance of multi-label text classification.
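
    The first stage described in the abstract, representing each text by its PLSA topic distribution, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `plsa`, the toy matrix `counts`, the smoothing constant and the iteration count are all assumptions made for illustration, and the MAHR classification stage is not shown. The sketch fits PLSA with the standard EM updates in plain Python and returns P(z|d), the low-dimensional document representation.

```python
import random


def plsa(counts, n_topics, n_iter=30, seed=0, eps=1e-12):
    """Fit PLSA by EM on a document-term count matrix (list of lists).

    Returns (p_z_d, p_w_z): per-document topic distributions P(z|d)
    and per-topic word distributions P(w|z).
    """
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def normalize(row):
        # eps keeps the division safe if a row has no mass (assumed smoothing).
        total = sum(row) + eps * len(row)
        return [(x + eps) / total for x in row]

    # Random, strictly positive initialisation of both distributions.
    p_z_d = [normalize([rng.random() for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() for _ in range(n_words)])
             for _ in range(n_topics)]

    for _ in range(n_iter):
        new_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        new_w_z = [[0.0] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                joint = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(joint)
                for z in range(n_topics):
                    resp = n_dw * joint[z] / total
                    new_z_d[d][z] += resp
                    new_w_z[z][w] += resp
        # M-step: renormalise the expected counts into distributions.
        p_z_d = [normalize(row) for row in new_z_d]
        p_w_z = [normalize(row) for row in new_w_z]
    return p_z_d, p_w_z


# Toy corpus (hypothetical): 4 documents over a 6-word vocabulary.
counts = [
    [4, 3, 5, 0, 0, 0],
    [5, 4, 3, 0, 0, 0],
    [0, 0, 0, 4, 5, 3],
    [0, 0, 0, 3, 4, 5],
]

doc_topics, topic_words = plsa(counts, n_topics=2)
for row in doc_topics:
    print([round(p, 3) for p in row])
```

    Each row of `doc_topics` has one entry per topic rather than one per vocabulary word, so it is the reduced feature vector that a multi-label learner would then consume. A practical implementation would vectorize these loops and work on sparse matrices.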

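Cite this article

Jiang Mingchu (蒋铭初); Pan Zhisong (潘志松); You Jun (尤峻). Multi-label text classification based on the PLSA topic model [J]. Journal of Data Acquisition and Processing (数据采集与处理), 2016, 31(3): 541-547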

History
  • Published online: 2016-06-24