Supervised Explicit Semantic Representation for Text Categorization

    Abstract:

    Text representation is a fundamental problem in text categorization and has long received wide attention. There are currently three main approaches to text representation: the bag-of-words model, latent semantic representation, and knowledge-based explicit semantic representation. This paper first analyzes and compares the effectiveness of these three approaches in text categorization. The experiments show that knowledge-based explicit semantic representation does not improve categorization performance as expected. Analysis indicates that the cause is the noise that explicit semantic representation tends to introduce when it expands the document representation. To address this problem, the paper proposes a supervised explicit semantic representation method. The method uses the label information of the dataset to identify the core concepts in a document that are most relevant to classification, and then expands these core concepts to form the document's explicit semantic representation. Results on three standard categorization datasets confirm the effectiveness of the proposed representation.
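
    The two steps named in the abstract (select label-relevant core concepts, then expand only those) can be illustrated with a small sketch. The abstract does not state how relevance is scored or how expansion is performed, so everything specific here is an assumption made for illustration: the scoring rule (how far a concept shifts the label distribution), the `related` dictionary standing in for a knowledge base, and the names `concept_scores`, `represent`, `k`, and `expand_weight`. It is not the authors' implementation.

```python
from collections import Counter, defaultdict


def concept_scores(docs, labels):
    """Score each concept by how strongly its presence shifts the label distribution.

    Assumed scoring rule (not from the paper): max over labels of P(label | concept) - P(label).
    """
    n_docs = len(docs)
    label_count = Counter(labels)
    concept_count = Counter()
    joint = defaultdict(Counter)  # concept -> label -> number of documents
    for concepts, label in zip(docs, labels):
        for c in set(concepts):
            concept_count[c] += 1
            joint[c][label] += 1
    scores = {}
    for c, n_c in concept_count.items():
        scores[c] = max(joint[c][y] / n_c - label_count[y] / n_docs for y in label_count)
    return scores


def represent(doc, scores, related, k=2, expand_weight=0.5):
    """Keep the k most label-relevant concepts of a document and expand only those."""
    # Ties are broken alphabetically so the result is deterministic.
    core = sorted(set(doc), key=lambda c: (-scores.get(c, 0.0), c))[:k]
    rep = Counter()
    for c in core:
        rep[c] += 1.0
        # `related` is a hypothetical stand-in for knowledge-base neighbors.
        for neighbor in related.get(c, []):
            rep[neighbor] += expand_weight
    return dict(rep)


# Toy labeled corpus: each document is already mapped to a list of concepts.
docs = [
    ["football", "goal", "city"],
    ["football", "league", "city"],
    ["election", "vote", "city"],
    ["election", "senate", "city"],
]
labels = ["sports", "sports", "politics", "politics"]
related = {"football": ["soccer"], "election": ["ballot"], "city": ["town"]}

scores = concept_scores(docs, labels)
rep = represent(["football", "city", "goal"], scores, related, k=1)
```

    In this toy corpus "city" appears equally under both labels, so it scores zero and is never expanded. This mirrors the noise problem the abstract describes: expanding every concept would pull in neighbors of concepts that carry no class signal.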

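Cite this article

Sun Fei, Guo Jiafeng, Lan Yanyan, Cheng Xueqi. Supervised Explicit Semantic Representation for Text Categorization[J]. Journal of Data Acquisition and Processing, 2017, 32(3): 550-558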

History
  • Published online: 2017-06-28