基于特征词向量的短文本聚类算法
作者:
作者单位:

作者简介:

通讯作者:

基金项目:


Short Text Clustering Based on Feature Word Embedding
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
    摘要:

    针对互联网短文本特征稀疏和速度更新快而导致的短文本聚类性能较差的问题,本文提出了一种基于特征词向量的短文本聚类算法。首先,定义基于词性和词长度加权的特征词提取公式并提取特征词代表短文本;然后,使用Skip-gram模型(Continous skip-gram model)在大规模语料中训练得到表示特征词语义的词向量;最后,引入词语游走距离(Word mover′s distance,WMD)来计算短文本间的相似度并将其应用到层次聚类算法中实现短文本聚类。在4个测试数据集上的评测结果表明,本文方法的效果明显优于传统的聚类算法,平均F值较次优结果提高了56.41%。

    Abstract:

    Aiming at the problem of poor clustering performance for short text caused by sparse feature and quick updating of short text on the internet, a short text clustering algorithm based on feature word embedding is proposed in this paper. Firstly, the formula for feature word extraction based on word part-of-speech(POS) and length weighting is defined and used to extract feature words as short texts. Secondly, the word embedding that represents semantics of the feature word is gained by means of the training in large scale corpus with continous skip-gram model. Finally, word mover′s distance is introduced to calculate the similarity between short texts and applied in the hierarchical clustering algorithm to realize the short text clustering. The evaluation results on four testing datasets show that the proposed algorithm is significantly superior to traditional clustering algorithms with the mean F of 58.97% higher than the secondly best result.

    参考文献
    相似文献
    引证文献
引用本文

刘欣佘贤栋唐永旺王波.基于特征词向量的短文本聚类算法[J].数据采集与处理,2017,32(5):1052-1060

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
历史
  • 收稿日期:
  • 最后修改日期:
  • 录用日期:
  • 在线发布日期: 2018-04-10