Hot Topic Detection Method of Microblog Short Text Stream Based on Feature Extension

Affiliation:

1. School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China; 2. Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan 030006, China

Fund Project:

National Natural Science Foundation of China (62072294, 62076158, 61906112, 41871286); Key Research and Development Program of Shanxi Province (201803D421024, 201903D421041).

    Abstract:

    With the rapid development of social networks and the Internet, a large volume of microblog short text stream data is being produced. Timely discovery of hot topics in microblog text streams plays an important role in topic recommendation and public opinion monitoring. To address the feature sparsity of microblog short texts, microblog comments are used to extend the features of each microblog, and a feature extension-based hot topic detection (FE-HTD) method for microblog short text streams is proposed. First, comment texts are selected according to the influence of the commenting users and the number of likes each comment receives, and feature words are extracted from the selected comments using word co-occurrence and term frequency-inverse document frequency (TF-IDF) to complete the feature extension of the microblog text. Then, the word pair speed and word pair acceleration of the microblog text stream are computed, the microblog text strength is computed from the numbers of likes and comments, and the burst feature is defined by combining word pair acceleration with microblog text strength. Finally, the variable-length window range of a hot topic is determined according to the speed of the burst word pairs, and the topic structure of the hot topics in the window is obtained by clustering. In the experiments, the proposed algorithm is compared with the text-based topic detection (T-TD) method and the burst words-based topic detection (BW-TD) method. The results show that FE-HTD achieves an accuracy of 76.4% and a recall of 78.7%, which are 10% higher than those of the T-TD and BW-TD methods.
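
    The abstract does not give the formulas behind word pair speed, word pair acceleration, or microblog text strength, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes that speed is the number of posts in a time slice containing a word pair, that acceleration is the change in speed between two consecutive slices, that text strength is a weighted sum of likes and comments, and that the burst score is the product of acceleration and accumulated strength. The function names, the weights `w_like` and `w_comment`, and the product form are all hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def text_strength(likes, comments, w_like=0.5, w_comment=0.5):
    """Hypothetical text strength: weighted sum of likes and comments."""
    return w_like * likes + w_comment * comments


def slice_stats(posts):
    """Compute per-pair speed and strength for one time slice.

    posts: list of (words, likes, comments) tuples, where words is a
    list of tokens (already extended with comment feature words).
    Speed is assumed to be the number of posts containing the pair.
    """
    speed = Counter()
    strength = defaultdict(float)
    for words, likes, comments in posts:
        s = text_strength(likes, comments)
        # Unordered pairs of distinct words co-occurring in one post.
        for pair in combinations(sorted(set(words)), 2):
            speed[pair] += 1
            strength[pair] += s
    return speed, strength


def burst_scores(prev_posts, curr_posts):
    """Score each word pair in the current slice.

    Acceleration is assumed to be the speed difference between the
    current and previous slices; the score multiplies it by the
    pair's accumulated text strength (an illustrative choice).
    """
    prev_speed, _ = slice_stats(prev_posts)
    curr_speed, curr_strength = slice_stats(curr_posts)
    scores = {}
    for pair, v in curr_speed.items():
        acceleration = v - prev_speed.get(pair, 0)
        scores[pair] = acceleration * curr_strength[pair]
    return scores


if __name__ == "__main__":
    previous = [(["flood", "rain"], 1, 0)]
    current = [
        (["flood", "rain", "rescue"], 10, 4),
        (["flood", "rescue"], 6, 2),
    ]
    for pair, score in sorted(burst_scores(previous, current).items()):
        print(pair, score)
```

    Under these assumptions, a pair whose speed is unchanged between slices scores zero regardless of engagement, while a newly frequent pair in well-liked posts scores highest. Pairs whose score exceeds a chosen threshold would then be treated as burst word pairs for the windowing and clustering steps.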

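Cite this article

LI Yanhong, XIE Mengna, WANG Suge, LI Deyu. Hot Topic Detection Method of Microblog Short Text Stream Based on Feature Extension[J]. Journal of Data Acquisition and Processing (数据采集与处理), 2022, 37(3): 621-632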

History
  • Received: 2021-03-24
  • Last revised: 2021-12-20
  • Published online: 2022-05-25