Quality Phrase Mining Method Based on Statistic Features

Affiliation:

1. College of Computer and Cyber Security, Hebei Normal University, Shijiazhuang, 050024, China; 2. Hebei Provincial Engineering Research Center for Supply Chain Big Data Analytics & Data Security, Hebei Normal University, Shijiazhuang, 050024, China; 3. Key Laboratory of Network & Information Security, Hebei Normal University, Shijiazhuang, 050024, China; 4. College of Information Engineering, Hebei GEO University, Shijiazhuang, 050031, China; 5. School of Mathematical Sciences, Hebei Normal University, Shijiazhuang, 050024, China

Fund Project:

Supported by the Major Program of the National Social Science Fund of China (13&ZD091, 18ZDA200).

Abstract:

Quality Phrase mining extracts meaningful phrases from a text corpus and underpins tasks such as document summarization and information retrieval. Existing unsupervised phrase mining methods, however, produce low-quality candidate phrases and assign equal weight to every Quality Phrase feature. This paper proposes a Quality Phrase mining method based on statistical features. It combines frequent N-Gram mining, compositionality constraints on multi-word phrases, and spell checking of single-word phrases to ensure candidate quality. A public knowledge base is used to attach category labels to candidate phrases, from which the feature weights are assigned, and a penalty factor adjusts the weight ratios to account for interactions between features. Candidate phrases are then ranked by their feature-weighted score, and Quality Phrases are extracted from the ranking. Experiments show that the method clearly improves phrase mining accuracy: compared with the best-performing unsupervised phrase mining method, precision, recall, and F1-Score improve by 5.97%, 1.77%, and 4.02%, respectively.
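
The pipeline outlined in the abstract (mine frequent N-Grams, compute statistical features, rank by a weighted score) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the two features (`frequency`, `cohesion`), the whitespace tokenizer, and the weight values are all assumptions made for illustration. The paper's compositionality constraints, spell checking, knowledge-base labeling, and penalty factor are not reproduced here, because the abstract does not specify them.

```python
from collections import Counter


def count_ngrams(docs, max_n=3):
    """Count every contiguous n-gram (n <= max_n) over whitespace-tokenized docs."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def rank_candidates(docs, weights, max_n=3, min_support=2):
    """Rank frequent multi-word n-grams by a weighted sum of two toy features.

    The features and weights are hypothetical stand-ins, not the paper's.
    """
    counts = count_ngrams(docs, max_n)
    # Keep multi-word n-grams that meet the minimum support threshold.
    frequent = {g: c for g, c in counts.items()
                if len(g) > 1 and c >= min_support}
    if not frequent:
        return []
    top = max(frequent.values())
    scored = []
    for gram, c in frequent.items():
        features = {
            # Frequency relative to the most frequent candidate, in (0, 1].
            "frequency": c / top,
            # Share of the rarest word's occurrences covered by the phrase.
            "cohesion": c / min(counts[(w,)] for w in gram),
        }
        score = sum(weights[name] * value for name, value in features.items())
        scored.append((" ".join(gram), score))
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scored, key=lambda item: (-item[1], item[0]))
```

Calling `rank_candidates(docs, {"frequency": 0.5, "cohesion": 0.5})` on a list of strings returns `(phrase, score)` pairs sorted by descending score. Unequal weights per phrase category, as the abstract describes, would replace the single `weights` dictionary with one selected by category label.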

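Cite this article

Yang Huanhuan, Zhao Shuliang, Li Wenbin, Wu Yongliang, Tian Guoqiang. Quality Phrase Mining Method Based on Statistic Features [J]. Journal of Data Acquisition and Processing (数据采集与处理), 2020, 35(3): 458-473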

History
  • Received: 2019-09-19
  • Last revised: 2019-12-11
  • Published online: 2020-05-25