Unsupervised Truth Discovery Method Based on Multi-feature Fusion

Affiliation:

1. Department of Information Construction and Management, Jiangsu Normal University, Xuzhou 221116, China; 2. College of Computer Science and Technology, Jiangsu Normal University, Xuzhou 221116, China

Fund Project:

National Natural Science Foundation of China (61872168); Postgraduate Research & Practice Innovation Program of Jiangsu Province (KYCX20_2382).

Abstract:

Truth discovery is a challenging and active research topic in data integration. Traditional methods infer truths from the interactions between data sources and observed values, and therefore lack sufficient feature information. Deep learning-based methods extract features effectively, but their performance depends on large amounts of manual annotation, and high-quality truth labels are hard to obtain in large numbers in practice. To overcome these problems, this paper proposes an unsupervised truth discovery method based on multi-feature fusion (MFUTD). First, ensemble learning is used to assign “truth” labels without supervision. Then, the pre-trained BERT model and one-hot encoding are used to obtain the semantic features and interaction features of the observed values, respectively. Finally, the multiple features of the observed values are fused and combined with their “truth” labels to build an initial training set, and a truth prediction model is trained by self-training. Experimental results on two real-world datasets show that the proposed method achieves higher truth discovery accuracy than existing methods.
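The three stages named in the abstract (unsupervised labeling, feature fusion, self-training) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: majority voting stands in for the ensemble labeling step, a hashed character-count vector stands in for BERT embeddings, and a small logistic regression stands in for the truth prediction model. The toy claims and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy claims: (source, object, claimed value).
CLAIMS = [
    ("s1", "book1", "Alice"), ("s2", "book1", "Alice"), ("s3", "book1", "Alicia"),
    ("s1", "book2", "Bob"), ("s2", "book2", "Robert"), ("s3", "book2", "Bob"),
]
SOURCES = sorted({s for s, _, _ in CLAIMS})


def pseudo_labels(claims):
    """Assumption: majority voting stands in for the ensemble labeling step."""
    votes = defaultdict(Counter)
    for _, obj, val in claims:
        votes[obj][val] += 1
    labels = {}
    for obj, counter in votes.items():
        winner = counter.most_common(1)[0][0]
        for val in counter:
            labels[(obj, val)] = int(val == winner)
    return labels


def interaction_features(claims, obj, val):
    """Binary indicator vector over sources: 1.0 where the source claims val."""
    supporters = {s for s, o, v in claims if o == obj and v == val}
    return [1.0 if s in supporters else 0.0 for s in SOURCES]


def semantic_features(val, dim=8):
    """Assumption: hashed character counts stand in for a BERT embedding."""
    vec = [0.0] * dim
    for ch in val.lower():
        vec[ord(ch) % dim] += 1.0
    return vec


def fuse(claims, obj, val):
    """Fuse by concatenating the semantic and interaction vectors."""
    return semantic_features(val) + interaction_features(claims, obj, val)


def predict_proba(w, x):
    """Logistic model; the last weight is the bias (zip stops before it)."""
    z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))


def train_logreg(X, y, epochs=200, lr=0.1):
    """Plain stochastic gradient descent on the logistic loss."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        for x, t in zip(X, y):
            g = predict_proba(w, x) - t
            for i, xi in enumerate(x):
                w[i] -= lr * g * xi
            w[-1] -= lr * g
    return w


def self_train(X, y, rounds=3, threshold=0.9):
    """Retrain repeatedly, overwriting labels only where the model is confident."""
    labels = list(y)
    for _ in range(rounds):
        w = train_logreg(X, labels)
        for i, x in enumerate(X):
            p = predict_proba(w, x)
            if p >= threshold:
                labels[i] = 1
            elif p <= 1.0 - threshold:
                labels[i] = 0
    return train_logreg(X, labels), labels


def run_demo():
    labels = pseudo_labels(CLAIMS)
    items = sorted(labels)
    X = [fuse(CLAIMS, obj, val) for obj, val in items]
    y = [labels[item] for item in items]
    w, final_labels = self_train(X, y)
    return items, y, final_labels, [predict_proba(w, x) for x in X]
```

Calling `run_demo()` returns the candidate (object, value) pairs, their initial pseudo-labels, the labels after self-training, and the predicted truth probabilities. Concatenation is used for fusion here only because it is the simplest option; the abstract does not say which fusion operator the method uses.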

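Cite this article

Chen Huafeng, Dong Yongquan, Yang Haolin, Zhang Guoxi. Unsupervised Truth Discovery Method Based on Multi-feature Fusion[J]. Journal of Data Acquisition and Processing (数据采集与处理), 2023, 38(3): 629-642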

History
  • Received: 2022-06-24
  • Last revised: 2022-07-19
  • Published online: 2023-06-09