Data Stream Ensemble Classification Algorithm Based on Tri-training

    Abstract:

    Data stream classification is one of the important research tasks in data mining. Most existing data stream classification algorithms are trained on labeled data, yet in real applications only a small fraction of the data in a stream is labeled. Labeled data can be obtained by manual annotation, but this is expensive and time-consuming. Because unlabeled data are abundant and carry a large amount of information, this paper proposes a data stream ensemble classification algorithm based on Tri-training that exploits unlabeled data while maintaining accuracy. The algorithm divides the data stream into chunks with a sliding window and uses Tri-training to train base classifiers on the first k chunks, each of which contains both labeled and unlabeled data. The classifiers are updated iteratively by weighted voting until all unlabeled data have been labeled. The ensemble of k Tri-training models then predicts chunk k+1; classifiers with a high classification error rate are discarded and new classifiers are built on the current chunk to update the model. Experiments on 10 UCI data sets show that, compared with classical algorithms, the proposed algorithm significantly improves classification accuracy on data streams containing 80% unlabeled data.
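
    The chunk-based procedure described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not name a base learner, so a tiny nearest-centroid classifier stands in for it; the Tri-training loop is simplified (a fixed number of rounds, without the error-rate conditions of the original Tri-training method); member weights are assumed to be the accuracy measured on the labeled part of the newest chunk; and every chunk is assumed to contain at least one labeled instance. The names `NearestCentroid`, `TriTraining`, `StreamEnsemble`, `accuracy`, and `make_chunk` are hypothetical and introduced only for illustration.

```python
import random
from collections import Counter


class NearestCentroid:
    """Tiny stand-in base learner (assumption: the abstract does not name one)."""

    def fit(self, X, y):
        sums, counts = {}, Counter(y)
        for x, label in zip(X, y):
            acc = sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                acc[i] += v
        self.centroids = {c: [v / counts[c] for v in s] for c, s in sums.items()}
        return self

    def predict_one(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, self.centroids[c]))
        return min(self.centroids, key=sq_dist)


class TriTraining:
    """Simplified Tri-training: each learner is retrained on the labeled data
    plus the unlabeled points on which the other two learners agree."""

    def __init__(self, rng, rounds=3):
        self.rng, self.rounds = rng, rounds

    def fit(self, X_lab, y_lab, X_unlab):
        n = len(X_lab)
        self.learners = []
        for _ in range(3):  # diversify the initial learners by bootstrap sampling
            idx = [self.rng.randrange(n) for _ in range(n)]
            self.learners.append(
                NearestCentroid().fit([X_lab[i] for i in idx], [y_lab[i] for i in idx]))
        for _ in range(self.rounds):  # fixed round count is a simplification
            updated = []
            for i in range(3):
                a, b = [self.learners[j] for j in range(3) if j != i]
                agreed = [(x, a.predict_one(x)) for x in X_unlab
                          if a.predict_one(x) == b.predict_one(x)]
                updated.append(NearestCentroid().fit(
                    list(X_lab) + [x for x, _ in agreed],
                    list(y_lab) + [label for _, label in agreed]))
            self.learners = updated
        return self

    def predict_one(self, x):
        votes = Counter(learner.predict_one(x) for learner in self.learners)
        return votes.most_common(1)[0][0]


def accuracy(model, X, y):
    hits = sum(model.predict_one(x) == label for x, label in zip(X, y))
    return hits / len(y) if y else 0.0


class StreamEnsemble:
    """Keeps at most k Tri-training models, one per chunk, with accuracy weights."""

    def __init__(self, k=3, seed=0):
        self.k, self.rng, self.members = k, random.Random(seed), []

    def predict_one(self, x):
        scores = Counter()
        for model, weight in self.members:
            scores[model.predict_one(x)] += weight  # weighted vote
        return scores.most_common(1)[0][0]

    def process_chunk(self, X, y):
        """y[i] is a class label, or None for an unlabeled instance.
        Returns ensemble predictions for the chunk (None for the first chunk)."""
        preds = [self.predict_one(x) for x in X] if self.members else None
        X_lab = [x for x, label in zip(X, y) if label is not None]
        y_lab = [label for label in y if label is not None]
        X_unlab = [x for x, label in zip(X, y) if label is None]
        # Re-weight existing members by accuracy on this chunk's labeled part.
        self.members = [(m, accuracy(m, X_lab, y_lab)) for m, _ in self.members]
        if len(self.members) >= self.k:  # drop the least accurate member
            worst = min(range(len(self.members)), key=lambda i: self.members[i][1])
            self.members.pop(worst)
        new = TriTraining(self.rng).fit(X_lab, y_lab, X_unlab)
        self.members.append((new, accuracy(new, X_lab, y_lab)))
        return preds


def make_chunk(rng, n=50):
    """Hypothetical synthetic chunk: two Gaussian classes, 80% unlabeled."""
    X = [[rng.gauss(4.0 * (i % 2), 1.0), rng.gauss(4.0 * (i % 2), 1.0)] for i in range(n)]
    true = [i % 2 for i in range(n)]
    seen = [t if i % 5 == 0 else None for i, t in enumerate(true)]
    return X, true, seen
```

    To try the sketch, create a `StreamEnsemble`, draw chunks with `make_chunk(random.Random(42))`, and pass each chunk's instances and partially hidden labels to `process_chunk`. For simplicity the sketch starts predicting as soon as one model exists, rather than waiting for the first k chunks as the abstract describes.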

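Cite this article

Hu Xuegang (胡学钢), Ma Liwei (马利伟), Li Peipei (李培培). Data Stream Ensemble Classification Algorithm Based on Tri-training[J]. Journal of Data Acquisition and Processing (数据采集与处理), 2017, 32(5): 853-860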

History
  • Published online: 2018-04-10