Multi-label Data Stream Ensemble Classification Approach Based on Kernel Extreme Learning Machine

Affiliation:

1. Key Laboratory of Big Data Knowledge Engineering, Ministry of Education (Hefei University of Technology), Hefei 230601, China; 2. School of Computer Science and Information Engineering, Hefei University of Technology, Hefei 230601, China

Fund Project:

National Natural Science Foundation of China (61976077, 62076085).

    Abstract:

    Extreme learning machines (ELMs) have been successfully applied to batch multi-label classification thanks to their efficient processing, strong performance, and small number of manually set parameters. However, data streams arising in real-world applications are high-volume, high-speed, multi-label, and subject to concept drift, which challenges traditional multi-label classification algorithms in accuracy, time, and space consumption. This paper therefore proposes a multi-label data stream ensemble classification approach based on the kernel extreme learning machine (KELM). First, to adapt to the data stream setting, a sliding window mechanism partitions the stream into data chunks, and an ensemble of k KELM models is built on the first k data chunks. Meanwhile, to account for label correlation, the Apriori algorithm is used to mine association rules among the labels of each data chunk, and the confidence of co-occurring labels in these rules is incorporated into the ensemble's prediction process to improve overall classification accuracy. Second, the MUENLForest model is introduced to detect whether concept drift occurs in a newly arriving data chunk, and a loss function is defined on the classifiers to update the ensemble so that it adapts to concept drift. Finally, extensive experiments on real multi-label data sets show that, compared with classical multi-label batch and data stream classification methods, the proposed approach adapts quickly to concept drift in multi-label data streams and achieves significantly higher classification accuracy.
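As a rough illustration of the chunk-based ensemble described in the abstract, the Python sketch below trains one KELM per data chunk and averages the member scores at prediction time. It is not the authors' implementation. The RBF kernel, the values of `C` and `gamma`, the class names `KELM` and `ChunkEnsemble`, the drop-oldest replacement policy, and the unweighted averaging are all assumptions made for illustration. The Apriori-based confidence adjustment and the MUENLForest drift detection are omitted.

```python
from collections import deque

import numpy as np


def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


class KELM:
    """Kernel extreme learning machine with closed-form output weights."""

    def __init__(self, C=10.0, gamma=1.0):
        self.C = C          # regularisation strength (assumed value)
        self.gamma = gamma  # RBF kernel width (assumed value)

    def fit(self, X, Y):
        # Y holds one column per label, coded as -1 / +1.
        self.X_train = X
        K = rbf_kernel(X, X, self.gamma)
        # Solve (K + I / C) beta = Y rather than forming an explicit inverse.
        self.beta = np.linalg.solve(K + np.eye(len(X)) / self.C, Y)
        return self

    def decision_function(self, X):
        return rbf_kernel(X, self.X_train, self.gamma) @ self.beta


class ChunkEnsemble:
    """Keeps one KELM per chunk for the k most recent chunks (assumed policy)."""

    def __init__(self, k=3, **kelm_params):
        self.members = deque(maxlen=k)  # the oldest member is dropped automatically
        self.kelm_params = kelm_params

    def partial_fit(self, X_chunk, Y_chunk):
        self.members.append(KELM(**self.kelm_params).fit(X_chunk, Y_chunk))
        return self

    def predict(self, X):
        # Unweighted mean of member scores; a label is predicted when the
        # mean score is positive.
        scores = np.mean([m.decision_function(X) for m in self.members], axis=0)
        return (scores > 0).astype(int)


# Usage on a synthetic two-label stream of five chunks.
rng = np.random.default_rng(0)
ensemble = ChunkEnsemble(k=3, C=10.0, gamma=0.5)
for _ in range(5):
    X = rng.normal(size=(40, 4))
    labels = np.column_stack([X[:, 0] > 0, X[:, 1] + X[:, 2] > 0])
    ensemble.partial_fit(X, np.where(labels, 1.0, -1.0))
print(ensemble.predict(rng.normal(size=(3, 4))))
```

Solving the regularised linear system once per chunk keeps training cost bounded by the chunk size, which is one plausible reason a chunked KELM ensemble suits a streaming setting.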

    Fig.1 Framework of the proposed method
    Fig.2 Statistical test on Accuracy
    Fig.3 Statistical test on F1-measure
    Fig.4 Statistical test on Hamming loss
    Fig.5 Statistical test on Ranking loss
    Fig.6 Statistical test on Average precision
    Table 1 Datasets
    Table 2 Experimental results of two multi-label algorithms on five datasets
    Table 3 Experimental results of three multi-label algorithms on five datasets regarding two evaluation metrics
    Table 4 Experimental results of three multi-label algorithms on five datasets regarding all evaluation metrics
    Table 5 Experimental results of three multi-label algorithms on remaining five datasets regarding all evaluation metrics
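Cite this article

Zhang Haixiang, Li Peipei, Hu Xuegang. Multi-label Data Stream Ensemble Classification Approach Based on Kernel Extreme Learning Machine[J]. Journal of Data Acquisition and Processing, 2022, 37(1): 183-193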

History
  • Received: 2020-07-12
  • Last revised: 2020-11-11
  • Published online: 2022-01-29