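Evaluation of Class Overlap Measures on Imbalanced Data Classification

Author biographies:

Xing Yan (邢延, b. 1968), female, Ph.D., associate professor; research interests: pattern recognition and data mining; E-mail: yanxing@gdut.edu.cn. Wang Xin (汪新, b. 1962), male, Ph.D., professor; research interests: numerical simulation and computational fluid dynamics; E-mail: xinwang@gdut.edu.cn. Chen Jiafeng (陈嘉锋, b. 1993), male, master's student; research interests: pattern recognition and data mining; E-mail: jiafengchan@126.com. Jia Xiaoyan (贾小彦, b. 1986), female, M.S.; research interests: pattern recognition and data mining; E-mail: 707883587@qq.com.

Funding:

Supported by the National Natural Science Foundation of China (51378128) and the Natural Science Foundation of Guangdong Province (2015A030313498).
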

    Abstract:

    Class overlap is the degree to which data from different classes overlap and intermix. It is quantified by two families of measures, one based on geometrical statistics and one based on information theory, and is used to gauge how difficult a classification task is. Real-world classification tasks often involve imbalanced data, where the large gap in sample counts between the majority and minority classes makes classification much harder. This paper experimentally evaluates how effectively class overlap measures can guide imbalanced data classification, with the aim of reducing or even avoiding the heavy computational cost of blind trial and error. First, for two-class problems, validation experiments examine the measures on synthetic imbalanced data generated with different imbalance ratios, class boundary shapes, feature types and probability distributions. Second, based on these experiments, the influence of data imbalance on the class overlap measures is analyzed, and effective ways of using the measures to guide imbalanced classification are identified. Finally, the practical effect of this guidance is verified on real-world imbalanced data sets. The experimental results show that class overlap measures that are robust to the imbalance ratio can effectively guide classifier selection for imbalanced data.
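    As a minimal illustration of a geometry-based class overlap measure, the sketch below computes Fisher's discriminant ratio for a single feature on synthetic two-class Gaussian data at several imbalance ratios. The abstract does not name the specific measures evaluated, so the choice of this measure, the helper names `make_two_class_data` and `fisher_ratio`, and all parameter values are assumptions made for illustration only.

```python
import random
import statistics


def make_two_class_data(n_major, n_minor, separation, seed=0):
    """Draw 1-D Gaussian samples for a majority and a minority class.

    Both classes have unit variance; the minority class mean is shifted
    by `separation` (a hypothetical parameter for this sketch).
    """
    rng = random.Random(seed)
    major = [rng.gauss(0.0, 1.0) for _ in range(n_major)]
    minor = [rng.gauss(separation, 1.0) for _ in range(n_minor)]
    return major, minor


def fisher_ratio(a, b):
    """Fisher's discriminant ratio: (mean_a - mean_b)^2 / (var_a + var_b).

    Larger values indicate less overlap between the two classes.
    """
    mean_gap = statistics.fmean(a) - statistics.fmean(b)
    return mean_gap ** 2 / (statistics.variance(a) + statistics.variance(b))


if __name__ == "__main__":
    # Same class distributions, increasingly skewed class sizes.
    for ratio in (1, 10, 100):
        major, minor = make_two_class_data(2000, 2000 // ratio, separation=2.0)
        print(f"imbalance {ratio}:1  ratio = {fisher_ratio(major, minor):.3f}")
```

    Because this ratio depends only on class means and variances, its expected value does not change with the class sizes; only its sampling noise grows as the minority class shrinks. This is one example of the robustness to imbalance that the abstract describes, though whether the paper evaluates this particular measure is not stated.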

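Cite this article

Xing Yan, Chen Jiafeng, Jia Xiaoyan, Wang Xin. Evaluation of Class Overlap Measures on Imbalanced Data Classification[J]. Journal of Data Acquisition and Processing (数据采集与处理), 2018, 33(5): 936-944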

History
  • Received: 2017-06-12
  • Last revised: 2017-07-10
  • Published online: 2018-10-29