Ensemble Classification of Two-Class Imbalanced Big Data Based on MapReduce and Oversampling
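About the authors:

Zhai Junhai (翟俊海, b. 1964), male, professor; research interests: machine learning and data mining; E-mail: mczjh@126.com. Zhang Mingyang (张明阳, b. 1991), male, master's student; research interests: cloud computing and big data processing. Wang Chenxi (王陈希, b. 1988), male, master's student; research interest: machine learning. Liu Xiaomeng (刘晓萌, b. 1987), female, master's student; research interest: machine learning.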

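Funding:

Supported by the National Natural Science Foundation of China (71371063), the Natural Science Foundation of Hebei Province (F2017201026), and the Natural Science Research Program of Hebei University (799207217071).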


Binary Ensemble Classification for Imbalanced Big Data Based on MapReduce and Upper Sampling
    Abstract:

    This paper proposes a method for classifying two-class imbalanced big data based on MapReduce and oversampling (called upper sampling in the paper). The method has five steps: (1) for each positive instance, find its nearest neighbor in the opposite (negative) class using MapReduce; (2) generate several new positive instances on the line segment between the two points; (3) randomly partition the negative instances into several subsets, using the size of the new positive instance set as the reference size; (4) combine each negative subset with the positive instance set to construct several balanced subsets; (5) train one classifier per balanced subset with an extreme learning machine, and combine the trained classifiers by majority voting to classify new instances. The method is compared experimentally with three related methods on five two-class imbalanced big data sets, and the results show that it outperforms all three.
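The five steps can be illustrated with a small single-machine sketch. It is not the authors' implementation and rests on several assumptions: the nearest-neighbor search is a brute-force in-memory loop rather than a MapReduce job, a nearest-centroid rule stands in for the extreme learning machine, new points are drawn only from the half of the segment nearer the positive instance (`max_t=0.5`), and the "new positive instance set" is taken to be the original positives plus the synthetic ones. All function and parameter names are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_negative(p, negatives):
    """Step 1 (in-memory stand-in for the MapReduce search):
    return the negative instance closest to the positive instance p."""
    return min(negatives, key=lambda n: sq_dist(p, n))


def oversample(positives, negatives, per_point, rng, max_t=0.5):
    """Step 2: create per_point synthetic positives on the segment from each
    positive to its nearest negative. max_t is an assumed cap that keeps new
    points in the half of the segment nearer the positive instance."""
    synthetic = []
    for p in positives:
        q = nearest_negative(p, negatives)
        for _ in range(per_point):
            t = rng.uniform(0.0, max_t)
            synthetic.append([a + t * (b - a) for a, b in zip(p, q)])
    return synthetic


def partition_negatives(negatives, size, rng):
    """Step 3: shuffle the negatives and split them into subsets of
    roughly `size` instances each (at least one subset)."""
    shuffled = negatives[:]
    rng.shuffle(shuffled)
    k = max(1, len(shuffled) // size)
    return [shuffled[i::k] for i in range(k)]


def centroid(rows):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def train_centroid(pos, neg):
    """Stand-in base classifier (NOT an extreme learning machine):
    predict 1 if x is at least as close to the positive centroid."""
    cp, cn = centroid(pos), centroid(neg)
    return lambda x: 1 if sq_dist(x, cp) <= sq_dist(x, cn) else 0


def train_ensemble(positives, negatives, per_point=2, seed=0):
    """Steps 2-5: oversample, partition, and train one classifier
    per balanced subset (all positives + one negative subset)."""
    rng = random.Random(seed)
    pos_all = positives + oversample(positives, negatives, per_point, rng)
    subsets = partition_negatives(negatives, len(pos_all), rng)
    return [train_centroid(pos_all, neg) for neg in subsets]


def predict(ensemble, x):
    """Majority vote over the base classifiers; ties go to the positive class."""
    votes = sum(clf(x) for clf in ensemble)
    return 1 if 2 * votes >= len(ensemble) else 0
```

With 5 positives, 60 negatives and `per_point=2`, the positive set grows to 15 instances, so the negatives are split into four subsets of 15 and the ensemble holds four classifiers. In a distributed setting the `nearest_negative` loop is the part that would be expressed as map and reduce tasks.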

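Cite this article

Zhai Junhai, Zhang Mingyang, Wang Chenxi, Liu Xiaomeng, Wang Yaoda. Binary ensemble classification for imbalanced big data based on MapReduce and upper sampling[J]. Journal of Data Acquisition and Processing (数据采集与处理), 2018, 33(3): 416-425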

History
  • Received: 2016-06-07
  • Last revised: 2016-11-29
  • Published online: 2018-07-09