基于MapReduce框架的分布式软K段主曲线算法
作者:
作者单位:

作者简介:

通讯作者:

基金项目:


Distributed Soft K-Segments Algorithm for Principal Curves Based on MapReduce
Author:
Affiliation:

Fund Project:

  • 摘要
  • |
  • 图/表
  • |
  • 访问统计
  • |
  • 参考文献
  • |
  • 相似文献
  • |
  • 引证文献
  • |
  • 资源附件
    摘要:

    传统的主曲线算法在小规模数据集上能获得良好的效果,但单节点的计算和存储能力都不能满足海量数据主曲线的提取要求,而算法分布式并行化是目前解决该类问题最有效的途径之一。本文提出基于MapReduce框架的分布式软K段主曲线算法 (Distributed soft k-segments principal curve,DisSKPC)。首先,基于分布式K-Means算法,采用递归粒化方法对数据集进行粒化,以确定粒的大小并保证粒中数据的关联性。然后调用软K段主曲线算法计算每个粒数据的局部主成分线段,并提出用噪声方差来消除在高密集、高曲率的数据区域可能产生的过拟合线段。最后借助哈密顿路径和贪婪算法连接这些局部主成分线段,形成一条通过数据云中间的最佳曲线。实验结果表明,本文所提出的DisSKPC算法具有良好的可行性和扩展性。

    Abstract:

    The traditional principal curves algorithm can obtain good results on small datasets. But the computing and storage resources of a single node cannot meet the requirements of the extraction of principal curves on massive datasets. Distributed parallel computing is one of the most effective way to solve the problems. Therefore, we proposed a distributed soft K-segments algorithm for principal curves based on MapReduce, named DisSKPC. First, we recursively granulated all the numerical data into information granules to limit each granular size and ensure the relevance of the data in the granules using the distributed K-Means algorithm. Then we calculated the local principal component segments of each granule and eliminated over-fitting segments that may arise in the area of high-density and high-curvature using the noise variance. Finally, we connected these local principal component segments using the Hamiltonian path and greedy algorithm, forming a best curve through the middle of the data cloud. Experimental results demonstrate the feasibility and scalability of the proposed DisSKPC algorithm.

    参考文献
    相似文献
    引证文献
引用本文

胡作梁张红云.基于MapReduce框架的分布式软K段主曲线算法[J].数据采集与处理,2017,32(3):507-515

复制
分享
文章指标
  • 点击次数:
  • 下载次数:
历史
  • 收稿日期:
  • 最后修改日期:
  • 录用日期:
  • 在线发布日期: 2017-06-28