Random Sample Partition Data Model and Related Technologies for Big Data Analysis

doi:10.16337/j.1004-9037.2019.03.001

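Affiliations: 1. Big Data Institute, College of Computer Science & Software Engineering, Shenzhen University, Shenzhen 518060, China; 2. National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen 518060, China
Funding: National Key R&D Program of China (2017YFC0822604-2); China Postdoctoral Science Foundation (2016T90799); Shenzhen University 2018 Research Start-up Fund for Newly Recruited Teachers (2018060); Guangdong Province Major National-Level Cultivation Fund for Regular Universities (2014GKXM054).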

Random Sample Partition Data Model and Related Technologies for Big Data Analysis

Affiliation:

1.Big Data Institute, College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, 518060, China;2.National Engineering Laboratory for Big Data System Computing Technology, Shenzhen, 518060, China

Abstract (translated from the Chinese):

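This paper presents a new data management and analysis model for big data, the random sample partition (RSP) model, which represents a big data file as a set of RSP data block files stored in a distributed manner across the nodes of a cluster. The RSP generation operation keeps the distribution of each RSP data block statistically consistent with the distribution of the whole big data set. Each RSP data block is therefore a random sample of the big data set and can be used to estimate its statistical properties or to build classification and regression models for it. With the RSP model, a big data analysis task can be completed by analyzing RSP data blocks rather than computing over the entire data set, which greatly reduces the amount of computation, lowers the demand on computing resources, and improves the computing capacity and scalability of the cluster system. The paper first gives the definition, theoretical basis and generation method of the RSP model. It then introduces the Alpha computing framework, an asymptotic ensemble learning framework built on RSP data blocks. Next, it discusses big data analysis techniques based on the RSP model and the Alpha framework, including data exploration and cleaning, probability density function estimation, supervised subspace learning, semi-supervised ensemble learning, clustering ensemble and outlier detection. Finally, it discusses the innovations of the RSP model with respect to divide-and-conquer big data analysis and sampling methods, and the advantages of the RSP model and the Alpha framework for large-scale data analysis.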
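The central idea, that each block is itself a random sample and can stand in for the whole data set, can be illustrated with a small sketch. The abstract does not describe the paper's actual generation method, so the version below assumes the simplest option: shuffle all records uniformly at random and cut the result into blocks. The function name `make_rsp_blocks`, the synthetic Gaussian data and the block count are illustrative assumptions, not taken from the paper.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Shuffle the records uniformly, then cut them into num_blocks blocks.

    Assumption: a plain in-memory shuffle-and-split, not the paper's
    distributed generation procedure.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding through the shuffled list gives disjoint blocks whose
    # sizes differ by at most one record.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


# Synthetic stand-in for a large data set (illustrative only).
data_rng = random.Random(42)
data = [data_rng.gauss(10.0, 2.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)

full_mean = statistics.fmean(data)
block_mean = statistics.fmean(blocks[0])  # estimate from one block of 1,000 records
print(f"full mean = {full_mean:.3f}, one-block estimate = {block_mean:.3f}")
```

Because the shuffle is uniform, every block is a simple random sample without replacement from the full data, so a statistic computed on one block is an estimate of the same statistic on the whole set, at a fraction of the cost.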

Abstract:

The random sample partition (RSP) data model represents a big data set as a set of RSP data blocks distributed across a computing cluster. The RSP data model guarantees that the probability distribution of each data block is statistically consistent with the probability distribution of the whole big data set. Thus, each RSP data block is a random sample of the big data set and can be used to estimate the statistical properties of the big data set or to build classification and regression models. Based on the RSP data model, big data analysis can be conducted by analyzing RSP data blocks rather than the whole big data set. This significantly reduces the computational complexity and improves the computing performance of the cluster system on big data analysis. In this paper, we first present the definition, basic theory and generation method of RSP. Second, we introduce an asymptotic ensemble learning framework, called the Alpha framework, for big data analysis. Third, we discuss the main big data analysis methods based on the RSP data model and the Alpha framework, including data exploration and cleaning, probability density function estimation, supervised subspace learning, semi-supervised ensemble learning, clustering ensemble and outlier detection. Finally, we discuss the innovations and advantages of the RSP data model and the Alpha framework in big data analysis using the divide-and-conquer strategy on random samples.
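The abstract names the Alpha framework only as an asymptotic ensemble learning framework and gives no algorithmic detail. The sketch below is therefore not the Alpha framework itself; it is a loose illustration of the general pattern of processing a few random-sample blocks at a time and stopping once the combined result stabilizes. It uses per-block means in place of trained models, and the function name `incremental_block_estimate`, the batch size and the tolerance are all assumptions made for the example.

```python
import random
import statistics


def incremental_block_estimate(blocks, batch_size=5, tol=1e-2, seed=0):
    """Average per-block means, adding batch_size randomly chosen blocks at a
    time, until the running estimate changes by less than tol.

    Returns (estimate, number_of_blocks_used).
    """
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)  # pick blocks without replacement
    block_means = []
    previous = None
    current = None
    for start in range(0, len(order), batch_size):
        for idx in order[start:start + batch_size]:
            block_means.append(statistics.fmean(blocks[idx]))
        current = statistics.fmean(block_means)
        # Stop as soon as one more batch barely moves the estimate.
        if previous is not None and abs(current - previous) < tol:
            break
        previous = current
    return current, len(block_means)


# Synthetic data, shuffled and split into 50 equal blocks (illustrative only).
data_rng = random.Random(7)
data = [data_rng.gauss(5.0, 1.0) for _ in range(50_000)]
data_rng.shuffle(data)
blocks = [data[i::50] for i in range(50)]

estimate, used = incremental_block_estimate(blocks)
print(f"estimate = {estimate:.3f} using {used} of {len(blocks)} blocks")
```

The stopping rule is what keeps the cost below a full pass over the data: if the estimate settles after a few batches, the remaining blocks are never read. In a learning setting the per-block mean would be replaced by a base model trained on each block and the stopping test by a check on ensemble accuracy, but those details are beyond what the abstract states.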

Cite this article:

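HUANG Zhexue, HE Yulin, WEI Chenghao, ZHANG Xiaoliang. Random sample partition data model and related technologies for big data analysis[J]. Journal of Data Acquisition and Processing, 2019, 34(3): 373-385.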

History:

Received: 2018-08-23
Last revised: 2019-03-01
Published online: 2019-06-12
