Random Sample Partition Data Model and Related Technologies for Big Data Analysis
Author:
Affiliation:

1. Big Data Institute, College of Computer Science & Software Engineering, Shenzhen University, Shenzhen, 518060, China; 2. National Engineering Laboratory for Big Data System Computing Technology, Shenzhen, 518060, China

Clc Number:

TN911.73

    Abstract:

    The random sample partition (RSP) data model represents a big data set in distributed form as a set of RSP data blocks stored on a computing cluster. The model guarantees that the probability distribution of each data block is statistically consistent with that of the whole data set. Each RSP data block is therefore a random sample of the big data set and can be used to estimate its statistical properties or to build classification and regression models. With the RSP data model, big data analysis can be carried out on RSP data blocks rather than on the whole data set, which significantly reduces the computational complexity and improves the performance of the cluster system. In this paper, we first present the definition, basic theory, and generation method of RSP. Second, we introduce an asymptotic ensemble learning framework for big data analysis, called the Alpha framework. Third, we discuss the main big data analysis methods based on the RSP data model and the Alpha framework, including data exploration and cleaning, probability density function estimation, supervised subspace learning, semi-supervised ensemble learning, clustering ensemble, and outlier detection. Finally, we discuss the innovations and advantages of the RSP data model and the Alpha framework, which apply a divide-and-conquer strategy to random samples in big data analysis.
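
    The block-level idea in the abstract can be illustrated with a minimal sketch. It assumes a data set small enough to shuffle in memory and does not reproduce the paper's generation method on a cluster. The function names make_rsp_blocks and blockwise_mean, the batch size, and the stopping tolerance are hypothetical choices made for illustration only. The second function loosely mimics an asymptotic ensemble: it adds per-block estimates in batches and stops once the combined estimate stabilizes.

```python
import random
import statistics


def make_rsp_blocks(records, num_blocks, seed=0):
    """Shuffle the records uniformly, then cut them into near-equal blocks.

    Assumption: everything fits in memory. After a uniform shuffle, each
    block is a simple random sample (without replacement) of the data set.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    # Striding spreads any remainder evenly across the blocks.
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def blockwise_mean(blocks, batch_size=2, tol=1e-2, seed=0):
    """Estimate the data-set mean from a growing random subset of blocks.

    Hypothetical stopping rule: stop when adding one more batch of blocks
    changes the ensemble estimate by less than tol.
    """
    rng = random.Random(seed)
    order = list(range(len(blocks)))
    rng.shuffle(order)
    block_means = []
    previous = None
    estimate = None
    for start in range(0, len(order), batch_size):
        for idx in order[start:start + batch_size]:
            block_means.append(statistics.fmean(blocks[idx]))
        estimate = statistics.fmean(block_means)
        if previous is not None and abs(estimate - previous) < tol:
            break
        previous = estimate
    return estimate, len(block_means)


# Synthetic data standing in for a big data set (illustrative only).
data_rng = random.Random(42)
data = [data_rng.gauss(10.0, 2.0) for _ in range(100_000)]

blocks = make_rsp_blocks(data, num_blocks=100)
estimate, blocks_used = blockwise_mean(blocks)
full_mean = statistics.fmean(data)

print(f"estimate from {blocks_used} blocks: {estimate:.3f}")
print(f"mean of the full data set: {full_mean:.3f}")
```

    In this sketch the estimate is computed from a subset of the blocks, so the full data set is only read for the final comparison. On a real cluster the shuffle itself would have to be distributed, which is the part this example leaves out.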

Get Citation

Huang Zhexue, He Yulin, Wei Chenghao, Zhang Xiaoliang. Random Sample Partition Data Model and Related Technologies for Big Data Analysis[J].,2019,34(3):373-385.

History
  • Received: August 23, 2018
  • Revised: March 1, 2019
  • Online: June 12, 2019