Abstract:Random sample partition (RSP) data model distributedly represents a big data set as a set of RSP data blocks stored on a computing cluster. The RSP data model guarantees that the probability distribution of each data block is statistically consistent to the probability distribution of whole big data set. Thus, each RSP data block is a random sample of big data set and can be used to estimate the statistical properties of big data set or establish the classification and regression models. Based on the RSP data model, the big data analysis can be conducted by analyzing RSP data blocks rather than the whole big data set. This significantly reduces the computational complexity and improves the computing performance of cluster system on big data analysis. In this paper, we firstly present the definition, basic theory and generation method of RSP. Second, we introduce an asymptotic ensemble learning framework called Alpha framework used for big data analysis. Third, we discuss the main big data analysis methods based on the RSP data model and Alpha framework, including data exploration & cleaning, probability density function estimation, supervised subspace learning, semi?supervised ensemble learning, clustering ensemble and outlier detection. Finally, we discuss the innovations and advantages of the RSP data model and Alpha framework in big data analysis by using the divide-and-conquer strategy on random samples.