Abstract: Traditional clustering algorithms cannot meet the demands of current big data processing because they are limited by the memory and computing power of a single machine, so new solutions are needed. To address the problems of single-machine in-memory computation, and to exploit the iterative nature of clustering algorithms, a clustering system based on the Spark platform is proposed. First, the system applies different preprocessing strategies to the two types of data sets it handles, sparse and dense. Second, the performance of different clustering algorithms on the Spark platform is analyzed and the best-performing solution is identified. Finally, data persistence is used to improve computing speed. Experimental results show that the proposed system can effectively meet the requirements of clustering analysis on massive data.