Abstract: To address the data redundancy and load imbalance problems in the existing parallel FP-Growth algorithm, an improved algorithm with redundancy pruning and load balancing is proposed. First, the improved algorithm introduces a group task estimation method into the high-frequency grouping strategy: the longest path in the maximum pattern tree and the highest frequency are used as the estimate of each group's task. When the estimate for a group is much larger than those of the other groups, that group's task is redistributed evenly among the other groups. Then, repeated elements are removed from the lists of the different groups. Experimental results show that the improved algorithm avoids data skew in MapReduce and outperforms the original algorithm in execution efficiency and speedup.
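The grouping steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. Several details are assumptions made only for illustration: the `Item` record and its `path_len` field, the choice to combine the two estimation quantities as a product, the `ratio` threshold standing in for "much larger", the round-robin redistribution to the other groups, and the rule of keeping the first occurrence when removing repeated elements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    name: str
    freq: int       # support count of the item
    path_len: int   # hypothetical: longest path length for this item


def estimate(group):
    """Assumed load estimate: longest path length times highest frequency."""
    if not group:
        return 0
    return max(i.path_len for i in group) * max(i.freq for i in group)


def rebalance(groups, ratio=2.0):
    """If the heaviest group exceeds `ratio` times the mean estimate of the
    other groups, spread its items round-robin over those other groups."""
    groups = [list(g) for g in groups]
    if len(groups) < 2:
        return groups
    heavy = max(range(len(groups)), key=lambda j: estimate(groups[j]))
    others = [j for j in range(len(groups)) if j != heavy]
    mean_others = sum(estimate(groups[j]) for j in others) / len(others)
    if estimate(groups[heavy]) > ratio * mean_others:
        moved, groups[heavy] = groups[heavy], []
        for k, item in enumerate(moved):
            groups[others[k % len(others)]].append(item)
    return groups


def dedupe(groups):
    """Remove items repeated across group lists, keeping the first occurrence."""
    seen = set()
    result = []
    for group in groups:
        kept = []
        for item in group:
            if item.name not in seen:
                seen.add(item.name)
                kept.append(item)
        result.append(kept)
    return result


# Example with made-up data: the first group dominates the estimates.
example_groups = [
    [Item("a", 100, 10), Item("b", 90, 9), Item("c", 80, 8)],
    [Item("d", 5, 2)],
    [Item("e", 4, 2), Item("d", 5, 2)],
]
pruned = dedupe(rebalance(example_groups))
```

In a MapReduce setting, each resulting list would correspond to the work assigned to one reducer; the sketch only shows the assignment logic, not the distributed execution.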