Abstract: To address data clustering in incomplete information systems, set pair analysis theory is introduced into k-means clustering, and a set pair k-means (SPKM) clustering algorithm is constructed that better represents the relationship between samples and clusters. First, a set pair distance measure is proposed based on set pair theory and applied within the k-means algorithm to obtain preliminary clustering results. Then, samples that belong to multiple clusters simultaneously are assigned to the boundary regions of the corresponding clusters, whereas a sample that belongs to only one cluster is assigned to either the positive region or the boundary region of that cluster. Each cluster is thus described by three parts: the positive region, whose samples belong to the cluster; the boundary region, whose samples may belong to the cluster; and the negative region, whose samples do not belong to the cluster. Finally, six data sets from the UCI database and four comparison algorithms are selected for experimental evaluation. The experimental results show that the SPKM algorithm achieves good clustering performance in terms of accuracy, F1 value, Jaccard coefficient, FMI, and ARI.
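
The three-region representation described above can be illustrated with a minimal sketch. It is not the SPKM algorithm itself: the abstract does not define the set pair distance or the rule that separates positive from boundary samples, so the sketch makes two assumptions. First, `partial_distance` is a stand-in that uses a rescaled Euclidean distance over observed attributes only (missing values are `None`). Second, `three_way_assign` and its `tol` parameter are hypothetical: a sample is treated as belonging to every cluster whose distance is within a factor `1 + tol` of its nearest cluster, and, as a simplification, a sample with a single candidate cluster always goes to that cluster's positive region.

```python
import math


def partial_distance(x, c):
    """Distance over the attributes observed in x (None marks a missing value).

    Assumption: a rescaled Euclidean distance used as a stand-in;
    this is NOT the set pair distance proposed in the paper.
    """
    observed = [(a, b) for a, b in zip(x, c) if a is not None]
    if not observed:
        return math.inf  # nothing observed, so no cluster can be preferred
    sq = sum((a - b) ** 2 for a, b in observed)
    # Rescale so samples with few observed attributes remain comparable.
    return math.sqrt(sq * len(x) / len(observed))


def three_way_assign(samples, centers, tol=0.25):
    """Split sample indices into positive, boundary, and negative regions.

    Hypothetical rule: the candidate clusters of a sample are those whose
    distance is at most (1 + tol) times the distance to its nearest center.
    """
    k = len(centers)
    positive = [set() for _ in range(k)]
    boundary = [set() for _ in range(k)]
    for i, x in enumerate(samples):
        dists = [partial_distance(x, c) for c in centers]
        d_min = min(dists)
        candidates = [j for j, d in enumerate(dists) if d <= (1 + tol) * d_min]
        if len(candidates) == 1:
            # Simplification: a single candidate always yields a positive sample.
            positive[candidates[0]].add(i)
        else:
            # The sample belongs to several clusters: boundary of each.
            for j in candidates:
                boundary[j].add(i)
    everything = set(range(len(samples)))
    # The negative region of a cluster is whatever is neither positive nor boundary.
    negative = [everything - positive[j] - boundary[j] for j in range(k)]
    return positive, boundary, negative


samples = [(0.0, 0.0), (10.0, None), (5.0, 5.0)]
centers = [(0.0, 0.0), (10.0, 10.0)]
positive, boundary, negative = three_way_assign(samples, centers)
print(positive, boundary, negative)
```

In this toy input, the third sample is equidistant from both centers, so it lands in the boundary region of both clusters, while the second sample is assigned using only its one observed attribute.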