Abstract:Sequential data has a wide range of real-world applications, e.g., user behavior analysis, natural language processing, recommender systems, etc. However, privacy concerns have limited the sharing and use of such data. To overcome this challenge, scholars have proposed a solution based on condensed local differential privacy, which is a metric-based relaxation of local differential privacy with better utility and flexibility than local differential privacy. However, existing solutions are deficient in terms of sequence pattern capture and utility. To address these limitations, in this paper, we propose SCM-CLDP, a novel sequential data collection method based on condensed local differential privacy. SCM-CLDP fully takes into account important information such as the length and transitions of sequential data during the collection process, through which the data collector is able to synthesize privacy-preserving dataset close to the original dataset. Specifically, according to the different perturbation objects, we propose two collection methods, SCM-VP based on value perturbation and SCM-TP based on transition perturbation, respectively. We theoretically prove that SCM-VP and SCM-TP satisfy sequence-level condensed local differential privacy, and comparative experiments are conducted with existing solutions based on two real datasets in terms of Markov chain model accuracy, synthetic dataset utility, and frequent sequence pattern mining accuracy. The results show that SCM-CLDP performs significantly better than the existing solutions, with SCM-VP outperforming SCM-TP in most cases. In the optimal situation, SCM-CLDP reduces the error of the Markov chain model and the distribution of the synthetic dataset by at least one order of magnitude compared to the existing method. Meanwhile, SCM-CLDP improves the accuracy of item frequency ranking of the synthetic dataset and the accuracy of frequent sequence pattern mining by nearly 30% compared to existing solutions.