Sequential Data Collection Method with Condensed Local Differential Privacy
Affiliation:

Nanjing University of Aeronautics and Astronautics

Fund Project:

Jiangsu Province Key Research and Development Plan (Industrial Foresight and Key Core Technology) project; National Natural Science Foundation of China project

    Abstract:

    Sequential data is widely used in real-world applications such as user behavior analysis, natural language processing, and recommender systems. However, privacy concerns limit the sharing and use of such data. To address this, researchers have proposed solutions based on condensed local differential privacy (CLDP), a metric-based relaxation of local differential privacy that offers better utility and flexibility. Existing solutions, however, fall short in capturing sequence patterns and in utility. To overcome these limitations, this paper proposes SCM-CLDP, a novel sequential data collection method based on CLDP. During collection, SCM-CLDP takes into account important information about the sequences, such as their lengths and transitions, which enables the data collector to synthesize a privacy-preserving dataset close to the original one. Specifically, depending on the object being perturbed, we propose two collection methods: SCM-VP, based on value perturbation, and SCM-TP, based on transition perturbation. We prove theoretically that SCM-VP and SCM-TP satisfy sequence-level CLDP, and we compare them experimentally with existing solutions on two real datasets in terms of Markov chain model accuracy, synthetic dataset utility, and frequent sequence pattern mining accuracy. The results show that SCM-CLDP has a significant advantage over existing solutions, with SCM-VP outperforming SCM-TP in most cases. In the best case, SCM-CLDP reduces the error of the Markov chain model and the distribution error of the synthetic dataset by at least one order of magnitude compared with existing methods. It also improves the accuracy of item frequency ranking in the synthetic dataset and the accuracy of frequent sequence pattern mining by nearly 30%.
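
    For readers unfamiliar with the term, the following is a minimal sketch of what a metric-based relaxation of local differential privacy can look like for a single value. It is not the SCM-VP or SCM-TP method, which this page does not specify. It assumes the commonly cited formulation of condensed local differential privacy, in which the probabilities of reporting any output y from two inputs x1 and x2 may differ by a factor of at most exp(alpha * d(x1, x2)), and it uses an exponential-mechanism-style sampler. The names `cldp_perturb`, `output_probabilities`, `alpha`, and `distance` are hypothetical and chosen for illustration only.

```python
import math
import random


def output_probabilities(value, domain, alpha, distance):
    """Probability of reporting each element of `domain` when the true input is `value`.

    Assumed mechanism (illustrative, not taken from the paper):
    Pr[y | value] is proportional to exp(-alpha * distance(value, y) / 2).
    """
    weights = [math.exp(-alpha * distance(value, y) / 2.0) for y in domain]
    total = sum(weights)
    return [w / total for w in weights]


def cldp_perturb(value, domain, alpha, distance, rng=random):
    """Sample one perturbed report; outputs closer to `value` are more likely."""
    probs = output_probabilities(value, domain, alpha, distance)
    return rng.choices(domain, weights=probs, k=1)[0]


if __name__ == "__main__":
    # Hypothetical ordinal domain with absolute difference as the metric.
    domain = list(range(5))
    report = cldp_perturb(2, domain, alpha=1.0, distance=lambda a, b: abs(a - b))
    print("perturbed report:", report)
```

    Under this assumption, nearby inputs are harder to tell apart than distant ones, which is the sense in which the guarantee depends on a metric rather than treating every pair of inputs alike.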

History
  • Received: 2024-06-19
  • Last revised: 2024-10-22
  • Accepted: 2025-02-24