Speech Deepfake Attribution: The State of the Art and Prospects
Author:
Affiliation:

1. Army Engineering University of PLA, Nanjing 210007, China; 2. Information Support Force Engineering University, Wuhan 430000, China

Author biography:

Corresponding author:

Fund Project:

National Natural Science Foundation of China (Nos. 62371469, 62071484).

    Abstract:

    With the rapid evolution of generative artificial intelligence, speech deepfake technologies have achieved unprecedented realism, enabling the synthesis of highly natural and speaker-specific speech from only a few seconds of reference audio. The resulting speech is perceptually hard to distinguish from genuine speech, which poses serious challenges to information security, judicial forensics, and social trust. Traditional countermeasures have primarily focused on binary real/fake detection, but such approaches are insufficient for forensic investigation, legal accountability, and security governance. In real-world adversarial scenarios, it is not enough to determine whether speech is fake; it is equally critical to identify how it was generated, whose voice characteristics were exploited, and which specific model instance may have been involved. This shift from "detection" to "attribution" marks a fundamental transformation in speech security research. This paper presents a comprehensive survey of speech deepfake attribution, organizing the field into a hierarchical forensic framework of three progressive tasks: forgery method attribution, source speaker attribution, and model inversion. Forgery method attribution aims to identify the generative architecture or vocoder family that produced the fake speech by exploiting intrinsic "model fingerprints" embedded in the spectral, temporal, and phase domains. Source speaker attribution focuses on recovering or verifying the identity of the original speaker whose voice was converted, leveraging residual prosodic, behavioral, and physiological cues that survive imperfect disentanglement in voice conversion systems. Model inversion is a deeper forensic objective: it attempts to infer specific model parameters or configurations from generated speech, thereby bridging the gap between class-level attribution and instance-level accountability. The core mechanisms that make each subtask feasible are explained from the perspectives of generative model principles and the acoustic characteristics of speech signals. The research status, mainstream methods, and technical evolution of each subtask are then organized along dimensions such as system architecture and training strategy. Benchmark datasets and evaluation metrics for both closed-set and open-set scenarios are also summarized. Finally, the paper discusses open challenges, including open-world generalization, robustness under complex channel distortions and neural codecs, adversarial attacks, and ethical constraints related to privacy and legal admissibility. Future directions are outlined toward proactive traceability, model-level reverse engineering, robust feature disentanglement, and the integration of active watermarking with passive forensic techniques. The survey aims to provide a structured roadmap for advancing speech deepfake attribution and fostering a trustworthy digital speech ecosystem.

    Highlights:

    1. A hierarchical framework for speech deepfake attribution is established, unifying forgery method attribution, source speaker attribution, and model inversion into a progressive forensic paradigm that goes beyond binary real/fake detection.
    2. The intrinsic mechanisms of attribution are analyzed from generative model fingerprints and acoustic signal characteristics, revealing how architectural design, training strategies, and inference processes leave distinguishable trace patterns.
    3. Open-world robustness, complex channel conditions, and model instance reverse engineering are identified as key challenges, with future directions proposed toward proactive traceability and a comprehensive speech security defense ecosystem.

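Cite this article

张雄伟, 张强, 孙蒙, 杨吉斌, 李毅豪, 葛晓义. Speech Deepfake Attribution: The State of the Art and Prospects[J]. Journal of Data Acquisition and Processing (数据采集与处理), 2026, (2): 347-370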

History
  • Received: 2026-01-10
  • Last revised: 2026-02-27
  • Accepted:
  • Published online: 2026-04-15