Speech Deepfake Attribution: The State of the Art and Prospects
Authors: ZHANG Xiongwei, ZHANG Qiang, SUN Meng, YANG Jibin, LI Yihao, GE Xiaoyi
Affiliations: 1. Army Engineering University of PLA, Nanjing 210007, China; 2. Information Support Force Engineering University, Wuhan 430000, China

CLC Number: TN912

Fund Project: National Natural Science Foundation of China (Nos. 62371469, 62071484).

    Abstract:

    With the rapid evolution of generative artificial intelligence, speech deepfake technologies have achieved unprecedented realism, enabling the synthesis of highly natural, speaker-specific speech from only a few seconds of reference audio. Traditional countermeasures have focused primarily on binary detection, but such approaches are insufficient for forensic investigation, legal accountability, and security governance. In real-world adversarial scenarios, it is not enough to determine whether speech is fake; it is equally critical to identify how it was generated, whose voice characteristics were exploited, and which specific model instance may have been involved. This paradigm shift from “detection” to “attribution” marks a fundamental transformation in speech security research. This paper presents a comprehensive survey of speech deepfake attribution, organizing the field into a hierarchical forensic framework comprising three progressive tasks: forgery method attribution, source speaker tracing, and model inversion. Forgery method attribution identifies the generative architecture or vocoder family responsible for producing the fake speech by exploiting intrinsic “model fingerprints” embedded in the spectral, temporal, and phase domains. Source speaker tracing recovers or verifies the identity of the original speaker whose voice was converted, leveraging residual prosodic, behavioral, and physiological cues that survive imperfect disentanglement in voice conversion systems. Model inversion pursues a deeper forensic objective: inferring specific model parameters or configurations from generated speech, thereby bridging the gap between class-level attribution and instance-level accountability. The core principles that make each subtask feasible are elaborated from the perspectives of both generative model mechanisms and the physical acoustic characteristics of speech signals.
The research status, mainstream methodologies, and technological evolution paths of each subtask are systematically organized along different dimensions, such as architectural frameworks and training strategies. Furthermore, benchmark datasets and evaluation metrics for both closed-set and open-set scenarios are summarized. Finally, the paper discusses emerging challenges, including open-world generalization, robustness under complex channel distortions and neural codecs, adversarial attacks, and ethical constraints related to privacy and legal admissibility. Future directions are outlined toward proactive traceability, model-level reverse engineering, robust feature disentanglement, and the integration of active watermarking with passive forensic techniques. The survey aims to provide a structured roadmap for advancing speech deepfake attribution and fostering a trustworthy digital speech ecosystem.

Highlights:
1. A hierarchical framework for speech deepfake attribution is systematically established, unifying forgery method attribution, source speaker tracing, and model inversion into a progressive forensic paradigm beyond binary real/fake detection.
2. The intrinsic mechanisms of attribution are analyzed from generative model fingerprints and acoustic signal characteristics, revealing how architectural design, training strategies, and inference processes leave distinguishable trace patterns.
3. Open-world robustness, complex channel conditions, and model instance reverse engineering are identified as key challenges, with future directions proposed toward proactive traceability and a comprehensive speech security defense ecosystem.
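To make the "model fingerprint" idea concrete, the sketch below is a minimal toy illustration (this editor's assumption, not a method from the survey): each generator is summarized by its average log-magnitude spectrum, and a query is attributed to the nearest fingerprint by cosine similarity. The two synthetic "vocoder" signals and the name `spectral_fingerprint` are hypothetical stand-ins for the spectral-domain artifacts the abstract describes.

```python
import numpy as np

def spectral_fingerprint(frames: np.ndarray) -> np.ndarray:
    """Average log-magnitude spectrum over frames: a crude 'model fingerprint'."""
    spectra = np.abs(np.fft.rfft(frames, axis=-1))
    return np.log(spectra + 1e-8).mean(axis=0)

def attribute(query_frames: np.ndarray, fingerprints: dict) -> str:
    """Closed-set attribution: pick the reference fingerprint nearest in cosine similarity."""
    q = spectral_fingerprint(query_frames)
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(fingerprints, key=lambda name: cos(q, fingerprints[name]))

rng = np.random.default_rng(0)
n_frames, frame_len = 200, 256
t = np.arange(frame_len)

# Two toy "generators" with distinct spectral artifacts (hypothetical stand-ins
# for vocoder families): a low-frequency tone vs. a high-frequency ripple.
def gen_a():
    return np.sin(2 * np.pi * 0.03 * t) + 0.3 * rng.standard_normal(frame_len)

def gen_b():
    return np.sin(2 * np.pi * 0.35 * t) + 0.3 * rng.standard_normal(frame_len)

fingerprints = {
    "vocoder_A": spectral_fingerprint(np.stack([gen_a() for _ in range(n_frames)])),
    "vocoder_B": spectral_fingerprint(np.stack([gen_b() for _ in range(n_frames)])),
}
query = np.stack([gen_a() for _ in range(20)])
print(attribute(query, fingerprints))  # prints "vocoder_A"
```

Real attribution systems replace these averaged spectra with learned embeddings and must additionally handle open-set queries from unseen generators, which this closed-set sketch deliberately ignores.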

Get Citation

ZHANG Xiongwei, ZHANG Qiang, SUN Meng, YANG Jibin, LI Yihao, GE Xiaoyi. Speech Deepfake Attribution: The State of the Art and Prospects[J]. Journal of Data Acquisition and Processing, 2026, (2): 347-370.

History
  • Received: January 10, 2026
  • Revised: February 27, 2026
  • Online: April 15, 2026