Research Progress in Target Audio Processing Methods Based on Pre-trained Models
Author: LIU Ju, MA Hao, LI Xiaohang, LI Yukai, SI Yuan, XING Zhikun, WANG Zhihan, SHAO Mingjie

Affiliation: 1. School of Information Science and Engineering, Shandong University, Qingdao 266237, China; 2. Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China

CLC Number: TP183
    Abstract:

    Target audio processing aims to recover or identify a specific target sound source from mixed audio signals based on user-provided cues. As an important branch of audio signal processing and machine listening, it plays a vital role in a wide range of applications, including human-computer interaction, smart office environments, assistive technologies, and multimedia forensics. In recent years, the emergence of large-scale pre-trained models has opened up new possibilities for target audio processing by significantly improving representation learning, cross-modal understanding, and adaptation to low-resource conditions. This paper presents an overview of the recent research progress made by our team in this area, with particular emphasis on the integration of pre-trained models into target audio processing frameworks. First, we review the research status of several related tasks, including target speaker automatic speech recognition, speech extraction, target audio extraction, and sound source separation, and introduce representative pre-trained models such as Whisper and contrastive language-audio pretraining (CLAP) together with parameter-efficient fine-tuning strategies. Focusing on the tasks of target audio extraction and target speaker recognition, we then summarize our recent studies, including a contrastive-learning-based multimodal query method for target audio extraction, a language-queried target audio extraction method that removes the reliance on paired training data, a multitask-learning-based method for target speaker speech extraction, and a prompt-tuning-based method for target speaker automatic speech recognition. These studies have achieved substantial advances in multimodal generalization, reduction of labeled-data dependence, preservation of target semantic information, and parameter-efficient model adaptation. 
We further show that the combination of pre-trained models and task-oriented fine-tuning provides an effective pathway toward more robust and flexible target audio processing systems. Finally, we discuss several future research directions, including improving inference efficiency, promoting deeper multimodal fusion, enhancing open-domain generalization, and developing universal foundation models for target audio processing.
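The CLAP model referenced above is trained by contrastively aligning paired audio and text embeddings so that matching pairs score higher than mismatched ones. Below is a minimal, stdlib-only sketch of a symmetric InfoNCE-style objective of the kind used in CLAP-style language-audio pretraining; the function name, fixed temperature, and toy embeddings are illustrative assumptions, not the implementation used in the surveyed work.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clap_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    Pair i (audio_embs[i], text_embs[i]) is the positive; every other
    pairing in the batch serves as a negative.
    """
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(a)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # Mean negative log-likelihood of the diagonal (matching) entries,
        # computed with a max-shifted log-sum-exp for numerical stability.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the audio-to-text and text-to-audio directions.
    logits_t = [[logits[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits_t))
```

With well-aligned pairs the loss approaches zero, while swapping the captions across items drives it up, which is the signal that pulls matching audio and text embeddings together in a shared space.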

Get Citation

LIU Ju, MA Hao, LI Xiaohang, LI Yukai, SI Yuan, XING Zhikun, WANG Zhihan, SHAO Mingjie. Research Progress in Target Audio Processing Methods Based on Pre-trained Models[J]. Journal of Data Acquisition and Processing, 2026, (2): 397-415.

History
  • Received: January 19, 2026
  • Revised: March 09, 2026
  • Online: April 15, 2026