Research Progress in Target Audio Processing Methods Based on Pre-trained Models
Author: LIU Ju, MA Hao, LI Xiaohang, LI Yukai, SI Yuan, XING Zhikun, WANG Zhihan, SHAO Mingjie

Affiliation: 1. School of Information Science and Engineering, Shandong University, Qingdao 266237, China; 2. Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China

CLC Number: TP183
    Abstract:

    Target audio processing aims to recover or identify a specific target sound source from mixed audio signals based on user-provided cues. As an important branch of audio signal processing and machine listening, it plays a vital role in a wide range of applications, including human-computer interaction, smart office environments, assistive technologies, and multimedia forensics. In recent years, the emergence of large-scale pre-trained models has opened up new possibilities for target audio processing by significantly improving representation learning, cross-modal understanding, and adaptation to low-resource conditions. This paper presents an overview of the recent research progress made by our team in this area, with particular emphasis on the integration of pre-trained models into target audio processing frameworks. First, we review the research status of several related tasks, including target speaker automatic speech recognition, speech extraction, target audio extraction, and sound source separation, and introduce representative pre-trained models such as Whisper and contrastive language-audio pretraining (CLAP) together with parameter-efficient fine-tuning strategies. Focusing on the tasks of target audio extraction and target speaker recognition, we then summarize our recent studies, including a contrastive-learning-based multimodal query method for target audio extraction, a language-queried target audio extraction method that removes the reliance on paired training data, a multitask-learning-based method for target speaker speech extraction, and a prompt-tuning-based method for target speaker automatic speech recognition. These studies have achieved substantial advances in multimodal generalization, reduction of labeled-data dependence, preservation of target semantic information, and parameter-efficient model adaptation. 
We further show that the combination of pre-trained models and task-oriented fine-tuning provides an effective pathway toward more robust and flexible target audio processing systems. Finally, we discuss several future research directions, including improving inference efficiency, promoting deeper multimodal fusion, enhancing open-domain generalization, and developing universal foundation models for target audio processing.
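The CLAP model referenced above is trained by contrastively aligning paired audio and text embeddings so that matching pairs score higher than mismatched ones. Below is a minimal, stdlib-only sketch of a symmetric InfoNCE-style objective of the kind used in CLAP-style language-audio pretraining; the function name, fixed temperature, and toy embeddings are illustrative assumptions, not the implementation used in the surveyed work.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def clap_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired audio/text embeddings.

    Pair i (audio_embs[i], text_embs[i]) is the positive; every other
    pairing in the batch serves as a negative.
    """
    a = [l2_normalize(v) for v in audio_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(a)
    # Cosine-similarity logits, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], t[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy(rows):
        # Mean negative log-likelihood of the diagonal (matching) entries,
        # computed with a max-shifted log-sum-exp for numerical stability.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    # Average the audio-to-text and text-to-audio directions.
    logits_t = [[logits[j][i] for j in range(n)] for i in range(n)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits_t))
```

With well-aligned pairs the loss approaches zero, while swapping the captions across items drives it up, which is the signal that pulls matching audio and text embeddings together in a shared space.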

Get Citation

LIU Ju, MA Hao, LI Xiaohang, LI Yukai, SI Yuan, XING Zhikun, WANG Zhihan, SHAO Mingjie. Research Progress in Target Audio Processing Methods Based on Pre-trained Models[J]. Journal of Data Acquisition and Processing, 2026, (2): 397-415.

History
  • Received: January 19, 2026
  • Revised: March 09, 2026
  • Online: April 15, 2026