Research Progress in Target Audio Processing Methods Based on Pre-trained Models

Affiliation:

1 School of Information Science and Engineering, Shandong University, Qingdao 266237, China; 2 Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China

Abstract:

Target audio processing aims to recover or identify a specific target sound source from a mixed audio signal, based on cues provided by the user. As an important branch of audio signal processing and machine listening, it is a key technology for applications such as human-computer interaction, smart office environments, assistive technologies, and multimedia forensics. In recent years, large-scale pre-trained models have opened up new possibilities for target audio processing by improving representation learning, cross-modal understanding, and adaptation to low-resource conditions. This paper gives an overview of our team's recent research in this area, with particular emphasis on integrating pre-trained models into target audio processing frameworks. First, we review the state of research on several related tasks, including target speaker automatic speech recognition, speech extraction, target audio extraction, and sound source separation, and introduce representative pre-trained models such as Whisper and contrastive language-audio pretraining (CLAP), together with parameter-efficient fine-tuning techniques. Focusing on target audio extraction and target speaker recognition, we then summarize our recent studies: a contrastive-learning-based multimodal query method for target audio extraction, a language-queried target audio extraction method that does not require paired training data, a multitask-learning-based method for target speaker speech extraction, and a prompt-tuning-based method for target speaker automatic speech recognition. These methods respectively advance multimodal generalization, reduce the dependence on labeled data, preserve target semantic information, and improve parameter efficiency in model adaptation. We further show that combining pre-trained models with task-oriented fine-tuning provides an effective path toward more robust and flexible target audio processing systems. Finally, we discuss several future research directions, including improving inference efficiency, deeper multimodal fusion, better open-domain generalization, and building universal foundation models for target audio processing.

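Cite this article

刘琚, 马豪, 李晓航, 李玉楷, 司媛, 邢志坤, 王芷涵, 邵明杰. Research Progress in Target Audio Processing Methods Based on Pre-trained Models [J]. 数据采集与处理 (Journal of Data Acquisition and Processing), 2026, (2): 397-415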

History
  • Received: 2026-01-19
  • Last revised: 2026-03-09
  • Published online: 2026-04-15