Abstract: This paper proposes a two-stage deep-learning approach to target speech extraction that effectively improves single-channel speech separation. Discriminative extraction models tend to over-optimize objective metrics, introducing unnatural artifacts and distortions that degrade subjective auditory quality. To address this, we integrate a diffusion model as a second-stage refinement module. Formulated within a stochastic differential equation framework, this module regenerates and refines the preliminary output of the discriminative model, reducing the false-extraction rate while markedly improving the naturalness and harmonic-structure clarity of the target speech. On the WSJ0-2mix-extr dataset, the proposed method improves the NISQA metric, which models human auditory perception, by 11.18% (from 3.22 to 3.58), indicating a substantial gain in perceived speech quality and naturalness. CMOS-based subjective listening tests further confirm the method's effectiveness in improving speech clarity and intelligibility.