Abstract: To address the scarcity of strongly annotated datasets and the sharp degradation of detection performance in real-world scenarios for polyphonic sound event detection, a polyphonic sound event detection method based on a transfer-learning convolutional retentive network is proposed. First, the method uses convolutional blocks with pre-trained weights to extract local features from the audio data. Next, the local features, together with orientation features, are fed into a residual feature enhancement module for feature fusion and channel dimension reduction. The fused features are then passed to a retentive network with regularization to further learn the temporal information in the audio data. Experimental results show that, compared with the champion system of the DCASE challenge, the proposed method reduces the error rate by 0.277 and 0.106 and improves the F1 score by 22.6% and 6.6% on the development and evaluation sets of the DCASE 2016 Task 3 dataset, respectively. On the development and evaluation sets of the DCASE 2017 Task 3 dataset, the error rates are reduced by 0.22 and 0.123, and the F1 scores improve by 17.2% and 14.4%, respectively.
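The core of the retentive network used for temporal modeling is the retention mechanism, which replaces attention with a decayed recurrent state. The sketch below is a hypothetical, simplified single-head retention in its recurrent form (not the paper's actual implementation): at each time step the state accumulates the outer product of key and value with an exponential decay factor gamma, and the output is the query applied to that state.

```python
import numpy as np

def retention_recurrent(Q, K, V, gamma=0.9):
    """Simplified single-head retention (recurrent form).

    Q, K, V: arrays of shape (T, d), one row per time frame.
    At step t the state S decays by gamma and accumulates
    the outer product k_t v_t^T; the output is q_t @ S.
    Hypothetical sketch for illustration, not the paper's code.
    """
    T, d = Q.shape
    S = np.zeros((d, d))          # recurrent state, carries decayed history
    outputs = np.zeros((T, d))
    for t in range(T):
        S = gamma * S + np.outer(K[t], V[t])  # decay old context, add new
        outputs[t] = Q[t] @ S                 # read out with the query
    return outputs
```

This recurrent form is mathematically equivalent to the parallel form o_t = sum over s <= t of gamma^(t-s) (q_t . k_s) v_s, which is why retentive networks can train in parallel yet run recurrently at inference with O(1) memory per step.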