Abstract: Most existing deep-learning-based sound event detection methods use standard 2D convolutions, whose translation invariance is ill-suited to audio clips, making complex sound events difficult to detect. To address this issue, a hybrid convolutional neural network based on feature fusion is proposed. By estimating the distribution of the audio spectrum and generating adaptive convolutional kernels, the proposed model dynamically extracts feature maps that retain local details and remain physically consistent with the spectrum. In parallel, the model applies a self-attention mechanism to capture long-distance feature relationships. To bridge the semantic gap between local details and global relationships, a feature fusion module is proposed to combine them effectively. In addition, to further improve detection performance, an enhanced bidirectional gated recurrent unit with a multi-resolution attention module is proposed to refine the fused feature representations: it emphasizes frames where sound events tend to be active and suppresses those that tend to be background. Experimental results on the DCASE 2020 dataset show that the proposed model achieves an F1-score of 52.57%, outperforming existing methods.
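Since only the abstract is given here, the following is a minimal PyTorch sketch of the architecture it outlines, under assumed realizations of each component: adaptive convolution as a softmax-weighted mixture of candidate kernels gated by global spectrum statistics, a parallel multi-head self-attention branch, concatenation-based fusion, and a BiGRU whose frames are reweighted by a learned gate (a single-scale stand-in for the multi-resolution attention module). All names (`HybridBlock`, `AttentiveBiGRU`, `kernel_gate`, etc.) are hypothetical; this is not the authors' implementation.

```python
# Illustrative sketch only: hypothetical modules approximating the pipeline
# described in the abstract, not the paper's released code.
import torch
import torch.nn as nn


class HybridBlock(nn.Module):
    def __init__(self, channels: int, n_kernels: int = 4, heads: int = 4):
        super().__init__()
        # Adaptive convolution: softmax-weighted mixture of candidate kernels,
        # with weights predicted from global statistics of the input spectrum.
        self.kernels = nn.Parameter(
            torch.randn(n_kernels, channels, channels, 3, 3) * 0.02)
        self.kernel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(channels, n_kernels), nn.Softmax(dim=-1),
        )
        # Self-attention branch for long-distance feature relationships.
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Fusion: concatenate both branches, then project back.
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq, time)
        b, c, f, t = x.shape
        # --- adaptive convolution branch (per-sample mixed kernels) ---
        w = self.kernel_gate(x)                          # (b, n_kernels)
        mixed = torch.einsum("bk,koihw->boihw", w, self.kernels)
        local = torch.stack([
            nn.functional.conv2d(x[i:i + 1], mixed[i], padding=1)
            for i in range(b)
        ]).squeeze(1)                                    # (b, c, f, t)
        # --- self-attention branch over time-frequency positions ---
        seq = x.flatten(2).transpose(1, 2)               # (b, f*t, c)
        glob, _ = self.attn(seq, seq, seq)
        glob = glob.transpose(1, 2).reshape(b, c, f, t)
        # --- feature fusion of local details and global relationships ---
        return self.fuse(torch.cat([local, glob], dim=1))


class AttentiveBiGRU(nn.Module):
    """BiGRU whose outputs are gated by a learned frame-level attention,
    emphasizing frames likely to contain active events and suppressing
    background frames."""

    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.gru = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
        self.frame_gate = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h, _ = self.gru(x)             # (batch, time, 2*hidden)
        return h * self.frame_gate(h)  # downweight likely-background frames
```

One design note on this sketch: mixing candidate kernels per sample keeps the convolution input-adaptive while remaining differentiable, and the sigmoid frame gate is the simplest way to realize "emphasize active frames, suppress background"; the paper's multi-resolution attention would apply such gating at several temporal scales.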