Most existing deep learning-based sound event detection methods adopt conventional 2D convolution. However, its inherent translation invariance is poorly matched to audio spectrograms, limiting the model's ability to detect complex sound events. To address this issue, a hybrid convolutional neural network based on feature fusion is proposed. Specifically, the proposed model estimates the distribution of the audio spectrogram and adaptively generates convolutional kernels, dynamically extracting local features that remain physically consistent with the audio signal. In parallel, a self-attention mechanism captures long-range feature dependencies across the spectrogram. To bridge the semantic gap between local and global features, a feature fusion module is designed to integrate these two distinct representations effectively. Furthermore, to enhance detection performance, an improved bidirectional gated recurrent unit based on a multi-scale attention mechanism refines the fused features, emphasizing event-related frames and suppressing background frames. Experimental results on the DCASE2020 dataset show that the proposed model achieves an F1-score of 52.57%, outperforming existing methods.
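For concreteness, the sketch below illustrates one possible reading of the architecture described above in PyTorch: a dynamic-convolution branch whose kernels are mixed from input statistics, a parallel self-attention branch, a fusion layer, and a bidirectional GRU with frame-level attention. All module names, layer sizes, the per-sample kernel-mixing scheme, and the simplified single-scale frame attention are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the hybrid model outlined in the abstract (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicConvBranch(nn.Module):
    """Local branch: a bank of kernels mixed adaptively from input statistics."""
    def __init__(self, in_ch, out_ch, n_kernels=4, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(n_kernels, out_ch, in_ch, k, k) * 0.02)
        self.router = nn.Sequential(              # predicts mixing weights per input
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(in_ch, n_kernels), nn.Softmax(dim=-1))
        self.k = k

    def forward(self, x):                         # x: (B, C, T, F)
        alpha = self.router(x)                    # (B, n_kernels)
        w = torch.einsum("bn,noikl->boikl", alpha, self.weight)   # per-sample kernels
        b, c, t, f = x.shape
        # Grouped-conv trick: fold the batch into groups to apply per-sample kernels.
        out = F.conv2d(x.reshape(1, b * c, t, f),
                       w.reshape(-1, c, self.k, self.k),
                       padding=self.k // 2, groups=b)
        return out.reshape(b, -1, t, f)


class GlobalAttentionBranch(nn.Module):
    """Global branch: self-attention over time-frequency positions."""
    def __init__(self, in_ch, dim=64, heads=4):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, dim, 1)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                         # x: (B, C, T, F)
        b, _, t, f = x.shape
        z = self.proj(x).flatten(2).transpose(1, 2)   # (B, T*F, dim)
        z, _ = self.attn(z, z, z)
        return z.transpose(1, 2).reshape(b, -1, t, f)


class HybridSED(nn.Module):
    def __init__(self, n_classes=10, in_ch=1, local_ch=64, global_ch=64):
        super().__init__()
        self.local_branch = DynamicConvBranch(in_ch, local_ch)
        self.global_branch = GlobalAttentionBranch(in_ch, dim=global_ch)
        # Fusion: concatenate both representations and learn a joint mapping.
        self.fuse = nn.Conv2d(local_ch + global_ch, 128, 1)
        self.rnn = nn.GRU(128, 64, bidirectional=True, batch_first=True)
        # Simplified single-scale stand-in for the multi-scale frame attention.
        self.frame_attn = nn.Linear(128, 1)
        self.head = nn.Linear(128, n_classes)

    def forward(self, x):                         # x: (B, 1, T, F) log-mel spectrogram
        h = torch.cat([self.local_branch(x), self.global_branch(x)], dim=1)
        h = self.fuse(h).mean(dim=-1)             # pool frequency -> (B, 128, T)
        h, _ = self.rnn(h.transpose(1, 2))        # (B, T, 128)
        a = torch.sigmoid(self.frame_attn(h))     # emphasize event-related frames
        return torch.sigmoid(self.head(h * a))    # frame-wise class probabilities


if __name__ == "__main__":
    model = HybridSED()
    dummy = torch.randn(2, 1, 64, 32)             # small batch of log-mel spectrograms
    print(model(dummy).shape)                     # torch.Size([2, 64, 10])
```

The per-sample kernels are applied with a grouped convolution, a common implementation trick for dynamic convolution; the actual model may generate kernels from spectrogram statistics in a different way, and its improved BiGRU and multi-scale attention are only approximated here.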