To better identify and classify environmental sounds, a multilevel residual network (Mul-EnvResNet) is proposed for environmental sound classification (ESC). After time stretching and pitch shifting are applied to the sound events, Mel-frequency cepstral coefficients (MFCCs) and their deltas are extracted as features and fed into Mul-EnvResNet for classification. Experiments are conducted on the ESC-50 data set, where Mul-EnvResNet is compared with the end-to-end convolutional neural network (EnvNet), the attention-based convolutional recurrent neural network (ACRNN), and unsupervised filterbank learning using a convolutional restricted Boltzmann machine (ConvRBM). The results show that Mul-EnvResNet achieves the highest classification accuracy, 89.32%, which is 18.32, 3.22, and 2.82 percentage points higher than these three models, respectively. It also shows a clear advantage over other sound classification methods.
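The feature step named above (MFCCs plus their deltas) can be illustrated with a minimal sketch. It assumes the common regression-based delta formula, d_t = Σ n (c_{t+n} − c_{t−n}) / (2 Σ n²), with edge frames repeated at the boundaries; the abstract does not state which delta definition, window width, or toolkit was actually used, so the function names and the `width=2` default are hypothetical.

```python
def delta(frames, width=2):
    """Regression-based delta of a feature sequence.

    frames: list of per-frame coefficient lists (T frames x D coefficients),
            e.g. MFCC vectors computed elsewhere.
    width:  frames used on each side of the current frame (assumed value).
    """
    if not frames:
        return []
    num_frames = len(frames)
    dim = len(frames[0])
    denom = 2 * sum(n * n for n in range(1, width + 1))
    out = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            acc = 0.0
            for n in range(1, width + 1):
                # Clamp indices so the first and last frames are repeated.
                ahead = frames[min(t + n, num_frames - 1)][d]
                behind = frames[max(t - n, 0)][d]
                acc += n * (ahead - behind)
            row.append(acc / denom)
        out.append(row)
    return out


def stack_with_deltas(frames, width=2):
    """Concatenate each frame with its delta, doubling the feature dimension."""
    deltas = delta(frames, width)
    return [f + d for f, d in zip(frames, deltas)]
```

For a coefficient that rises by one unit per frame, the interior deltas equal 1, and a constant coefficient yields 0, which is a quick sanity check on the formula. In practice the MFCCs themselves would come from an audio library rather than being written by hand.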
Table 1 Accuracy of different models and of shortcut connections with different convolution kernel sizes
Table 2 Classification accuracy and training time of different models
Fig.1 ESC process based on Mul-EnvResNet
Fig.2 Structure of the residual block
Fig.3 Structure of EnvResNet and its residual block
Fig.4 Structure of Mul-EnvResNet and its multilevel residual block
Fig.5 Training and test curves of Mul-EnvResNet
Table 3 Comparison of experimental results of various models on ESC-50