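Lightweight Blind Enhancement of Bone-Conducted Speech by Fusing a Convolutional Network with a Residual Long Short-Term Memory Network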

doi:10.16337/j.1004-9037.2021.05.007

2021, Issue 5, pp. 921-931

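Affiliations (from the Chinese record): 1. College of Command and Control Engineering, Army Engineering University, Nanjing 210007; 2. Rocket Force Sergeant School, Qingzhou 262500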
Funding: Supported by the National Natural Science Foundation of China (62071484).

Lightweight Model for Bone-Conducted Speech Enhancement Based on Convolution Network and Residual Long Short-Time Memory Network


Affiliation:

1. College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210007, China; 2. Department of Test and Control, High-Tech Institute, Qingzhou 262500, China



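Abstract (translated from the Chinese):

Deep-learning-based blind enhancement of bone-conducted speech has achieved good results, but problems such as large model size and high computational complexity remain. To address them, a lightweight deep learning model for bone-conducted speech enhancement is proposed that fuses a convolutional network with a residual long short-term memory network; it improves the efficiency of blind enhancement while preserving enhancement quality. Exploiting the small parameter count and strong feature-extraction ability of convolutional networks, the model introduces convolutional structures along the frequency dimension of the spectrogram, mining the details of the time-frequency structure and the correlations between high- and low-frequency information to extract new features. These features are fed into an improved long short-term memory network to recover high-frequency components and reconstruct the speech signal. Experiments on a bone-conducted speech database show that the proposed model effectively improves the time-frequency structure of the high-frequency components and, while improving enhancement performance, reduces both model size and the computational complexity of inference.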

Abstract:

Bone-conducted speech enhancement based on deep learning has recently reached a milestone. However, several issues still hinder its real-world application, notably large model size and high computational complexity. In this paper, a lightweight deep learning model is proposed to improve the efficiency of bone-conducted speech enhancement. Because convolutional networks extract features effectively with few parameters, convolutional structures are introduced along the frequency dimension of the spectrogram. These structures capture the fine time-frequency details of the spectrogram and explore the relationship between high- and low-frequency components. The features extracted by the CNN are then fed into an improved long short-term memory network to recover high-frequency component information and reconstruct the speech signal. Experiments on a bone-conducted speech database show that the proposed model can reconstruct the time-frequency details of the high-frequency components. The model improves enhancement performance while reducing both model size and computational complexity.
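
The data flow described in the abstract (convolution along the frequency axis of the spectrogram, then a recurrent network with a residual path, then reconstruction of the full spectrum) can be illustrated with a minimal NumPy sketch. It is not the authors' model: the abstract gives no layer sizes, so the kernel count, kernel width, stride, the form of the residual connection (output = LSTM hidden state + input), the linear output layer, and the names `conv1d_freq`, `ResidualLSTM`, and `enhance` are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np


def conv1d_freq(spec, kernels, stride=2):
    """Convolve every frame along the frequency axis (no padding), then ReLU.

    spec:    (T, F) magnitude spectrogram, T frames by F frequency bins
    kernels: (C, K) bank of C filters of width K (sizes are assumptions)
    returns: (T, C, F_out) with F_out = (F - K) // stride + 1
    """
    T, F = spec.shape
    C, K = kernels.shape
    F_out = (F - K) // stride + 1
    out = np.empty((T, C, F_out))
    for j in range(F_out):
        window = spec[:, j * stride:j * stride + K]  # (T, K) slice of bins
        out[:, :, j] = window @ kernels.T            # (T, C) filter responses
    return np.maximum(out, 0.0)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class ResidualLSTM:
    """One LSTM layer whose output is hidden state plus input (assumed form)."""

    def __init__(self, dim, rng):
        # Input and hidden sizes are equal so the skip connection type-checks.
        self.dim = dim
        self.W = rng.standard_normal((4 * dim, 2 * dim)) * 0.1
        self.b = np.zeros(4 * dim)

    def __call__(self, x):
        # x: (T, dim) sequence of per-frame feature vectors
        h = np.zeros(self.dim)
        c = np.zeros(self.dim)
        out = np.empty_like(x)
        for t in range(x.shape[0]):
            z = self.W @ np.concatenate([x[t], h]) + self.b
            i, f, g, o = np.split(z, 4)              # gate pre-activations
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            out[t] = h + x[t]                        # residual path
        return out


def enhance(spec, rng):
    """Map a (T, F) spectrogram to a (T, F) estimate with untrained weights."""
    T, F = spec.shape
    kernels = rng.standard_normal((4, 5)) * 0.1      # 4 filters, width 5 (assumed)
    feats = conv1d_freq(spec, kernels).reshape(T, -1)
    feats = ResidualLSTM(feats.shape[1], rng)(feats)
    W_out = rng.standard_normal((feats.shape[1], F)) * 0.1
    return feats @ W_out                             # back to F frequency bins
```

Calling `enhance(spec, np.random.default_rng(0))` on a `(T, F)` array returns an array of the same shape. The strided convolution shrinks the frequency axis before the recurrent layer, which is one plausible way a convolutional front end could reduce the size of the LSTM it feeds; whether the paper does this is not stated in the abstract.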