An End-to-End Singing Voice Synthesis Method with Excitation and Vibrato Modeling

DOI: 10.16337/j.1004-9037.2024.02.013

Authors: ZHOU Xiao, HU Yajun, PAN Jia, HU Guoping, LING Zhenhua
Affiliation: 1. iFLYTEK Co., Ltd., Hefei 230088, China; 2. School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China
Fund project: Science and Technology Innovation 2030 Major Project on "New Generation Artificial Intelligence" (2020AAA0103600).

Abstract:

In recent years, singing voice synthesis has developed rapidly, and VISinger, an end-to-end method based on variational inference and normalizing flows, has become a mainstream approach. However, its output still falls short of real singing, mainly in three respects: perceptually discontinuous pitch, poorly synthesized vibrato, and unstable articulation. This paper proposes three targeted improvements. First, to improve fundamental frequency (F0) stability, an excitation module is added to the decoder so that F0 information is provided to the decoder explicitly in the form of an excitation signal. Second, to address unnatural vibrato, a vibrato prediction module is added that explicitly models vibrato in the singing voice using a flow-based model with variational data augmentation. Third, a ReZero strategy is added to the frame prior network. Experimental results show that adding the excitation signal improves the stability of the synthesized F0, that vibrato modeling significantly improves vibrato reconstruction, and that the ReZero strategy brings some improvement in training speed and articulation stability. In subjective listening tests, the proposed model is significantly more natural than VISinger, reaching a mean opinion score (MOS) of 3.95, and it also clearly outperforms the two-stage method DiffSinger+HiFiGAN, which demonstrates the effectiveness of the proposed method.
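To make the idea of supplying F0 to a decoder "in the form of an excitation signal" more concrete, the following is a minimal sketch and not the paper's implementation. The abstract does not say how the excitation is constructed, so the sketch assumes a sine-based excitation obtained by integrating the upsampled F0 into a phase, in the spirit of source-filter style vocoders. The vibrato here is a fixed sinusoidal modulation used only for illustration; the paper instead predicts vibrato with a flow-based module. The function names `add_vibrato` and `sine_excitation`, and all parameter values (vibrato rate and depth, amplitudes, frame and sample rates), are hypothetical.

```python
import numpy as np


def add_vibrato(f0_hz, frame_rate, rate_hz=5.5, depth_cents=50.0):
    """Apply a sinusoidal pitch modulation to a frame-level F0 contour.

    Illustrative stand-in only: rate and depth are assumed constants,
    whereas the paper predicts vibrato with a learned module.
    Frames with F0 == 0 are treated as unvoiced and left at 0.
    """
    f0_hz = np.asarray(f0_hz, dtype=float)
    t = np.arange(len(f0_hz)) / frame_rate            # frame times in seconds
    cents = depth_cents * np.sin(2.0 * np.pi * rate_hz * t)
    return np.where(f0_hz > 0, f0_hz * 2.0 ** (cents / 1200.0), 0.0)


def sine_excitation(f0_hz, frame_rate, sample_rate,
                    amplitude=0.1, noise_std=0.003, seed=0):
    """Turn a frame-level F0 contour into a sample-level excitation signal.

    Voiced samples: sine wave whose phase is the running sum of F0,
    plus a small amount of noise. Unvoiced samples: noise only.
    This construction is an assumption, not taken from the paper.
    """
    if sample_rate % frame_rate != 0:
        raise ValueError("sample_rate must be a multiple of frame_rate")
    hop = sample_rate // frame_rate
    # Nearest-neighbour upsampling from frame rate to sample rate.
    f0_up = np.repeat(np.asarray(f0_hz, dtype=float), hop)
    # Instantaneous phase = 2*pi * cumulative sum of (F0 / sample_rate).
    phase = 2.0 * np.pi * np.cumsum(f0_up / sample_rate)
    rng = np.random.default_rng(seed)
    voiced_part = amplitude * np.sin(phase) + rng.normal(0.0, noise_std, f0_up.shape)
    unvoiced_part = rng.normal(0.0, amplitude / 3.0, f0_up.shape)
    return np.where(f0_up > 0, voiced_part, unvoiced_part)


# Usage: 0.8 s of a 220 Hz note followed by 0.2 s of silence.
frame_rate, sample_rate = 100, 24000
f0 = np.concatenate([np.full(80, 220.0), np.zeros(20)])
f0_vib = add_vibrato(f0, frame_rate)
excitation = sine_excitation(f0_vib, frame_rate, sample_rate)
print(excitation.shape)
```

Because the phase is accumulated from F0 sample by sample, the pitch of the excitation follows the F0 contour (including any vibrato) continuously, which is one plausible reason an explicit excitation could help a decoder produce stable pitch.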

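Cite this article

ZHOU Xiao, HU Yajun, PAN Jia, HU Guoping, LING Zhenhua. An End-to-End Singing Voice Synthesis Method with Excitation and Vibrato Modeling[J]. Journal of Data Acquisition and Processing (数据采集与处理), 2024, (2): 406-415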

History

Received: 2022-12-13
Last revised: 2023-01-20
Published online: 2024-04-22
