Multi-scale Expressive Chinese Speech Synthesis

1.College of Computer Science and Technology, Nanjing University of Aeronautics & Astronautics, Nanjing 211106, China;2.Central China Branch of State Grid Corporation of China, Wuhan 430070, China



    Abstract:

    Common methods for enhancing the expressiveness of synthesized speech typically encode the reference audio as a fixed-dimensional prosody embedding, which is then fed into the decoder of the speech synthesis model along with the text embedding, thereby introducing varying prosody information into the synthesis process. However, this approach captures prosody information only at the global (utterance) level, neglecting fine-grained prosody details at the word or phoneme level. Consequently, the synthesized speech may still exhibit unnatural pronunciation of certain words and flat intonation and speaking rate. To tackle these issues, this paper proposes a multi-scale expressive Chinese speech synthesis method based on Tacotron2. First, a multi-scale prosody encoding network built on variational auto-encoders extracts global-level prosody information and phoneme-level pitch information from the reference audio; this multi-scale information is then fed into the decoder of the speech synthesis model together with the text embedding. Additionally, during training we minimize the mutual information between the prosody embedding and the pitch embedding, eliminating the correlation between the two feature representations and disentangling them. Experimental results demonstrate that, compared with a single-scale expressive speech synthesis method, the proposed method improves the subjective mean opinion score (MOS) by about 2% and reduces the F0 frame error rate by about 14%, generating speech that is more natural and expressive.
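    The multi-scale encoding and disentanglement described above can be sketched numerically. The following NumPy sketch is illustrative only: the VAE reparameterization and KL terms follow the standard formulation, but the paper minimizes mutual information with a learned estimator, for which a simple cross-covariance penalty is substituted here as a linear-decorrelation proxy; all function and variable names are hypothetical.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)

    def reparameterize(mu, log_var):
        """VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, I)."""
        eps = rng.standard_normal(mu.shape)
        return mu + np.exp(0.5 * log_var) * eps

    def kl_divergence(mu, log_var):
        """KL(q(z|x) || N(0, I)) per sample, summed over latent dimensions."""
        return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

    def cross_covariance_penalty(u, v):
        """Decorrelation proxy for MI minimization: squared Frobenius norm of the
        cross-covariance between two batches of embeddings (zero when the two
        embeddings are linearly uncorrelated across the batch)."""
        u_c = u - u.mean(axis=0, keepdims=True)
        v_c = v - v.mean(axis=0, keepdims=True)
        cov = u_c.T @ v_c / (u.shape[0] - 1)
        return float(np.sum(cov**2))

    # Toy batch: B utterances, a global prosody latent of size Dg per utterance,
    # and T phonemes per utterance, each with a pitch latent of size Dp.
    B, T, Dg, Dp = 8, 20, 16, 4
    mu_g = rng.standard_normal((B, Dg))
    logvar_g = rng.standard_normal((B, Dg)) * 0.1
    mu_p = rng.standard_normal((B, T, Dp))
    logvar_p = rng.standard_normal((B, T, Dp)) * 0.1

    z_global = reparameterize(mu_g, logvar_g)   # utterance-level prosody embedding
    z_pitch = reparameterize(mu_p, logvar_p)    # phoneme-level pitch embeddings

    # Pool the phoneme-level latents so both embeddings are utterance-level
    # before measuring their statistical dependence.
    z_pitch_pooled = z_pitch.mean(axis=1)

    loss_kl = kl_divergence(mu_g, logvar_g).mean()
    loss_decor = cross_covariance_penalty(z_global, z_pitch_pooled)
    ```

    In a real system both losses would be added (with weights) to the Tacotron2 reconstruction loss, so that each latent keeps only the variation the other does not explain.
    
    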

Cite this article

高洁, 肖大军, 徐遐龄, 刘绍翰, 杨群. Multi-scale expressive Chinese speech synthesis[J]. 数据采集与处理 (Journal of Data Acquisition and Processing), 2023, 38(6): 1458-1468.

History
  • Received: 2023-01-13
  • Revised: 2023-06-28
  • Published online: 2023-11-25