Multi-scale Expressive Chinese Speech Synthesis
Affiliation:

1. College of Computer Science and Technology, Nanjing University of Aeronautics & Astronautics, Nanjing 211106, China
2. Central China Branch of State Grid Corporation of China, Wuhan 430070, China

CLC Number: TP391

    Abstract:

    Common methods for enhancing the expressiveness of synthesized speech encode the reference audio as a fixed-dimensional prosody embedding, which is fed into the decoder of the speech synthesis model together with the text embedding to introduce prosody information into the synthesis process. However, this approach captures prosody only at the global (utterance) level and neglects fine-grained prosody at the word or phoneme level, so the synthesized speech may still exhibit unnatural pronunciation and flat intonation on certain words. To address these issues, this paper proposes a multi-scale expressive Chinese speech synthesis method based on Tacotron 2. First, two variational auto-encoders extract global-level prosody information and phoneme-level pitch information from the reference audio, and this multi-scale variational information is incorporated into the speech synthesis model. In addition, during training, the mutual information between the prosody embedding and the pitch embedding is minimized, which removes the correlation between the two feature representations and disentangles them. Experimental results show that, compared with a single-scale expressive speech synthesis method, the proposed method improves the subjective mean opinion score by 2% and reduces the F0 frame error rate by 14%, indicating that it generates speech that is more natural and expressive.
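    As a rough illustration (not the paper's code), the multi-scale conditioning described in the abstract can be sketched in a few lines: an utterance-level prosody vector from one VAE is broadcast to every phoneme position, then concatenated with the phoneme-level pitch vectors from the second VAE and the text embeddings to form the decoder input. All names and dimensions below are hypothetical placeholders, and random vectors stand in for the VAE outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 5 phonemes; embedding widths are illustrative only.
T, d_text, d_pros, d_pitch = 5, 8, 4, 2

text_emb = rng.normal(size=(T, d_text))        # phoneme-level text embeddings
global_prosody = rng.normal(size=(d_pros,))    # one utterance-level vector (VAE 1)
phoneme_pitch = rng.normal(size=(T, d_pitch))  # one pitch vector per phoneme (VAE 2)

# Broadcast the utterance-level prosody vector to every phoneme position,
# then concatenate all three streams along the feature axis.
global_tiled = np.tile(global_prosody, (T, 1))
decoder_in = np.concatenate([text_emb, global_tiled, phoneme_pitch], axis=1)

print(decoder_in.shape)  # → (5, 14)
```

    The mutual-information term mentioned in the abstract would be an additional training loss encouraging `global_prosody` and `phoneme_pitch` to carry non-overlapping information; it is omitted here since it requires a learned MI estimator.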

Get Citation

GAO Jie, XIAO Dajun, XU Xialing, LIU Shaohan, YANG Qun. Multi-scale Expressive Chinese Speech Synthesis[J].,2023,38(6):1458-1468.

History
  • Received: January 13, 2023
  • Revised: June 28, 2023
  • Online: November 25, 2023