Abstract
Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with an unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, and it faces two challenges: 1) the highly dynamic style features in expressive speech are difficult to model and transfer; and 2) TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a wide range of style conditions, including global speaker and emotion characteristics as well as local (utterance-, phoneme-, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information from the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses state-of-the-art models in audio quality and style similarity. Extension studies on adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting.
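The Mix-Style Layer Normalization mentioned above conditions layer normalization on a style embedding and, during training, perturbs that embedding by mixing it with embeddings of other utterances in the batch. Below is a minimal PyTorch sketch of this idea, assuming the scale and bias are predicted from the style vector by linear projections and that the mixing weight is drawn from a Beta distribution; the module name and these details are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class MixStyleLayerNorm(nn.Module):
    """Sketch of a style-conditioned LayerNorm with batch-wise style mixing.

    Assumptions (not taken from the paper): scale and bias come from linear
    projections of a style embedding, and styles are interpolated across the
    batch with a Beta-distributed weight during training only.
    """

    def __init__(self, hidden_dim: int, style_dim: int, alpha: float = 0.2):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.to_scale = nn.Linear(style_dim, hidden_dim)
        self.to_bias = nn.Linear(style_dim, hidden_dim)
        self.beta = torch.distributions.Beta(alpha, alpha)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        if self.training:
            # Mix each utterance's style with another utterance's style from
            # the same batch, so the content pathway cannot overfit to any
            # single in-domain style.
            perm = torch.randperm(style.size(0), device=style.device)
            lam = self.beta.sample((style.size(0), 1)).to(style.device)
            style = lam * style + (1.0 - lam) * style[perm]
        scale = self.to_scale(style).unsqueeze(1)  # (batch, 1, hidden_dim)
        bias = self.to_bias(style).unsqueeze(1)
        return self.norm(x) * (1.0 + scale) + bias
```

At inference time the module reduces to an ordinary style-conditioned LayerNorm, since the mixing branch only runs when `self.training` is true.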
Parallel Style Transfer
In parallel style transfer, the synthesizer is given an audio clip matching the text it’s asked to synthesize (i.e. the reference and target text are the same).
VCTK dataset
Reference/Target Text: The rainbow is a division of white light into many beautiful colors.
Reference | Reference (voc) | Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech |
---|---|---|---|---|---|---|---|
Reference/Target Text: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.
Reference | Reference (voc) | Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech |
---|---|---|---|---|---|---|---|
ESD dataset
Reference/Target Text: But if you hadn’t done them.
Reference | Reference (voc) | Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech |
---|---|---|---|---|---|---|---|
Reference/Target Text: I say neither yea nor nay.
Reference | Reference (voc) | Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech |
---|---|---|---|---|---|---|---|
Non-Parallel Style Transfer
In non-parallel style transfer, the TTS system must transfer prosodic style even though the reference and target text are completely different. In the examples below, contrast the monotonous prosody of the baselines with syntheses that carry over the narrative style of the reference.
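For concreteness, the sketch below spells out how the two settings differ at request time; the `StyleTransferRequest` container, the `is_parallel` helper, and the placeholder file path are hypothetical names introduced only for illustration and do not come from the GenerSpeech codebase.

```python
from dataclasses import dataclass


@dataclass
class StyleTransferRequest:
    target_text: str     # text to synthesize
    reference_wav: str   # path to the style reference recording
    reference_text: str  # transcript of the reference recording


def is_parallel(req: StyleTransferRequest) -> bool:
    """Parallel transfer: the target text matches the reference transcript."""
    return req.target_text.strip().lower() == req.reference_text.strip().lower()


# Parallel: the model re-speaks the reference transcript in the reference style.
parallel = StyleTransferRequest(
    target_text="I say neither yea nor nay.",
    reference_wav="esd_reference.wav",  # placeholder path
    reference_text="I say neither yea nor nay.",
)

# Non-parallel: the style must carry over to an unrelated sentence.
non_parallel = StyleTransferRequest(
    target_text="I know you.",
    reference_wav="esd_reference.wav",  # placeholder path
    reference_text="I say neither yea nor nay.",
)

assert is_parallel(parallel) and not is_parallel(non_parallel)
```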
VCTK dataset
Reference Text: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.
Reference Audio |
---|
Target Text: We also need a small plastic snake and a big toy frog for the kids.
Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech |
---|---|---|---|---|---|
Reference Text: The rainbow is a division of white light into many beautiful colors.
Reference Audio |
---|
Target Text: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.
Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech |
---|---|---|---|---|---|
ESD dataset
Reference Text: I say neither yea nor nay.
Reference Audio |
---|
Target Text: I know you.
Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech |
---|---|---|---|---|---|
Reference Text: All this we have won by our labour.
Reference Audio |
---|
Target Text: Because he was a man with infinite resource and sagacity.
Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech |
---|---|---|---|---|---|