
Abstract

Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with an unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while facing the following challenges: 1) the highly dynamic style features in expressive voice are difficult to model and transfer; and 2) TTS models should be robust enough to handle diverse OOD conditions that differ from the source data. This paper proposes GenerSpeech, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice. GenerSpeech decomposes the speech variation into style-agnostic and style-specific parts by introducing two components: 1) a multi-level style adaptor to efficiently model a large range of style conditions, including global speaker and emotion characteristics, and local (utterance-, phoneme-, and word-level) fine-grained prosodic representations; and 2) a generalizable content adaptor with Mix-Style Layer Normalization to eliminate style information in the linguistic content representation and thus improve model generalization. Our evaluations on zero-shot style transfer demonstrate that GenerSpeech surpasses state-of-the-art models in terms of audio quality and style similarity. Extension studies on adaptive style transfer further show that GenerSpeech performs robustly in the few-shot data setting.
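To make the content adaptor concrete, below is a minimal numpy sketch of a Mix-Style Layer Normalization step in the spirit described above: the layer-norm scale and bias are predicted from a style embedding, and during training the style embeddings of two utterances are mixed with a Beta-distributed weight so the content representation cannot rely on any single consistent style. All names, shapes, and the projection matrices are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mix_style_layer_norm(h, style_a, style_b, W_gamma, W_beta,
                         alpha=0.2, eps=1e-5):
    """Illustrative Mix-Style Layer Normalization (hypothetical shapes).

    h:        (T, d) hidden content features for one utterance
    style_a:  (s,)   style embedding of the current utterance
    style_b:  (s,)   style embedding of another utterance in the batch
    W_gamma, W_beta: (s, d) projections predicting scale/bias from style
    """
    # Mix the two style embeddings with a Beta-distributed weight, so the
    # predicted scale/bias no longer encode one consistent style identity.
    lam = rng.beta(alpha, alpha)
    style = lam * style_a + (1.0 - lam) * style_b

    # Standard layer normalization over the feature dimension.
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    h_norm = (h - mu) / np.sqrt(var + eps)

    # Style-conditioned scale and bias applied to the normalized features.
    gamma = style @ W_gamma
    beta = style @ W_beta
    return gamma * h_norm + beta

# Toy shapes: 4 frames, 8-dim features, 3-dim style embeddings.
h = rng.standard_normal((4, 8))
sa, sb = rng.standard_normal(3), rng.standard_normal(3)
Wg, Wb = rng.standard_normal((3, 8)), rng.standard_normal((3, 8))
out = mix_style_layer_norm(h, sa, sb, Wg, Wb)
print(out.shape)  # (4, 8)
```

At inference time one would simply skip the mixing step (use `style_a` alone); the randomized mixing is a training-time regularizer for generalization.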

Parallel Style Transfer

In parallel style transfer, the synthesizer is given a reference audio clip that matches the text it is asked to synthesize (i.e., the reference text and target text are identical).

VCTK dataset

Reference/Target Text: The rainbow is a division of white light into many beautiful colors.

Samples: Reference | Reference (voc) | Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech

Reference/Target Text: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.

Samples: Reference | Reference (voc) | Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech

ESD dataset

Reference/Target Text: But if you hadn’t done them.

Samples: Reference | Reference (voc) | Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech

Reference/Target Text: I say neither yea nor nay.

Samples: Reference | Reference (voc) | Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech

Non-Parallel Style Transfer

In non-parallel style transfer, the TTS system must transfer prosodic style when the reference and target text are completely different. The examples below contrast the monotonous prosody of the baselines with long-form synthesis driven by a narrative source style.

VCTK dataset

Reference Text: When the sunlight strikes raindrops in the air, they act as a prism and form a rainbow.

Reference Audio

Target Text: We also need a small plastic snake and a big toy frog for the kids.

Samples: Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech

Reference Text: The rainbow is a division of white light into many beautiful colors.

Reference Audio

Target Text: Six spoons of fresh snow peas, five thick slabs of blue cheese, and maybe a snack for her brother Bob.

Samples: Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech

ESD dataset

Reference Text: I say neither yea nor nay.

Reference Audio

Target Text: I know you.

Samples: Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech

Reference Text: All this we have won by our labour.

Reference Audio

Target Text: Because he was a man with infinite resource and sagacity.

Samples: Mellotron | FG-Transformer | Expressive FastSpeech 2 | Meta-StyleSpeech | Styler | GenerSpeech