End-to-end Adversarial Text-to-Speech

Modern text-to-speech synthesis pipelines typically involve multiple processing stages, each of which is designed or learnt independently from the rest. In this work, we take on the challenging task of learning to synthesise speech from normalised text or phonemes in an end-to-end manner, resulting in models which operate directly on character or phoneme input sequences and produce raw speech audio outputs. Our proposed generator is feed-forward and thus efficient for both training and inference. It uses a differentiable monotonic interpolation scheme to predict the duration of each input token, and learns through a combination of adversarial feedback and soft dynamic time warping-based prediction losses. This model learns to synthesise high-quality speech audio from normalised text or phonemes paired with ground truth speech audio alone, without any additional supervision, achieving a mean opinion score exceeding 4.0 on a 5-point scale.
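To make the two mechanisms named above more concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes that monotonic interpolation can be realised by turning predicted token lengths into token centres via a cumulative sum and then weighting tokens with a softmax over squared distances to each output frame, and it uses the generic soft-DTW recurrence with three predecessor cells. The function names, the `sigma2` and `gamma` values, and the feature shapes are illustrative assumptions.

```python
import numpy as np

def monotonic_interpolation(token_feats, token_lengths, num_frames, sigma2=10.0):
    """Upsample per-token features to per-frame features (illustrative sketch).

    token_feats:   (N, D) array, one feature vector per input token.
    token_lengths: (N,) array of positive predicted durations, in frames.
    sigma2:        assumed kernel width; not taken from the text above.
    """
    ends = np.cumsum(token_lengths)           # end position of each token
    centres = ends - 0.5 * token_lengths      # centre position of each token
    t = np.arange(num_frames, dtype=float)[:, None]    # (T, 1) frame indices
    logits = -((t - centres[None, :]) ** 2) / sigma2   # (T, N)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over tokens
    return weights @ token_feats, weights

def soft_dtw(cost, gamma=0.01):
    """Soft-minimum alignment cost for a (T1, T2) pairwise cost matrix.

    Generic soft-DTW recurrence; gamma is an assumed smoothing temperature.
    """
    n_rows, n_cols = cost.shape
    r = np.full((n_rows + 1, n_cols + 1), np.inf)
    r[0, 0] = 0.0
    for i in range(1, n_rows + 1):
        for j in range(1, n_cols + 1):
            prev = np.array([r[i - 1, j], r[i, j - 1], r[i - 1, j - 1]])
            m = prev.min()  # always finite for i, j >= 1
            softmin = m - gamma * np.log(np.sum(np.exp(-(prev - m) / gamma)))
            r[i, j] = cost[i - 1, j - 1] + softmin
    return r[n_rows, n_cols]
```

Because the token centres increase with the cumulative sum of positive lengths, the most heavily weighted token can only move forward as the frame index grows, which is what makes the alignment monotonic while every step stays differentiable with respect to the predicted lengths. The soft minimum plays the same role for the prediction loss: it lets the generated and reference sequences be compared under small timing differences without a hard, non-differentiable alignment.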

Authors' notes

EATS Samples

EATS (main model)
Ablation: No Phonemes (character input)
Ablation: No RWDs
Ablation: No MelSpecD
Ablation: No Discriminators
Ablation: No Monotonic Interpolation
Ablation: No DTW
Ablation: Single Speaker


EATS, Speaker #1
EATS, Speaker #2
EATS, Speaker #3
EATS, Speaker #4