Audio Samples from "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech"
Abstract: Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural-sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on LJ Speech, a single-speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.
Single Speaker (LJ Speech Dataset)
| Text | that not more than one bottle of wine or one quart of beer could be issued at one time. No account was taken of the amount of liquors admitted in one day, | The prisoner had nothing to deal with but wooden panels, and by dint of cutting and chopping he got both the lower panels out. | have now come into general use and are obviously a great improvement on the ordinary "modern style" in use in England, which is in fact the Bodoni type | At two:thirty-eight p.m., Eastern Standard Time, Lyndon Baines Johnson took the oath of office as the thirty-sixth President of the United States. | The boy declared he saw no one, and accordingly passed through without paying the toll of a penny. |
| Ground Truth | | | | | |
| Tacotron 2 + HiFi-GAN | | | | | |
| Tacotron 2 + HiFi-GAN (fine-tuned) | | | | | |
| Glow-TTS + HiFi-GAN | | | | | |
| Glow-TTS + HiFi-GAN (fine-tuned) | | | | | |
| VITS (DDP) | | | | | |
| VITS | | | | | |
Multi-Speaker (VCTK Dataset)
| Text | The teacher would have approved. | The rainbow is a division of white light into many beautiful colors. | There was great support all round the route. | Brown is an interesting man, but he is not desperate. | Military action is the only option we have on the table today. |
| Ground Truth | | | | | |
| Tacotron 2 + HiFi-GAN | | | | | |
| Tacotron 2 + HiFi-GAN (fine-tuned) | | | | | |
| Glow-TTS + HiFi-GAN | | | | | |
| Glow-TTS + HiFi-GAN (fine-tuned) | | | | | |
| VITS | | | | | |
Voice Conversion
| From\To | VCTK 260 | VCTK 287 | VCTK 247 | VCTK 330 | VCTK 310 |
| VCTK 260 | | | | | |
| VCTK 287 | | | | | |
| VCTK 247 | | | | | |
| VCTK 330 | | | | | |
| VCTK 310 | | | | | |
Speech Variation
How much variation is there?
| VITS | | | | | |
| Tacotron 2 + HiFi-GAN (fine-tuned) | | | | | |
| Glow-TTS + HiFi-GAN (fine-tuned) | | | | | |
| VITS (multi-speaker) | | | | | |
Ablation Study
| Text | that not more than one bottle of wine or one quart of beer could be issued at one time. No account was taken of the amount of liquors admitted in one day, | The prisoner had nothing to deal with but wooden panels, and by dint of cutting and chopping he got both the lower panels out. | have now come into general use and are obviously a great improvement on the ordinary "modern style" in use in England, which is in fact the Bodoni type | At two:thirty-eight p.m., Eastern Standard Time, Lyndon Baines Johnson took the oath of office as the thirty-sixth President of the United States. | The boy declared he saw no one, and accordingly passed through without paying the toll of a penny. |
| Ground Truth | | | | | |
| VITS (300k training) | | | | | |
| w/o Normalizing Flow (300k training) | | | | | |
| w/ Mel-Spectrogram (300k training) | | | | | |