Audio samples from "Quasi-Fully Convolutional Neural Network with Variational Inference for Speech Synthesis"

Authors: Mu Wang, Xixin Wu, Zhiyong Wu, Shiyin Kang, Deyi Tuo, Guangzhi Li, Dan Su, Dong Yu, Helen Meng
Abstract: Recurrent neural networks, such as gated recurrent units (GRUs) and long short-term memory (LSTM) networks, are widely used for acoustic modeling in speech synthesis. However, such sequential generation processes are not well suited to today's massively parallel computing devices. We introduce a fully convolutional neural network (CNN) model for speech synthesis that runs efficiently on parallel processors. To improve the quality of the generated acoustic features, we strengthen our model with variational inference. We also use quasi-recurrent neural networks (QRNNs) to smooth the generated acoustic features. Finally, a high-quality parallel WaveNet model is used to generate audio samples. Our contributions are two-fold. First, we show that CNNs with variational inference can generate highly natural speech on a par with end-to-end models; the use of QRNNs further improves the synthesis quality by reducing trembling in the generated acoustic features while introducing very little runtime overhead. Second, we show some techniques to further speed up the sampling process of the parallel WaveNet model.
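As a rough illustration of why a QRNN layer can reduce trembling at little cost, the sketch below applies the standard QRNN "f-pooling" recurrence, h_t = f_t * h_(t-1) + (1 - f_t) * z_t, to a single feature channel. It is a minimal sketch and not the model from the paper: the function name f_pooling, the toy input, and the constant forget logits are assumptions made for illustration. In an actual QRNN, the candidates z_t and forget gates f_t come from convolutions over the input, so only this cheap element-wise recurrence remains sequential.

```python
import math


def sigmoid(x):
    """Logistic function, used to squash a logit into a gate in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def f_pooling(candidates, forget_logits, h0=0.0):
    """QRNN f-pooling over one feature channel (illustrative helper).

    Computes h_t = f_t * h_{t-1} + (1 - f_t) * z_t, where f_t is the
    sigmoid of the corresponding forget logit. Here the candidates and
    logits are passed in directly; in a real QRNN they would be the
    outputs of convolutions over the input sequence.
    """
    if len(candidates) != len(forget_logits):
        raise ValueError("candidates and forget_logits must have equal length")
    h = h0
    outputs = []
    for z, a in zip(candidates, forget_logits):
        f = sigmoid(a)
        # A gated running average: a larger f keeps more of the past state.
        h = f * h + (1.0 - f) * z
        outputs.append(h)
    return outputs


# Toy example (assumed data): a trajectory that flips sign every frame.
jittery = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
smoothed = f_pooling(jittery, [1.5] * len(jittery))
```

With a forget gate close to 1, each output frame is dominated by the previous state, so frame-to-frame oscillation in the input is damped. With a gate close to 0, the output follows the input almost exactly.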

Quasi-fully convolutional neural network with variational inference (QFCVI) vs. Tacotron (reduction rate = 3)

QFCVI - male speaker
Tacotron - male speaker
QFCVI - female speaker
Tacotron - female speaker

QFCVI - male speaker
Tacotron - male speaker
QFCVI - female speaker
Tacotron - female speaker

QFCVI - male speaker
Tacotron - male speaker
QFCVI - female speaker
Tacotron - female speaker

Further Speed Up Parallel WaveNet (PW)

Full-Precision PW
Mixed-Precision PW
Softsign PW

Full-Precision PW
Mixed-Precision PW
Softsign PW

Full-Precision PW
Mixed-Precision PW

Full-Precision PW
Mixed-Precision PW