Diff-TTS: A Denoising Diffusion Model for Text-to-Speech (3 minutes introduction)

Myeonghun Jeong (Seoul National University, Korea), Hyeongju Kim (Neosapience, Korea), Sung Jun Cheon (Seoul National University, Korea), Byoung Jin Choi (Seoul National University, Korea), Nam Soo Kim (Seoul National University, Korea)

Although neural text-to-speech (TTS) models have attracted a lot of attention and succeeded in generating human-like speech, there is still room for improvement in their naturalness and architectural efficiency. In this work, we propose a novel non-autoregressive TTS model, namely Diff-TTS, which achieves highly natural and efficient speech synthesis. Given the text, Diff-TTS exploits a denoising diffusion framework to transform the noise signal into a mel-spectrogram via diffusion time steps. In order to learn the mel-spectrogram distribution conditioned on the text, we present a likelihood-based optimization method for TTS. Furthermore, to boost the inference speed, we leverage an accelerated sampling method that allows Diff-TTS to generate raw waveforms much faster without significantly degrading perceptual quality. Through experiments, we verified that Diff-TTS generates speech 28 times faster than real time with a single NVIDIA 2080Ti GPU.

A Universal Multi-Speaker Multi-Style Text-to-Speech via Disentangled Representation Learning based on Rényi Divergence Minimization
(3 minutes introduction)

Dipjyoti Paul, Sankar Mukherjee, Yannis Pantazis, Yannis Stylianou

InterSpeech 2021

Related Recordings

Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration
(3 minutes introduction)

A Universal Multi-Speaker Multi-Style Text-to-Speech via Disentangled Representation Learning based on Rényi Divergence Minimization
(3 minutes introduction)
