InterSpeech 2021

Harmonic WaveGAN: GAN-Based Speech Waveform Generation Model with Harmonic Structure Discriminator
(3 minutes introduction)

Kazuki Mizuta (University of Tokyo, Japan), Tomoki Koriyama (University of Tokyo, Japan), Hiroshi Saruwatari (University of Tokyo, Japan)
This paper proposes Harmonic WaveGAN, a GAN-based waveform generation model that focuses on the harmonic structure of speech waveforms. The proposed model uses two discriminators to capture the characteristics of a speech waveform in the time domain and in the frequency domain, respectively. One of them, the harmonic structure discriminator, includes a 2-D convolution layer called “harmonic convolution” to model the harmonic structure of the speech waveform. Although harmonic convolution has been shown to perform well in audio restoration tasks, it has not yet been fully explored in speech synthesis. We therefore aim to improve the perceptual quality of speech synthesized by the waveform generation model and to investigate the usefulness of harmonic convolution for speech synthesis. Mean opinion score tests showed that Harmonic WaveGAN can synthesize more natural speech than the conventional Parallel WaveGAN. We also show that spectrograms of speech synthesized by our model exhibit a clearer harmonic structure than those of speech synthesized by the original Parallel WaveGAN.
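To give a rough idea of what a harmonic convolution computes, the following is a minimal NumPy sketch and not the authors' implementation. It assumes the commonly described form of the operation, in which the output at frequency bin f sums the input at the harmonically related bins k·f/anchor (k = 1, ..., K) over a short time window. The function name `harmonic_conv`, the `anchor` parameter, the single-channel layout, and the nearest-bin rounding are all illustrative assumptions; a real layer would have trainable weights and multiple channels.

```python
import numpy as np

def harmonic_conv(spec, weights, anchor=1):
    """Single-channel harmonic convolution sketch (hypothetical helper).

    spec:    magnitude spectrogram, shape (n_freq, n_time)
    weights: kernel, shape (n_harm, n_taps) with n_taps odd
    anchor:  assumed anchor index; bin f reads from bins k * f / anchor
    """
    n_freq, n_time = spec.shape
    n_harm, n_taps = weights.shape
    half = n_taps // 2
    # Zero-pad along time so the output keeps the input length.
    padded = np.pad(spec, ((0, 0), (half, half)))
    out = np.zeros((n_freq, n_time))
    for f in range(n_freq):
        for k in range(1, n_harm + 1):
            src = int(round(k * f / anchor))  # nearest harmonic bin
            if src >= n_freq:
                break  # higher harmonics fall outside the spectrogram
            for j in range(n_taps):
                # Cross-correlation along time, as in deep-learning "conv".
                out[f] += weights[k - 1, j] * padded[src, j:j + n_time]
    return out

# Toy input: energy only at bins 10, 20, and 30 (harmonics of bin 10).
spec = np.zeros((64, 4))
spec[[10, 20, 30], :] = 1.0
out = harmonic_conv(spec, np.ones((3, 1)))
peak_bin = int(np.argmax(out[:, 0]))
```

In this toy input the response is largest at bin 10, because only there do all three kernel rows line up with energy at the first, second, and third harmonics. An ordinary 2-D convolution over neighboring bins would not single out that bin in the same way.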