A Benchmark of Dynamical Variational Autoencoders applied to Speech Spectrogram Modeling (Oral presentation)

Xiaoyu Bie (LJK (UMR 5224), France), Laurent Girin (GIPSA-lab (UMR 5216), France), Simon Leglaive (IETR (UMR 6164), France), Thomas Hueber (GIPSA-lab (UMR 5216), France), Xavier Alameda-Pineda (LJK (UMR 5224), France)

The Variational Autoencoder (VAE) is a powerful deep generative model that is now extensively used to represent high-dimensional complex data via a low-dimensional latent space learned in an unsupervised manner. In the original VAE model, input data vectors are processed independently. In recent years, a series of papers has presented extensions of the VAE to sequential data that model not only the latent space but also the temporal dependencies within a sequence of data vectors and the corresponding latent vectors, relying on recurrent neural networks. We recently performed a comprehensive review of those models and unified them into a general class called Dynamical Variational Autoencoders (DVAEs). In this paper, we report the results of an experimental benchmark comparing six of those DVAE models on the speech analysis-resynthesis task, as an illustration of the high potential of DVAEs for speech modeling.

InterSpeech 2021

Related Recordings

Fricative Phoneme Detection Using Deep Neural Networks and its Comparison to Traditional Methods
(Oral presentation)

Identification of F1 and F2 in speech using modified zero frequency filtering
(Oral presentation)
