Evaluating the Extrapolation Capabilities of Neural Vocoders to Extreme Pitch Values
Olivier Perrotin (GIPSA-lab (UMR 5216), France), Hussein El Amouri (GIPSA-lab (UMR 5216), France), Gérard Bailly (GIPSA-lab (UMR 5216), France), Thomas Hueber (GIPSA-lab (UMR 5216), France)
Neural vocoders are systematically evaluated on homogeneous train and test databases. This kind of evaluation is efficient for comparing neural vocoders in their “comfort zone”, yet it hardly reveals their limits on data unseen during training. To compare their extrapolation capabilities, we introduce a methodology that quantifies the robustness of neural vocoders when synthesising unseen data, by precisely controlling the ranges of seen/unseen data in the training database. Focusing in this study on the pitch (F₀) parameter, our methodology involves a careful splitting of a dataset to control which F₀ values are seen/unseen during training, followed by both global (utterance-level) and local (frame-level) evaluation of the vocoders. A comparison of four types of vocoders (autoregressive, source-filter, flows, GAN) displays a wide range of behaviours towards unseen input pitch values, including excellent extrapolation (WaveGlow), widely spread F₀ errors (WaveRNN), and systematic generation of the training set median F₀ (LPCNet, Parallel WaveGAN). In contrast, fewer differences between vocoders were observed when using homogeneous train and test sets, demonstrating the potential of, and need for, such an evaluation to better discriminate neural vocoders' abilities to generate out-of-training-range data.
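To make the idea of an F₀-controlled split and a frame-level comparison concrete, the following is a minimal sketch and not the authors' implementation. It assumes F₀ contours are available as NumPy arrays in Hz with 0 marking unvoiced frames, and that an utterance counts as "unseen" if any voiced frame falls outside a chosen range. The function names, the range bounds, and the use of cents as the error unit are illustrative assumptions.

```python
import numpy as np

def split_by_f0(utt_f0, lo_hz, hi_hz):
    """Partition utterance ids by whether their voiced F0 stays in [lo_hz, hi_hz].

    utt_f0: dict mapping utterance id -> 1-D array of F0 in Hz (0 = unvoiced).
    Hypothetical helper: the paper's actual splitting procedure may differ.
    """
    seen, unseen = [], []
    for utt_id, f0 in utt_f0.items():
        voiced = f0[f0 > 0]
        # Any voiced frame outside the range makes the utterance "unseen".
        if np.any((voiced < lo_hz) | (voiced > hi_hz)):
            unseen.append(utt_id)
        else:
            seen.append(utt_id)
    return seen, unseen

def f0_error_cents(f0_ref, f0_syn):
    """Frame-wise F0 error in cents, on frames voiced in both contours.

    Assumes both contours are time-aligned and of equal length.
    """
    both = (f0_ref > 0) & (f0_syn > 0)
    return 1200.0 * np.log2(f0_syn[both] / f0_ref[both])

# Toy data: utterance "b" contains a 400 Hz frame, outside the 100-300 Hz range.
utt_f0 = {
    "a": np.array([0.0, 150.0, 200.0]),
    "b": np.array([0.0, 150.0, 400.0]),
}
seen, unseen = split_by_f0(utt_f0, lo_hz=100.0, hi_hz=300.0)

# A synthesised frame one octave above the reference is a +1200 cent error.
errors = f0_error_cents(np.array([100.0, 0.0, 200.0]),
                        np.array([200.0, 150.0, 200.0]))
```

Under these assumptions, training on the `seen` utterances and synthesising the `unseen` ones gives per-frame errors that can be inspected locally or averaged per utterance for a global score.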