InterSpeech 2021

Vocal Harmony Separation using Time-domain Neural Networks
(longer introduction)

Saurjya Sarkar (Queen Mary University of London, UK), Emmanouil Benetos (Queen Mary University of London, UK), Mark Sandler (Queen Mary University of London, UK)
Polyphonic vocal recordings are an inherently challenging source separation task due to the melodic structure of the vocal parts and unique timbre of its constituents. In this work we utilise a time-domain neural network architecture re-purposed from speech separation research and modify it to separate a capella mixtures at a high sampling rate. We use four-part (soprano, alto, tenor and bass) a capella recordings of Bach Chorales and Barbershop Quartets for our experiments. Unlike current deep learning based choral separation models where the training objective is to separate constituent sources based on their class, we train our model using a permutation invariant objective. Using this we achieve state-of-the-art results for choral music separation. We introduce a novel method to estimate harmonic overlap between sung musical notes as a measure of task complexity. We also present an analysis of the impact of randomised mixing, input lengths and filterbank lengths for our task. Our results show a moderate negative correlation between the harmonic overlap of the target sources and source separation performance. We report that training our models with randomly mixed musically-incoherent mixtures drastically reduces the performance of vocal harmony separation as it decreases the average harmonic overlap presented during training.