InterSpeech 2021

Non-verbal Vocalisation and Laughter Detection using Sequence-to-sequence Models and Multi-label Training
(3-minute introduction)

Scott Condron (Speech Graphics, UK), Georgia Clarke (Speech Graphics, UK), Anita Klementiev (Speech Graphics, UK), Daniela Morse-Kopp (Speech Graphics, UK), Jack Parry (Speech Graphics, UK), Dimitri Palaz (Speech Graphics, UK)
Non-verbal vocalisations (NVVs) such as laughter are a key part of communication in social interactions and carry important information about a speaker’s state or intention. There is still no clear definition of NVVs, nor a clearly defined protocol for transcribing or detecting them. As such, the standard approach has been to focus on detecting a single NVV, such as laughter, and to map all other NVVs to an “other” class. In this paper we hypothesise that, for this task, such an approach hurts performance and that providing more information through additional classes is beneficial. To test this hypothesis, we present studies using sequence-to-sequence deep neural networks in which we include multiple NVV classes rather than mapping them to “other”, and allow more than one label per sample. We show that this approach yields better performance than the standard approach on NVV detection. We also evaluate the same model on laughter detection using frame-based and utterance-based metrics and show that the proposed approach yields state-of-the-art performance on the ICSI corpus.
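
The difference between the two labelling schemes described in the abstract can be illustrated with a minimal sketch. It is only an illustration, not the authors' implementation: the class inventory `NVV_CLASSES` and the helper names `collapse_to_other` and `multi_hot` are hypothetical, since the abstract does not list the NVV classes actually used.

```python
# Hypothetical NVV label inventory (assumption: the abstract does not
# name the classes used in the paper).
NVV_CLASSES = ["laughter", "breath", "cough", "sigh"]


def collapse_to_other(labels, target="laughter"):
    """Standard approach: keep one target NVV, map all others to 'other'."""
    return sorted({lab if lab == target else "other" for lab in labels})


def multi_hot(labels, classes=NVV_CLASSES):
    """Multi-label approach: one binary indicator per NVV class,
    so a sample may carry more than one label."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]


# A sample annotated with two NVVs.
sample = ["laughter", "breath"]
print(collapse_to_other(sample))  # ['laughter', 'other']
print(multi_hot(sample))          # [1, 1, 0, 0]
```

In the collapsed scheme, the identity of every non-laughter NVV is lost, whereas the multi-hot target keeps each class distinct and permits several labels per sample, which is the extra information the abstract hypothesises to be beneficial.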