Synchronising speech segments with musical beats in Mandarin and English singing (3 minutes introduction)

Cong Zhang (Radboud Universiteit, The Netherlands), Jian Zhu (University of Michigan, USA)

Generating synthesised singing voice with models trained on speech data has many advantages due to the models’ flexibility and controllability. However, since information about the temporal relationship between segments and beats is lacking in speech training data, the synthesised singing may sound off-beat at times. Information on the temporal relationship between speech segments and musical beats is therefore crucial. The current study investigated segment-beat synchronisation in singing data, with hypotheses based on the linguistic theories of the P-centre and the sonority hierarchy. A Mandarin corpus and an English corpus of professional singing data were manually annotated and analysed. The results showed that the presence of musical beats depended more on segment duration than on sonority. However, the sonority hierarchy and the P-centre theory were highly related to the location of beats. Mandarin and English demonstrated cross-linguistic variation despite exhibiting common patterns.

InterSpeech 2021

Related Recordings

Unsupervised Training of a DNN-based Formant Tracker
(longer introduction)

Pitch contour separation from overlapping speech
(3 minutes introduction)
