Deep complementary features for speaker identification in TV broadcast data

Mateusz Budnik, Ali Khodabakhsh, Laurent Besacier, Cenk Demiroglu

This work investigates the use of a Convolutional Neural Network (CNN) approach and its fusion with more traditional systems, such as Total Variability Space (TVS), for speaker identification in TV broadcast data. The CNN is trained on spectrograms, while the TVS system is based on MFCC features. The dataset poses several challenges, such as significant class imbalance and background noise and music. Even though the CNN alone performs below the state of the art, it complements the state-of-the-art system and yields better results through fusion. Different fusion techniques are evaluated, covering both early and late fusion.
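To make the distinction between early and late fusion concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the z-normalisation step, the fusion weight, and the toy scores are all assumptions chosen for illustration. Late fusion combines the per-speaker scores produced by two separate systems, while early fusion concatenates their feature vectors before a single classifier is trained.

```python
import numpy as np


def late_fusion(cnn_scores, tvs_scores, weight=0.5):
    """Score-level fusion: weighted sum of per-speaker scores from two systems.

    `weight` is a hypothetical tuning parameter, not a value from the paper.
    """
    def znorm(scores):
        # Normalise each segment's scores so both systems share a comparable scale.
        scores = np.asarray(scores, dtype=float)
        mean = scores.mean(axis=-1, keepdims=True)
        std = scores.std(axis=-1, keepdims=True)
        return (scores - mean) / (std + 1e-12)

    return weight * znorm(cnn_scores) + (1.0 - weight) * znorm(tvs_scores)


def early_fusion(cnn_features, tvs_features):
    """Feature-level fusion: concatenate per-segment feature vectors."""
    return np.concatenate(
        [np.asarray(cnn_features), np.asarray(tvs_features)], axis=-1
    )


# Toy example: 2 segments, 3 candidate speakers (made-up numbers).
cnn_scores = [[0.2, 0.5, 0.3], [0.6, 0.3, 0.1]]
tvs_scores = [[1.0, 3.0, -1.0], [0.5, 2.5, 2.0]]

fused = late_fusion(cnn_scores, tvs_scores, weight=0.4)
predicted = fused.argmax(axis=-1)  # index of the best-scoring speaker per segment
```

In this sketch the fused scores are only meaningful for ranking speakers within a segment; any calibration or learned fusion weights would need to be estimated on held-out data.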

Odyssey 2016

The Speaker and Language Recognition Workshop
