Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition (3 minutes introduction)

Ruirui Li (Amazon, USA), Chelsea J.-T. Ju (Amazon, USA), Zeya Chen (Amazon, USA), Hongda Mao (Amazon, USA), Oguz Elibol (Amazon, USA), Andreas Stolcke (Amazon, USA)

By implicitly recognizing a user from their speech input, speaker identification enables many downstream applications, such as personalized system behavior and expedited shopping checkouts. Depending on whether the speech content is constrained, text-dependent (TD) or text-independent (TI) speaker recognition models may be used. We wish to combine the advantages of both types of models through an ensemble system to make more reliable predictions. However, any such combined approach has to be robust to incomplete inputs, i.e., cases where either the TD or the TI input is missing. As a solution, we propose a fusion of embeddings network (FOEnet) architecture that combines joint learning with neural attention. We compare FOEnet with four competitive baseline methods on a dataset of voice assistant inputs and show that it achieves higher accuracy than the baseline and score fusion methods, especially in the presence of incomplete inputs.
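
The abstract does not spell out how FOEnet's attention is built, so the following is only a generic illustration of the idea of attention-weighted embedding fusion that tolerates a missing input, not the authors' architecture. The function name `attention_fuse` and the scoring vectors `w_td` and `w_ti` are hypothetical; in a trained system such vectors would be learned jointly with the embeddings, and both embeddings are assumed to have already been projected to the same dimension.

```python
import math
from typing import List, Optional, Sequence


def attention_fuse(
    td: Optional[Sequence[float]],
    ti: Optional[Sequence[float]],
    w_td: Sequence[float],
    w_ti: Sequence[float],
) -> List[float]:
    """Fuse TD and TI embeddings with softmax attention (illustrative sketch).

    A value of None marks a missing input. The scoring vectors are
    hypothetical stand-ins for learned attention parameters.
    """
    # Keep only the embeddings that are actually present.
    available = [(e, w) for e, w in ((td, w_td), (ti, w_ti)) if e is not None]
    if not available:
        raise ValueError("at least one of the TD or TI embeddings is required")

    # One scalar attention logit per available embedding (dot product).
    logits = [sum(x * y for x, y in zip(e, w)) for e, w in available]

    # Softmax over the available inputs only, so a missing input
    # contributes nothing to the fused embedding.
    peak = max(logits)
    exps = [math.exp(logit - peak) for logit in logits]
    total = sum(exps)
    weights = [x / total for x in exps]

    # Weighted sum of the available embeddings.
    dim = len(available[0][0])
    return [
        sum(wt * e[i] for wt, (e, _) in zip(weights, available))
        for i in range(dim)
    ]
```

With both inputs present, the result is a convex combination of the two embeddings; with only one present, the softmax runs over a single entry and the function returns that embedding unchanged, which is one simple way to stay well-defined under incomplete inputs.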

InterSpeech 2021

Related Recordings

Graph-based Label Propagation for Semi-Supervised Speaker Identification
(3 minutes introduction)

Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition
(3 minutes introduction)
