|Gautam Bhattacharya and Patrick Kenny
Recently there has significant research interest in using neural networks as feature extractors for text-dependent speaker verification. These types of systems have been shown to perform very well when a large amount of speaker data is available for training. In this work we are interested in testing the efficacy of these methods when only a small amount of training data is available. Google recently introduced an approach that makes use of Recurrent Neural Networks (RNNs) to generate utterance-level or global features for text-dependent speaker verification. This is in contrast to the more established approach of training a Deep Neural Network (DNN) to discriminate between speakers at the frame-level. In this work we explore the DNN (feed forward) and RNN speaker verification paradigms. In the RNN case we propose improvements to the basic model with respect to the small training set available to us. Our experiments show that while both DNNs and RNNs are able to learn the training data, the set used in this study is not large or diverse enough to allow the them to generalize to new speakers. While the DNN models outperform the RNN, both models perform poorly compared to a GMM-UBM system. Nonetheless, we believe this work serves as motivation for the further development of neural network based speaker verification approaches using global features.