|Shashi Kumar (Samsung, India), Shakti P. Rath (Reverie Language Technologies, India), Abhishek Pandey (Samsung, India)|
Speaker adaptation is known to provide significant improvements in speech recognition accuracy. However, in practical scenarios, only a few seconds of audio are available, which may make it infeasible to apply speaker adaptation methods such as i-vectors and fMLLR robustly. Moreover, decoding with an fMLLR transformation requires two passes, which is impractical for real-time applications. In the recent past, mapping speech features from the speaker-independent (SI) space to the fMLLR-normalized space using a denoising autoencoder (DA) has been explored. To the best of our knowledge, such a mapping generally does not yield consistent improvements. In this paper, we show that our proposed joint-VAE-based mapping achieves large improvements over ASR models trained on filterbank SI features. We also show that the joint VAE outperforms the DA by a large margin. We observe relative improvements in word error rate (WER) of 17% over an ASR model trained on filterbank features with i-vectors, and of 23% without i-vectors.
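The abstract describes learning a feed-forward mapping from SI filterbank features to speaker-normalized (fMLLR) features. The sketch below is not the paper's joint-VAE model; it is a minimal NumPy illustration of the simpler DA-style regression idea, with synthetic data standing in for real features (the dimensions, the affine "speaker transform", and all hyperparameters are assumptions for the sake of the example).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumption): 40-dim "SI" frames, and "fMLLR-normalized"
# targets simulated as an affine transform of them.
D, H, N = 40, 64, 2000
A = rng.normal(0, 0.2, (D, D)) + np.eye(D)  # hypothetical per-speaker transform
b = rng.normal(0, 0.1, D)
x_si = rng.normal(0, 1.0, (N, D))           # SI features (synthetic)
y_norm = x_si @ A.T + b                     # target normalized features

# One-hidden-layer regression network x_si -> y_norm (the DA-style mapping).
W1 = rng.normal(0, 0.1, (D, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, D)); b2 = np.zeros(D)

def loss_fn():
    pred = np.tanh(x_si @ W1 + b1) @ W2 + b2
    return ((pred - y_norm) ** 2).sum(axis=1).mean()

init_loss = loss_fn()
lr = 1e-2
for step in range(500):
    h = np.tanh(x_si @ W1 + b1)             # hidden activations
    pred = h @ W2 + b2                      # predicted normalized features
    g_pred = 2 * (pred - y_norm) / N        # gradient of mean squared error
    # Backprop through the two layers.
    g_W2 = h.T @ g_pred; g_b2 = g_pred.sum(0)
    g_h = (g_pred @ W2.T) * (1 - h ** 2)    # tanh derivative
    g_W1 = x_si.T @ g_h; g_b1 = g_h.sum(0)
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

final_loss = loss_fn()
```

At decode time, such a mapping needs only a single pass: each incoming SI frame is pushed through the trained network, avoiding the two-pass fMLLR estimation the abstract identifies as impractical for real-time use.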