Interspeech 2021

Speech Recognition with Next-Generation Kaldi (K2, Lhotse, Icefall)

Sanjeev Khudanpur, Daniel Povey, Piotr Żelasko
This tutorial introduces k2, the cutting-edge successor to the Kaldi speech processing toolkit, consisting of several Python-centric modules for building speech recognition systems, together with its companion projects, Lhotse and Icefall. Participants will learn how to perform efficient data manipulation with Lhotse; how to build and leverage auto-differentiable weighted finite-state transducers (WFSTs) with k2; and how the two can be combined to create PyTorch-based, state-of-the-art hybrid ASR recipes in Snowfall, the precursor to Icefall (a minimal code sketch of this workflow follows the speaker biographies below).

Dr. Daniel Povey is an expert in ASR, best known as the lead author of the Kaldi toolkit and for popularizing discriminative training (now known as "sequence training", in the form of MMI and MPE). He has held research positions at IBM, Microsoft, and Johns Hopkins University, and is now Chief Speech Scientist at Xiaomi Corporation in Beijing, China.

Dr. Piotr Żelasko is an expert in ASR and spoken language understanding, with extensive experience in developing practical and scalable ASR solutions for industrial use. He has worked with successful speech processing start-ups: Techmo (Poland) and IntelligentWire (USA, acquired by Avaya). At present, he is a research scientist at Johns Hopkins University.

Prof. Sanjeev Khudanpur has 25+ years of experience working on almost all aspects of human language technology, including ASR, machine translation, and information retrieval. He has led a number of research projects funded by NSF, DARPA, IARPA, and industry sponsors, and has published extensively. He has trained more than 40 PhD and Master's students to use Kaldi for their dissertation work.
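
The following is a minimal, illustrative sketch (not taken from the tutorial materials) of the kind of workflow mentioned above: manipulating a cut manifest with Lhotse and building an auto-differentiable weighted FSA with k2. The manifest path, arc labels, and scores are placeholder values chosen for illustration.

import k2
from lhotse import CutSet

# Lhotse: load a cut manifest and derive modified views of the data lazily.
# "cuts.jsonl.gz" is a hypothetical manifest path.
cuts = CutSet.from_file("cuts.jsonl.gz")
cuts = cuts.trim_to_supervisions()   # one cut per supervision segment
cuts = cuts.perturb_speed(1.1)       # lazy speed perturbation

# k2: build a small weighted FSA whose arc scores participate in autograd.
# Each line is "src_state dst_state label score"; the last line gives the
# final state, and arcs entering it carry the special label -1.
s = """0 1 10 0.1
0 1 20 0.2
1 2 -1 0.3
2"""
fsa = k2.Fsa.from_str(s)
fsa.requires_grad_(True)

# The total score in the log semiring is differentiable w.r.t. the arc scores.
fsa_vec = k2.create_fsa_vec([fsa])
tot_scores = fsa_vec.get_tot_scores(use_double_scores=False, log_semiring=True)
tot_scores.sum().backward()
print(fsa.scores.grad)  # gradient of the total score w.r.t. each arc score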