Interspeech 2020

Successes, Challenges and Opportunities for Speech Technology in Conversational Agents

Shehzad Mevawalla
Amazon Alexa
Abstract

From the early days of modern ASR research in the 1990s, one of the driving visions of the field has been a computer-based assistant that could accomplish tasks for the user simply by being spoken to. Today, we are close to achieving that vision, with a whole array of speech-enabled AI agents eager to help users. Amazon’s Alexa pioneered the AI assistant concept for smart speaker devices enabled by far-field ASR. It currently supports billions of customer interactions per week, on over 100 million devices, across multiple languages. This keynote will give an overview of the speech technologies that enable Alexa, including wakeword detection, endpointing, speaker identification, and speech recognition, and the interplay between them. We highlight the complexities of combining these technologies into a seamless and robust speech-enabled user experience under large production load and real-time constraints. Interesting algorithmic and engineering challenges arise from choices between deployment in the cloud versus on edge devices, and from trading off latency and memory constraints against accuracy. Adapting recognition systems to trending topics, to changing domain knowledge bases, and to the customer’s personal catalogs adds further complexity, as does the need to support adaptive conversational behavior (such as normal versus whispered speech). We also dive into the unique data aspects of large-scale deployments like Alexa, where a continuous stream of unlabeled data enables successful applications of weakly supervised learning. Finally, we highlight problems for the speech research community that remain to be solved before the promise of a fully natural, conversational assistant is realized.

Shehzad Mevawalla is a Director at Amazon, responsible for automatic speech recognition, speaker recognition, and paralinguistics in Alexa worldwide. Recognition from far-field speech input is a key enabling technology for Alexa, and Shehzad and his team work to advance the state of the art in this area for both cloud and edge devices. A thirteen-year veteran of Amazon, he held a variety of senior technical roles spanning supply chain optimization, marketplace trust and safety, and business intelligence prior to his position with Alexa. Before joining Amazon in 2007, Shehzad was Director of Software at HNC, a company that specialized in financial AI, where he worked on products that used neural networks to detect fraud. Shehzad holds a Master’s degree in Computer Engineering and a Bachelor’s degree in Computer Science, both from the University of Southern California.