InterSpeech 2015

The technology powering personal digital assistants

Ruhi Sarikaya

We have long envisioned that one day computers will understand natural language and anticipate what we need and when we need it to proactively complete tasks on our behalf. As computers get smaller and more pervasive, how humans interact with them is becoming a crucial issue. Despite numerous attempts over the past 40 years to make language understanding an effective and robust natural user interface for computer interaction, success was limited and scoped to applications that are not particularly central to everyday use. However, advances in speech recognition and machine learning, coupled with the emergence of structured data served by content providers and increased computational power have broadened the application of natural language understanding to a wide spectrum of everyday tasks that are central to the user’s productivity. We believe that as computers become smaller and more ubiquitous (e.g. wearable computers) and as the number of applications increases, both system-initiated and user initiated task completion across various applications and services will become indispensable for personal life management and work productivity. There has been already a tremendous investment in the industry (particularly Microsoft, Google, Apple, Amazon and Nuance) around digital personal assistants during the last couple of years. Each of the major companies in the speech and language technology space has a version of their personal assistants (Cortana, Google Now, Siri, Echo and Dragon, respectively) deployed in production. Yet there is not much talked about these technologies and products in any of the speech and language technology conferences. In this talk, we give an overview of personal digital assistants, describe the system design, architecture and the key components behind them. We will highlight challenges and describe best practices related to the bringing personal assistants from laboratories to the real-world and discuss their potential to fully redefine the human-computer interaction moving forward.