Keyword Transformer: A Self-Attention Model for Keyword Spotting
(3 minutes introduction)

Axel Berg (Arm, UK), Mark O’Connor (Arm, UK), Miguel Tairum Cruz (Arm, UK)

The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12-command and 35-command tasks, respectively.
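
To make the idea of a "fully self-attentional" keyword spotter concrete, here is a minimal NumPy sketch of single-head scaled dot-product self-attention applied directly to spectrogram frames, with no convolutional or recurrent encoder in front. It is not the authors' implementation: the frame count, feature size, model width, random weights, and the mean-pooling step are all hypothetical choices made for illustration. Only the 12-class output size is taken from the 12-command task mentioned above.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over a (T, d) sequence."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (T, T) pairwise similarities
    weights = softmax(scores, axis=-1)       # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)

# Hypothetical sizes: 98 time frames, 40 spectral features, model width 64.
T, F, d, num_classes = 98, 40, 64, 12

spectrogram = rng.standard_normal((T, F))  # stand-in for real audio features

# Linear projection of each time frame to a token (assumed embedding scheme).
w_embed = rng.standard_normal((F, d)) / np.sqrt(F)
tokens = spectrogram @ w_embed

# Random, untrained attention weights, for shape illustration only.
w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
attended, weights = self_attention(tokens, w_q, w_k, w_v)

# Mean-pool over time and classify (pooling choice is an assumption).
w_out = rng.standard_normal((d, num_classes)) / np.sqrt(d)
logits = attended.mean(axis=0) @ w_out
```

Every frame attends to every other frame in a single step, which is the property that lets an attention-only model capture context across the whole utterance without stacking convolutions or recurrences.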

Related Recordings

A meta-learning approach for user-defined spoken term classification with varying classes and examples
(3 minutes introduction)

Yangbin Chen, Tom Ko, Jianping Wang

Auxiliary Sequence Labeling Tasks for Disfluency Detection
(3 minutes introduction)

Dongyub Lee, Byeongil Ko, Myeong Cheol Shin, Taesun Whang, Daniel Lee, Eunhwa Kim, Eunggyun Kim, Jaechoon Jo

InterSpeech 2021
