AST: Audio Spectrogram Transformer (3 minutes introduction)

Yuan Gong (MIT, USA), Yu-An Chung (MIT, USA), James Glass (MIT, USA)

In the past decade, convolutional neural networks (CNNs) have been widely adopted as the main building block for end-to-end audio classification models, which aim to learn a direct mapping from audio spectrograms to corresponding labels. To better capture long-range global context, a recent trend is to add a self-attention mechanism on top of the CNN, forming a CNN-attention hybrid model. However, it is unclear whether the reliance on a CNN is necessary, and if neural networks purely based on attention are sufficient to obtain good performance in audio classification. In this paper, we answer the question by introducing the Audio Spectrogram Transformer (AST), the first convolution-free, purely attention-based model for audio classification. We evaluate AST on various audio classification benchmarks, where it achieves new state-of-the-art results of 0.485 mAP on AudioSet, 95.6% accuracy on ESC-50, and 98.1% accuracy on Speech Commands V2.

Shallow Convolution-Augmented Transformer with Differentiable Neural Computer for Low-Complexity Classification of Variable-Length Acoustic Scene
(3 minutes introduction)

Soonshin Seo, Donghyun Lee, Ji-Hwan Kim

InterSpeech 2021

Related Recordings

Event Specific Attention for Polyphonic Sound Event Detection
(3 minutes introduction)

Shallow Convolution-Augmented Transformer with Differentiable Neural Computer for Low-Complexity Classification of Variable-Length Acoustic Scene
(3 minutes introduction)
