InterSpeech 2021

Shallow Convolution-Augmented Transformer with Differentiable Neural Computer for Low-Complexity Classification of Variable-Length Acoustic Scene
(3-minute introduction)

Soonshin Seo (Sogang University, Korea), Donghyun Lee (Sogang University, Korea), Ji-Hwan Kim (Sogang University, Korea)
Convolutional neural networks (CNNs) perform well in low-complexity classification of fixed-length acoustic scenes. However, previous studies have not considered variable-length acoustic scenes, for which performance degradation is prevalent. In this regard, we investigate two novel architectures: the convolution-augmented Transformer (Conformer) and the differentiable neural computer (DNC). Both models perform well on variable-length data but require a large amount of training data. In other words, small datasets, such as those of acoustic scenes, lead to overfitting in these models. In this paper, we propose a shallow convolution-augmented Transformer with a differentiable neural computer (shallow Conformer-DNC) for the low-complexity classification of variable-length acoustic scenes. The shallow Conformer-DNC is able to converge with small amounts of data. The short-term and long-term contexts of variable-length acoustic scenes are learned by the shallow Conformer and the shallow DNC, respectively. Experiments were conducted under variable-length conditions using the TAU Urban Acoustic Scenes 2020 Mobile dataset. As a result, the shallow Conformer-DNC reached a peak accuracy of 61.25% with 34 K model parameters. This performance is comparable to that of state-of-the-art CNNs.
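To give a sense of what a parameter count of 34 K means in storage terms, the following minimal sketch converts a parameter count into a model size. It assumes that "34 K" means 34,000 parameters and that every parameter is stored at one fixed bit width; the abstract does not state the numeric precision, and the function name `model_size_kib` is hypothetical, introduced only for illustration.

```python
def model_size_kib(num_params: int, bits_per_param: int = 32) -> float:
    """Return the storage size in KiB of num_params weights.

    Assumes every parameter uses the same bit width and ignores any
    file-format overhead (both are simplifying assumptions).
    """
    return num_params * bits_per_param / 8 / 1024


# "34 K" from the abstract, read here as 34,000 parameters (assumption).
PARAMS = 34_000

# Bit widths chosen for illustration; the abstract does not specify one.
for bits in (32, 16, 8):
    print(f"{bits}-bit weights: {model_size_kib(PARAMS, bits):.1f} KiB")
```

Under these assumptions, 32-bit weights take 34,000 × 4 bytes = 136,000 bytes, or roughly 133 KiB, and halving the bit width halves the size.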