|Christopher Schymura (Ruhr-Universität Bochum, Germany), Benedikt Bönninghoff (Ruhr-Universität Bochum, Germany), Tsubasa Ochiai (NTT, Japan), Marc Delcroix (NTT, Japan), Keisuke Kinoshita (NTT, Japan), Tomohiro Nakatani (NTT, Japan), Shoko Araki (NTT, Japan), Dorothea Kolossa (Ruhr-Universität Bochum, Germany)|
Sound event localization aims to estimate the positions of sound sources in the environment relative to an acoustic receiver (e.g., a microphone array). Recent advances in this domain have focused most prominently on deep recurrent neural networks. Inspired by the success of transformer architectures as an alternative to classical recurrent neural networks, this paper introduces a novel transformer-based sound event localization framework in which temporal dependencies in the received multi-channel audio signals are captured via self-attention mechanisms. Additionally, the estimated sound event positions are represented as multivariate Gaussian variables, which provides a notion of uncertainty that many previously proposed deep learning-based systems for this application lack. The framework is evaluated on three publicly available multi-source sound event localization datasets and compared against state-of-the-art methods in terms of localization error and event detection accuracy. It outperforms all competing systems on all datasets, with statistically significant differences in performance.
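To make the two central ideas concrete, the following is a minimal NumPy sketch, not the architecture evaluated in the paper. It assumes single-head scaled dot-product self-attention over a sequence of per-frame features, followed by a linear head that outputs the mean and a diagonal variance of a Gaussian over a two-dimensional source position. All dimensions, weight matrices, and function names (`self_attention`, `gaussian_head`, `gaussian_nll`) are hypothetical and chosen only for illustration; the restriction to a diagonal covariance is a simplifying assumption as well.

```python
import numpy as np


def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention over time frames.

    x: (T, D) per-frame features; w_q, w_k, w_v: (D, D_h) projections.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # (T, T) frame-to-frame scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # each row sums to 1
    return weights @ v, weights


def gaussian_head(h, w_mu, w_var):
    """Map each frame to the mean and diagonal variance of a Gaussian."""
    mean = h @ w_mu
    var = np.logaddexp(0.0, h @ w_var)  # softplus keeps variances positive
    return mean, var


def gaussian_nll(target, mean, var):
    """Negative log-likelihood under diagonal Gaussians, averaged over frames."""
    per_dim = 0.5 * (np.log(2.0 * np.pi * var) + (target - mean) ** 2 / var)
    return per_dim.sum(axis=-1).mean()


# Hypothetical sizes: frames, feature dim, attention dim, position dim.
T, D, D_h, P = 50, 16, 8, 2
rng = np.random.default_rng(0)
x = rng.standard_normal((T, D))          # stand-in for multi-channel audio features
w_q, w_k, w_v = (rng.standard_normal((D, D_h)) * 0.1 for _ in range(3))
w_mu = rng.standard_normal((D_h, P)) * 0.1
w_var = rng.standard_normal((D_h, P)) * 0.1

h, attn = self_attention(x, w_q, w_k, w_v)
mean, var = gaussian_head(h, w_mu, w_var)
target = rng.standard_normal((T, P))     # stand-in for ground-truth positions
loss = gaussian_nll(target, mean, var)
```

In this sketch, every frame attends to all other frames, which is how temporal context enters without recurrence, and the predicted variance acts as a per-frame uncertainty estimate: a larger variance lowers the penalty for a position error but is itself penalized by the log term of the likelihood.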