Harshavardhan Sundar (Amazon, USA), Ming Sun (Amazon, USA), Chao Wang (Amazon, USA)
Multi-headed self-attention (MHSA), introduced as a critical building block of the Transformer encoder/decoder module, has made a significant impact in natural language processing (NLP), automatic speech recognition (ASR), and, more recently, sound event detection (SED). Current state-of-the-art approaches to SED employ a shared attention mechanism, realized as a stack of MHSA blocks, to detect multiple sound events. Consequently, in a multi-label SED task, a single common attention mechanism is responsible for generating relevant feature representations for every event to be detected. In this paper, we show through empirical evaluation that dedicating more MHSA blocks to individual events, rather than using a stack of shared MHSA blocks, improves overall detection performance. Interestingly, this improvement arises because the event-specific attention blocks help resolve confusions between co-occurring events. The proposed “Event-specific Attention Network” (ESA-Net) can be trained end to end. On the DCASE 2020 Task 4 data set, the best single ESA-Net model achieves an event-based F1 score of 52.1% on the public validation set, improving over the existing state-of-the-art result.
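The contrast between shared and event-specific attention can be illustrated with a small sketch. It is not the ESA-Net implementation: the abstract does not give the layer layout, so everything here is an assumption made for illustration. That includes using one single-head attention block per event in place of a full MHSA block, the random weights, the sigmoid frame classifier, and the names `make_event_branch`, `event_specific_forward`, and `event_a` to `event_c`. The only idea it takes from the abstract is that each event gets its own attention parameters over the same input features.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over frames x (T x d)."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    k_t = [list(col) for col in zip(*k)]          # d x T
    scale = math.sqrt(len(q[0]))
    scores = matmul(q, k_t)                       # T x T
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)                     # T x d


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def make_event_branch(d, rng):
    """Hypothetical per-event branch: its own attention weights and classifier."""
    return {
        "wq": random_matrix(d, d, rng),
        "wk": random_matrix(d, d, rng),
        "wv": random_matrix(d, d, rng),
        "w_out": random_matrix(d, 1, rng),
    }


def event_specific_forward(x, branches):
    """Run every event's own attention block on the same shared features x."""
    probs = {}
    for event, p in branches.items():
        h = self_attention(x, p["wq"], p["wk"], p["wv"])
        logits = matmul(h, p["w_out"])            # T x 1
        probs[event] = [1.0 / (1.0 + math.exp(-l[0])) for l in logits]
    return probs


rng = random.Random(0)
T, d = 5, 4                                       # frames, feature size (illustrative)
x = random_matrix(T, d, rng)                      # stand-in for shared encoder features
events = ["event_a", "event_b", "event_c"]        # placeholder event labels
branches = {e: make_event_branch(d, rng) for e in events}
probs = event_specific_forward(x, branches)       # per-event, per-frame probabilities
```

In a shared-attention baseline, one set of attention weights would feed all event classifiers. Here each event attends over the frames independently, which is the property the abstract credits with reducing confusion between co-occurring events.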