|Georgios Rizos (Imperial College London, UK), Jenna Lawson (Imperial College London, UK), Zhuoda Han (Imperial College London, UK), Duncan Butler (Imperial College London, UK), James Rosindell (Imperial College London, UK), Krystian Mikolajczyk (Imperial College London, UK), Cristina Banks-Leite (Imperial College London, UK), Björn W. Schuller (Imperial College London, UK)|
We study deep bioacoustic event detection through multi-head attention-based pooling, exemplified by wildlife monitoring. In the multiple instance learning framework, a core deep neural network projects the input acoustic signal into a sequence of embeddings, each representing a segment of the input. Sequence pooling is then required to aggregate the information in this sequence into a single clip-wise representation. We propose an improvement based on Squeeze-and-Excitation mechanisms to a recently proposed audio tagging ResNet, and show that it performs significantly better than the baseline, as well as a collection of other recent audio models. We then further enhance our model through an extensive comparative study of recent sequence pooling mechanisms, achieving our best result with multi-head self-attention followed by concatenation of the head-specific pooled embeddings, which outperforms prediction pooling methods as well as other recent sequence pooling approaches. We perform these experiments on a novel dataset of spider monkey whinny calls that we introduce here, recorded in a rainforest on the South Pacific coast of Costa Rica, with a promising outlook for minimally invasive wildlife monitoring.
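To make the pooling step concrete, the following is a minimal sketch of multi-head attention pooling over a sequence of segment embeddings, with concatenation of the head-specific pooled vectors. It assumes each head is parameterized by a single learned scoring vector; the paper's exact parameterization (and its integration into the ResNet backbone) may differ.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multihead_attention_pool(embeddings, head_vectors):
    """Pool a (T, D) sequence of segment embeddings into one clip-wise
    vector: each head scores the T segments, the scores are turned into
    attention weights, and the head-specific weighted averages are
    concatenated into a single (H * D,) representation."""
    pooled = []
    for w in head_vectors:                 # w: (D,) scoring vector per head
        scores = embeddings @ w            # (T,) relevance score per segment
        alpha = softmax(scores)            # (T,) attention weights, sum to 1
        pooled.append(alpha @ embeddings)  # (D,) attention-weighted average
    return np.concatenate(pooled)          # (H * D,) clip-wise representation

# Illustrative shapes only (random stand-ins for learned parameters).
rng = np.random.default_rng(0)
T, D, H = 10, 8, 4                         # segments, embedding dim, heads
emb = rng.normal(size=(T, D))
heads = [rng.normal(size=D) for _ in range(H)]
clip_vec = multihead_attention_pool(emb, heads)
print(clip_vec.shape)                      # (32,) i.e. H * D
```

The clip-wise vector would then feed a classifier that predicts the presence of a whinny call in the clip.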