0:00:15 This talk presents a small-footprint multichannel keyword spotter. 0:00:19 I will be presenting on behalf of my co-authors.
0:00:26 This is the overview of our presentation. 0:00:29 We will first talk about the motivation for this work. 0:00:32 Then we will go over the 3D-SVDF layer introduced in this paper, followed by the model architecture it is part of. 0:00:45 Next we will discuss the experimental setup, after which we will show some results and cover the final conclusions of the paper. 0:00:54 Finally, we will show some future work we want to do as an extension.
0:01:03 Voice assistants are increasingly popular. 0:01:07 Keywords such as "Hey Google" are commonly used to initiate conversations with these voice systems, 0:01:13 so low-latency keyword spotting becomes the core technological challenge for this task. 0:01:20 Since keyword spotters typically run on embedded devices such as phones and smart speakers, with limited battery, RAM, and compute available, we want to detect the keyword with high accuracy while also keeping the model size and compute cost small. 0:01:36 For these reasons, dual-microphone setups are widely used to increase noise robustness, 0:01:42 so it is interesting to see whether we can integrate this noise robustness as part of an end-to-end neural network architecture.
0:01:55 We first recall the SVDF. 0:01:59 The SVDF idea originated from applying a singular value decomposition to a fully connected weight matrix. 0:02:08 A rank-1 SVDF decomposes the weight matrix into two vectors, as shown in the figure. 0:02:16 These two vectors are interpreted as filters in the feature and time domains: we refer to the filter in the feature domain as alpha, and the filter in the time domain as beta. 0:02:31 At each frame, the input feature vector is first convolved with the 1-D feature filter alpha; 0:02:41 the output of the current stage is then pushed into a memory buffer. 0:02:48 Given a memory size, the buffer holds the outputs of the past stages, 0:02:55 and the states in the memory buffer are convolved with the time-domain filter beta to produce the final output. 0:03:06 This describes a single node of the SVDF; 0:03:10 in practice, several nodes are stacked to achieve the desired output dimensionality.
0:03:18 More concretely, when the SVDF is used as the input layer of a speech model, 0:03:23 the feature filter corresponds to a learned transformation of the filterbank frame, 0:03:29 the memory buffer contains the past filtered frames, 0:03:38 and the time filter's output corresponds to a summary of those past frames. 0:03:44 So far, the SVDF has only been shown to work well on single-channel models in the literature.
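The per-frame mechanics described above can be sketched in a few lines of NumPy. This is a minimal single-node illustration, not the paper's implementation; the names `alpha` and `beta` follow the talk's feature-domain and time-domain filters, and the memory size is simply taken to be `len(beta)`.

```python
import numpy as np

def svdf_node(frames, alpha, beta):
    """One SVDF node (illustrative sketch): apply the feature filter alpha
    per frame, keep the last len(beta) stage outputs in a memory buffer,
    then apply the time filter beta over that buffer."""
    memory_size = len(beta)
    buffer = np.zeros(memory_size)        # memory buffer of past stage outputs
    outputs = []
    for x in frames:                      # x: one feature frame
        stage = float(alpha @ x)          # 1-D convolution with the feature filter
        buffer = np.roll(buffer, -1)
        buffer[-1] = stage                # push the current stage output
        outputs.append(float(beta @ buffer))  # time filter over the buffer
    return np.array(outputs)
```

With `beta = [0, 1]` the node simply passes the current stage output through; with `beta = [1, 0]` it outputs the previous frame's stage, illustrating how beta summarizes the past frames held in memory.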
0:03:52 The 3D-SVDF extends the SVDF's existing two dimensions, feature and time, with a third dimension: channel. 0:04:02 "3D" thus refers to the three dimensions feature, time, and channel. 0:04:09 As an example, the filterbank energies from each channel are fed into the 3D-SVDF, 0:04:15 and each channel learns its own weights in the time and frequency domains. 0:04:22 The outputs of all channels are concatenated after the input layer, so that later layers can learn to fuse and filter them, exploiting the time-delay redundancy between the two channels. 0:04:35 The 3D-SVDF can also be considered as applying an SVDF to each channel independently and simply fusing the results. 0:04:44 This approach enables the network to take advantage of the redundancy in the frequency features from each channel, 0:04:51 but also of the temporal variation across channels, and hence improves noise robustness. 0:04:57 The approach also allows the following layers to act as a learnable signal-filtering module that leverages the multichannel input.
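A minimal sketch of the per-channel view described above, again in NumPy: each channel gets its own feature/time filter pair, and the per-channel outputs are fused by stacking along the last axis. The function names and shapes are illustrative assumptions, not the paper's code.

```python
import numpy as np

def svdf_channel(frames, alpha, beta):
    """Run one SVDF node over a single channel's frame sequence."""
    buf = np.zeros(len(beta))
    out = []
    for x in frames:
        buf = np.roll(buf, -1)
        buf[-1] = alpha @ x              # feature-filter stage for this channel
        out.append(beta @ buf)           # time filter over the memory buffer
    return np.array(out)

def svdf_3d(channels, alphas, betas):
    """3D-SVDF sketch: each channel has its own (alpha, beta) weights, and
    the per-channel outputs are concatenated along the feature axis."""
    per_channel = [svdf_channel(f, a, b) for f, a, b in zip(channels, alphas, betas)]
    return np.stack(per_channel, axis=-1)   # shape: (time, num_channels)
```

This makes the talk's observation concrete: the 3D-SVDF is exactly "one SVDF per channel, then fuse", so the later layers see all channels' filtered features side by side.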
0:05:11 This brings us to the overall architecture built around the 3D-SVDF. 0:05:18 At the input layer, to enable noise-robust learning, 0:05:22 the 3D-SVDF takes the original features of each channel as input 0:05:29 and emits the concatenated per-channel features as output. 0:05:34 A layer immediately follows the 3D-SVDF and sums the features from the channels together, acting as a filter. 0:05:44 Following this first 3D-SVDF, there are two further modules: the encoder and the decoder.
0:05:53 The encoder takes the filtered output of the 3D-SVDF as input 0:05:57 and emits softmax probabilities for the phonemes of the keyword. 0:06:04 The encoder consists of a stack of SVDF layers, 0:06:08 with some fully connected bottleneck layers in between successive SVDF layers 0:06:13 to further reduce the total number of parameters. 0:06:17 The decoder then takes the encoder's results as input 0:06:22 and emits a decision as to whether the utterance contains the keyword or not. 0:06:28 The decoder consists of three SVDF layers stacked directly, with no bottlenecks, 0:06:36 and uses softmax as the final activation. 0:06:41 Training uses a frame-level cross-entropy loss to train the encoder and decoder; 0:06:48 the final model is a one-stage, unified model in which the encoder and decoder are jointly trained.
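To make the data flow concrete, here is a shape-level sketch of the encoder/decoder pipeline. Plain matrix multiplications stand in for the SVDF layers, and all layer sizes (32 input features, a 16-unit bottleneck, 8 phoneme classes, 2 output classes) are made-up illustrative values; only the overall structure — a bottlenecked encoder emitting phoneme posteriors, followed by a decoder emitting a keyword decision via softmax — follows the talk.

```python
import numpy as np

def bottleneck(x, w):
    """Fully connected bottleneck between SVDF layers (ReLU for illustration)."""
    return np.maximum(x @ w, 0.0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, feat = 10, 32                      # frames x concatenated 3D-SVDF features (assumed sizes)
x = rng.normal(size=(T, feat))

# Encoder: matmuls stand in for the SVDF stack, with a bottleneck in between,
# ending in per-frame phoneme posteriors.
h = bottleneck(x, rng.normal(size=(feat, 16)))
phoneme_probs = softmax(h @ rng.normal(size=(16, 8)))      # 8 phoneme classes (assumed)

# Decoder: consumes encoder posteriors, emits keyword / no-keyword probabilities.
keyword_probs = softmax(phoneme_probs @ rng.normal(size=(8, 2)))
```

In the joint one-stage training the talk describes, the cross-entropy loss would be applied per frame to both `phoneme_probs` and `keyword_probs`.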
0:06:59 For the experimental setup, I will talk about the training and testing data sets. 0:07:03 For training data, we use 2.1 million single-channel anonymized utterances containing "Okay Google" and "Hey Google". 0:07:13 To generate additional data from this mono data, 0:07:17 we use a multistyle room simulation with a dual-microphone setup; 0:07:22 the simulations use different room dimensions 0:07:25 and different microphone spacings.
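As a toy stand-in for that room simulation, the following sketch derives a two-channel signal from mono audio by applying only the far-field inter-microphone delay (no reverberation and no room dimensions, both of which the real multistyle simulator does model). All parameter names are assumptions for illustration.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s

def simulate_two_mic(mono, fs, spacing_m, angle_rad):
    """Toy two-mic simulation: copy the mono signal to channel 0 and apply
    the far-field inter-mic delay to channel 1. Assumes the source is on
    mic 0's side (nonnegative, integer-sample delay only)."""
    delay_s = spacing_m * np.cos(angle_rad) / SPEED_OF_SOUND
    delay_n = int(round(delay_s * fs))
    ch0 = mono
    ch1 = np.roll(mono, delay_n)        # np.roll returns a copy
    ch1[:max(delay_n, 0)] = 0.0         # zero the wrapped-around samples
    return np.stack([ch0, ch1])
```

For example, a 0.343 m spacing at 1 kHz sampling and an end-fire source (angle 0) yields exactly a one-sample delay on the second channel.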
0:07:28 For testing data, we generated the keyword-containing utterances in the following way. 0:07:35 First, we generated prompts that randomly embed the keywords. 0:07:39 Then these prompts were spoken by volunteer workers, 0:07:43 and we recorded them with a dual-microphone setup to convert them into multichannel audio. 0:07:49 We then added multichannel noise, recorded with as similar a dual-microphone setup as possible. 0:07:59 In the table you can see the sizes of the different testing data sets used in this experiment.
0:08:07 Furthermore, to evaluate the single-channel baseline model on two-channel audio, 0:08:12 we evaluated two different strategies. 0:08:15 First, we run the keyword detector on either channel one or channel zero, 0:08:22 ignoring the other channel entirely. 0:08:26 Second, we run the single-channel keyword spotter on each channel independently; 0:08:33 given a fixed threshold, we then use a logical OR over the binary outcome of each channel to produce the final result. 0:08:45 These strategies are only used to evaluate the single-channel baseline: 0:08:50 since the 3D-SVDF accepts multichannel input directly, 0:08:54 we use the output of the 3D-SVDF model directly, without such combination strategies. 0:09:02 We also evaluate the single-channel model with a simple broadside beamformer, 0:09:08 to capture any improvement from a conventional signal-enhancement frontend.
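The two baseline strategies and the beamformer can be sketched as follows, assuming the spotter emits a per-frame keyword score; thresholding on the maximum score is an assumption for illustration.

```python
import numpy as np

def detect_single(scores, threshold):
    """Single-channel keyword decision: fire if any frame score crosses the threshold."""
    return bool(np.max(scores) >= threshold)

def detect_logical_or(scores_ch0, scores_ch1, threshold):
    """Strategy 2 from the talk: run the single-channel spotter on each
    channel independently, then OR the binary outcomes."""
    return detect_single(scores_ch0, threshold) or detect_single(scores_ch1, threshold)

def broadside_beamform(ch0, ch1):
    """Simple broadside beamformer for a two-mic array: with zero steering
    delay, delay-and-sum reduces to averaging the channels."""
    return 0.5 * (ch0 + ch1)
```

Strategy 1 is just `detect_single` on one chosen channel; the beamformer variant would run `detect_single` on the spotter's scores for the averaged signal.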
0:09:13 We now present the results. 0:09:19 Keeping the false accept rate fixed at 0.1 false accepts per hour, 0:09:24 we report results in terms of false rejects. 0:09:28 As we can see, given the same model size of around 30k parameters, the proposed two-channel model outperforms the single-channel baseline on both the clean test set and the noisy TV set. 0:09:48 The relative improvement over the two-channel logical-OR combination baseline is 27% and 31% on clean and noisy, respectively.
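The operating point used here — the false-reject rate at a fixed false-accept budget — can be computed from raw detector scores roughly as follows; the score-based thresholding and the tie handling are simplifying assumptions.

```python
import numpy as np

def frr_at_fixed_fa(pos_scores, neg_scores, neg_hours, target_fa_per_hour=0.1):
    """False-reject rate at a threshold chosen so that the negative set
    yields at most target_fa_per_hour false accepts (accept when score > thr)."""
    allowed = int(target_fa_per_hour * neg_hours)   # false accepts we may tolerate
    ranked = np.sort(neg_scores)[::-1]              # highest negative scores first
    # ranked[allowed] leaves exactly `allowed` negatives strictly above it
    # (ignoring ties), so use it as the decision threshold.
    thr = -np.inf if allowed >= len(ranked) else ranked[allowed]
    return float(np.mean(pos_scores <= thr))        # positives that fail to fire
```

Sweeping `target_fa_per_hour` instead of fixing it at 0.1 traces out the ROC curves discussed next.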
0:10:02 To further understand this result, we created ROC curves to confirm the model-quality improvement at various thresholds. 0:10:14 As we can see, our proposed model has the best performance compared with the single-channel model for all strategies discussed, including the beamformer baseline. 0:10:26 On the clean set, the gains from the 3D-SVDF are small but still non-negligible, and the filtering of the 3D-SVDF does not seem to hurt performance in clean conditions.
0:10:44 Since the negative set also consists of multichannel audio, we hypothesize that some of the gains on the clean set come from the 3D-SVDF's learnable filtering avoiding confusable signals on the negative set, thereby suppressing false accepts. 0:11:03 We have seen such false accepts in the past when experimenting with other signal-enhancement techniques.
0:11:10 On the noisy test sets, the gains for the 3D-SVDF are much larger. 0:11:16 We find that the 3D-SVDF model outperforms a baseline single-channel model of comparable size, even when the baseline also includes a basic signal-enhancement technique such as the broadside beamformer, 0:11:31 which in practice does not seem to add much performance on this difficult noisy set. 0:11:45 We therefore hypothesize that the larger gains in noise are a result of the 3D-SVDF's ability to learn better filtering of the multichannel data, and thus to handle the specific conditions of the difficult noisy set better than more general classical techniques such as beamforming.
0:12:10 In conclusion, this paper has proposed a new architecture for keyword spotting that utilizes multichannel microphone input signals to improve detection accuracy 0:12:20 while being able to maintain a small model size. 0:12:26 Compared with a single-channel baseline running in parallel on each channel, the proposed architecture reduces the false reject rate by 27% and 31% relative on two-microphone clean and noisy test sets, respectively, 0:12:42 at a fixed false accept rate.
0:12:48 As for future work, there are ideas from classical signal processing on how to increase noise robustness and accuracy further, 0:13:00 for example, commonly used techniques such as adaptive noise cancellation. 0:13:04 It would be interesting to see if we could further integrate such techniques as part of a learnable, end-to-end neural network architecture.
0:13:16 This concludes our presentation on the small-footprint multichannel keyword spotter.