This talk presents a small-footprint multichannel keyword spotter.

I am presenting on behalf of my co-authors.

This is an overview of our presentation.

We will first talk about the motivation for this work.

Then we will go over the 3D-SVDF layer

introduced in this paper,

followed by the model architecture

of our keyword spotter.

Next, we will discuss the experimental setup.

After that, we will show some results

and cover the final conclusions from this paper.


Finally, we will show some future work that we want to explore.

Voice assistants

are increasingly popular.

Keywords such as "Hey Google" are commonly used to initiate a conversation with the voice assistant.

Keyword spotting with low latency thus becomes the core technological challenge to achieve this task.

Since keyword spotters typically run on embedded devices such as phones and smart speakers, with limited battery, RAM, and compute power,


we want to detect the keyword

with high accuracy,

and we also want to have a small model size and low compute cost.

Besides those reasons,

two-microphone setups are widely used to increase noise robustness.

So it would be interesting to see if we can integrate

noise robustness

as part of an end-to-end neural network architecture.

We first recall the SVDF.

The SVDF idea originated in applying singular value decomposition to a fully connected weight matrix.

For a rank-1 SVDF,

it decomposes the weight matrix into two vectors,

as shown in the figure.

These two vectors can be interpreted as filters in the feature and time domains.

We refer to the filter in the feature domain as the feature filter, and the one in the time domain as the time filter.


At inference, the feature maps from the input layer

are first convolved with the 1-D feature filter.

Then the output of the current stage

is appended to a memory buffer.

Given a memory size T,

the outputs from the past T stages

stored in the memory buffer are then convolved with the 1-D time-domain filter

to produce the final output.

This, so far, describes a single SVDF node.

In practice, there are of course several nodes stacked together to achieve an N-dimensional output.
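To make the two-stage computation just described concrete, here is a minimal NumPy sketch of a single rank-1 SVDF node; the class name, dimensions, and random filter values are illustrative placeholders, not the paper's trained weights.

```python
import numpy as np

class SVDFNode:
    """One rank-1 SVDF node: a 1-D feature filter projects each input
    frame, the result is pushed into a memory buffer, and a 1-D time
    filter is applied over the buffered past outputs.
    Filter values here are random placeholders, not trained weights."""

    def __init__(self, feature_dim, memory_size, seed=0):
        rng = np.random.default_rng(seed)
        self.alpha = rng.standard_normal(feature_dim)  # feature filter
        self.beta = rng.standard_normal(memory_size)   # time filter
        self.memory = np.zeros(memory_size)            # past T stage outputs

    def step(self, frame):
        # Stage 1: convolve (dot) the input frame with the feature filter.
        stage_out = float(self.alpha @ np.asarray(frame))
        # Append to the memory buffer; the oldest entry drops out.
        self.memory = np.roll(self.memory, -1)
        self.memory[-1] = stage_out
        # Stage 2: convolve the buffered outputs with the time filter.
        return float(self.beta @ self.memory)
```

A stack of N such nodes, each with its own pair of filters, would produce the N-dimensional output mentioned above.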

More concretely,

when the SVDF is used as the input layer of a speech model,

the feature filter

corresponds to a transformation of the filterbank energies of each frame,

and the state

will contain the past T filtered frames,

so the time-filter output will correspond to a summary of those past frames.

So far, this describes the SVDF model for a single channel, as found

in the literature.

3D-SVDF

extends these two existing dimensions,

namely feature and time, to a third one.

The "3D" in 3D-SVDF refers to the three dimensions: feature, time,

and channel.

In our example,

filterbank energies from each channel are fed into the 3D-SVDF.

Each channel learns its own weights,

with its own time- and frequency-domain filters.

The outputs of all channels are concatenated

after the input layer, so that the later layers can exploit

the time-delay redundancy between the two channels.

3D-SVDF can also be considered as applying an SVDF to each channel and simply

fusing the results.

This approach enables the network to take advantage of the redundancy in the frequency

features from each channel,

but also to exploit the temporal variation across channels, and hence increase noise robustness.

It also allows the following learnable signal-filtering module

to leverage the multi-channel input.
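The per-channel filtering and concatenation can be sketched as follows; this is a simplified NumPy view under stated assumptions (one rank-1 node per channel, causal zero-padded time filtering), not the paper's implementation, and the function names are hypothetical.

```python
import numpy as np

def rank1_svdf_channel(frames, feat_filter, time_filter):
    """Rank-1 SVDF over one channel: project each [feat_dim] frame with
    the feature filter, then slide the time filter causally over the
    projected sequence (zero-padded history)."""
    stage = np.asarray(frames) @ feat_filter            # [num_frames]
    T = len(time_filter)
    padded = np.concatenate([np.zeros(T - 1), stage])
    return np.array([padded[t:t + T] @ time_filter
                     for t in range(len(stage))])       # [num_frames]

def three_d_svdf(channels, feat_filters, time_filters):
    """3D-SVDF sketch: every channel has its own feature and time
    filters (the third, channel dimension); the per-channel outputs
    are then concatenated along a channel axis for the next layers."""
    outs = [rank1_svdf_channel(x, f, t)
            for x, f, t in zip(channels, feat_filters, time_filters)]
    return np.stack(outs, axis=-1)                      # [num_frames, num_channels]
```

The concatenated output is what lets the following layers fuse information across channels.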

In general,

our architecture

begins with the 3D-SVDF

at the input layer

to enable noise-robust learning.

The 3D-SVDF

takes the original features of each channel as input,

and emits the concatenated features of each channel

as output.

A layer immediately follows the 3D-SVDF

and sums the features from the channels together,

acting as a filter.

Following this first 3D-SVDF layer,

there are two modules: the encoder

and the decoder.

The encoder takes the filtered output of the 3D-SVDF as input,

and emits softmax probabilities for the phonemes that make up the keyword.

The encoder consists of a stack of SVDF layers,

with some fully connected bottleneck layers in between the SVDF layers

to further reduce the total number of parameters.

The decoder then takes

the encoder's results as input,

and emits a binary decision

as to whether the utterance contains the keyword or not.

The decoder consists of three SVDF layers stacked directly, with no bottleneck,

and has softmax as the final activation.

The training uses frame-level cross-entropy loss

to train the encoder and decoder.

The final model is a one-stage, unified model where the encoder and decoder are jointly trained.
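For reference, frame-level cross-entropy of the kind mentioned here can be written in a few lines; the shapes and function name below are illustrative assumptions, not the paper's training code.

```python
import numpy as np

def frame_cross_entropy(logits, labels):
    """Frame-level cross-entropy: softmax over each frame's logits,
    then the mean negative log-probability of that frame's target
    label. Shapes: logits [num_frames, num_classes], labels [num_frames]."""
    logits = np.asarray(logits, dtype=float)
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(log_probs[np.arange(len(labels)), labels]))
```

Averaging over frames, rather than utterances, is what makes the loss frame-level.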

For the experimental setup, I will talk about the training and testing datasets.

For training data,

we use 2.1 million

single-channel anonymized audio clips containing "OK Google"

and "Hey Google".

To generate additional data

from this mono data,

we use multi-style room simulations with a dual-microphone setup.

The simulations use different room dimensions

and four different microphone spacings.

For testing data,

we generated data containing keyword utterances in the following way.

First, we created prompts randomly containing the keywords.

Then these prompts were spoken by paid volunteers,

and we re-recorded them with a dual-microphone setup to convert them into multi-channel audio.

We also added to that set multi-channel noise that we recorded with a similar dual-microphone setup.

In the table, you can see the sizes of the different testing datasets used in this work.


Further, to evaluate the single-channel baseline model on two-channel audio,

we evaluated two different strategies.


First, we run the keyword detector

on only channel one or channel zero,

ignoring the other channel entirely.

Second, we run the single-channel keyword spotter

on each channel independently.

Given a fixed threshold,

we then use a logical OR

on the binary outcome of each channel

to produce the final result.
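This second baseline strategy amounts to a few lines of logic; the function below is a hypothetical sketch, assuming each channel's detector emits a stream of detection scores for the utterance.

```python
def or_combine(scores_ch0, scores_ch1, threshold):
    """Baseline strategy 2: threshold each channel's detector scores
    independently, then take the logical OR of the binary outcomes."""
    fired_ch0 = max(scores_ch0) >= threshold  # channel 0 detection
    fired_ch1 = max(scores_ch1) >= threshold  # channel 1 detection
    return fired_ch0 or fired_ch1
```

Note that OR-combining two detectors trades a lower false reject rate for a higher false accept rate, since either channel alone can trigger.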

These strategies are only used to evaluate the single-channel baseline;

since the 3D-SVDF accepts multi-channel input directly,

we use the output from the 3D-SVDF model directly as well.

Apart from these strategies,

we also evaluate the single-channel model

with a simple broadside beamformer,

to capture any improvement from traditional signal-enhancement approaches.
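In its simplest delay-and-sum form with zero steering delay (the target assumed broadside to the microphone pair), a broadside beamformer is just an average of the two channels; this sketch is an assumption about the baseline front-end, not the paper's exact implementation.

```python
import numpy as np

def broadside_beamform(ch0, ch1):
    """Simplest broadside (delay-and-sum, zero steering delay)
    beamformer: averaging the two channels reinforces a source that is
    broadside to the microphone pair and partially cancels off-axis noise."""
    return 0.5 * (np.asarray(ch0, dtype=float) + np.asarray(ch1, dtype=float))
```

The beamformed single-channel signal is then fed to the single-channel keyword spotter.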

We now present the results. The experiments were performed

keeping the false accept rate fixed at 0.1 FA,

and we report

the results in terms of false rejects.

As we can see,

given the same model size

of around 30k parameters,

the proposed two-channel model outperforms the single-channel baseline

on both

the clean testing set

and the noisy TV set.

The relative improvement over the two-channel logical-OR combination baseline

is twenty-seven percent

and thirty-one percent

on clean and noisy, respectively.

To further understand this result,

we created ROC curves

to confirm that the model quality improvement holds at various thresholds.

As we can see, our proposed model has the best performance

compared with the single-channel model under all of the strategies discussed,

and it also outperforms the beamformer baseline.

On the clean set,

the gains from the 3D-SVDF

are small

but still non-negligible,

and the extra filtering of the 3D-SVDF

does not seem to interfere with performance.

On the clean condition,

for the negative set, which also comes from the two channels,

we hypothesize that some of the gains on the clean set

come from the 3D-SVDF suppressing audio that would otherwise

produce confusable examples in the negative set,

and so suppressing false accepts.

We have seen some such false accepts in the past when experimenting with other signal-processing front-ends.


On the noisy test sets,

the gains from the 3D-SVDF are much larger.

We find that the 3D-SVDF

outperforms the baseline single-channel model of comparable size,

even when the baseline also includes a basic signal-enhancement technique such as the broadside

beamformer,

which in practice does not seem

to add much performance

on the difficult noisy set.

We therefore hypothesize that these larger gains in noise

are a result of the 3D-SVDF's ability

to learn, from the multi-channel data,

better filtering for this specific task,

particularly on the difficult noisy set,

than the more general classical techniques

such as beamforming.

In conclusion, this paper has proposed a new architecture for keyword spotting that utilizes multichannel microphone input

signals to improve detection accuracy

while being able to maintain a small model size.

Compared with a single-channel baseline which runs in parallel on each channel,

the proposed architecture

reduces the false reject rate

by twenty-seven percent and thirty-one percent relative

on two-microphone clean and noisy test sets,

at a fixed

false accept rate.
As for future work,

there are interesting ideas on how to increase robustness and accuracy further,

for example, signal-processing approaches such as

adaptive noise cancellation.

It would be interesting to see if we could further integrate

such techniques

as part of a learnable, end-to-end neural network architecture.

And with this, we conclude our presentation

on the small-footprint multichannel keyword spotter.