Hi everyone.

This is Quan Wang from Google.

Today I'm going to talk about personal VAD, which, as the title shows, is speaker-conditioned voice activity detection.

A big part of this work was done by Shaojin, who was my intern this summer.

First of all, here is a summary of this work.

Personal VAD is a system to detect the voice activity of the target speaker.

The reason we need personal VAD is that it reduces CPU, memory, and battery consumption for on-device speech recognition.

We implement personal VAD as a frame-level voice activity detection system, which uses the target speaker embedding as a side input.

I will start by giving some background.

Most speech recognition systems are deployed on the cloud, but moving ASR to the device is an emerging trend.

This is because on-device ASR does not require an internet connection, and it greatly reduces the latency, because it does not need to communicate with servers.

It also preserves the user's privacy better, because the audio never leaves the device.

On-device ASR is typically used for smartphones or smart-home speakers.

For example, if you simply want to turn on the flashlight of your phone, you should be able to do it even in airplane mode. And if you want to turn on your lights, you should only need access to your local network.

Although on-device ASR is great, there are lots of challenges.

Unlike on servers, we only have a very limited budget of CPU, memory, and battery for ASR.

Also, ASR is not the only program running on the device. For example, on smartphones there are also many apps running in the background.

So an important question is: when do we run ASR on the device? Apparently, it shouldn't always be running.

A typical solution is to use keyword detection, also known as wake word detection, or hotword detection.

For example, "OK Google" is the keyword for Google devices.

Because the keyword detection model is usually very small, it's very cheap and it can be always running.

In contrast, ASR is a much bigger model and very expensive, so we only run it when the keyword is detected.

However, not everyone likes the idea of always having to say a keyword before interacting with the device. Many people wish to be able to directly talk to the device without having to say the keyword first.

So an alternative solution is to use voice activity detection instead of keyword detection.

Like keyword detection models, VAD models are also very small and very cheap to run. So you can have the VAD model always running, and only run ASR when VAD has been triggered.

So how does VAD work?

The VAD model is typically a frame-level binary classifier. For every frame of the speech signal, VAD classifies it into two categories: speech and non-speech. After VAD, we throw away all the non-speech frames and only keep the speech frames.
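To make this concrete, here is a minimal sketch of such a frame-level binary classifier in PyTorch; the feature and layer sizes are illustrative assumptions, not the actual production configuration.

```python
import torch
import torch.nn as nn

class StandardVAD(nn.Module):
    """Minimal frame-level binary VAD sketch: an LSTM over acoustic
    frames, producing one speech/non-speech score per frame."""

    def __init__(self, feature_dim=40, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 2)  # non-speech / speech

    def forward(self, frames):  # frames: (batch, time, feature_dim)
        hidden, _ = self.lstm(frames)
        return self.classifier(hidden)  # logits: (batch, time, 2)

# Keep only frames classified as speech before running ASR.
vad = StandardVAD()
frames = torch.randn(1, 100, 40)          # 100 frames of e.g. log-mel features
speech_mask = vad(frames).argmax(dim=-1)  # 1 = speech, 0 = non-speech
```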

Then we feed the speech frames to downstream components like ASR or speaker recognition. The recognition results will be used for natural language processing, and then to trigger the corresponding actions.

The VAD model helps us reject all the non-speech frames, which saves lots of computational resources.

But is that good enough?

In a realistic scenario, you can talk to the device, but your kids can also talk to it. And if there is a TV in the living room, there will be someone talking on the TV as well.

These are all valid speech signals, so VAD will simply accept all these frames, and ASR will run over them.

For example, if you keep the TV playing, and ASR keeps running on your smartphone, it will soon run out of battery.

So that's why we are introducing personal VAD.

Personal VAD is similar to standard VAD: it is a frame-level classifier. But the difference is that it has three categories instead of two.

We still have the non-speech class, but the other two are target speaker speech, and non-target speaker speech.

Any speech that is not spoken by the target speaker, like other family members or the TV, will be considered non-target speaker speech.
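In code, the only change from standard VAD is the label space; a tiny sketch (the short names ns/tss/ntss follow the paper's notation, the enum itself is just for illustration):

```python
from enum import IntEnum

class PersonalVADClass(IntEnum):
    """The three frame-level classes of personal VAD."""
    NS = 0    # non-speech
    TSS = 1   # target speaker speech
    NTSS = 2  # non-target speaker speech (family members, TV, ...)
```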

The benefit of using personal VAD is that we only run ASR on target speaker speech. This means we will save lots of computational resources when the TV is on, when there are noisy kids or other family members in the user's household, or when the user is at a party.

And to make this work, the key is that the personal VAD model must be tiny and fast, just like a keyword detection or standard VAD model.

Also, the false reject rate must be low, because we want to be responsive to the actual user's requests. The false accept rate should also be low, to really save the computational resources.

When we first released this paper, there were some comments like: oh, this is not new, this is just speaker recognition or speaker diarization.

Here we want to clarify that, no, this is not. Personal VAD is very different from speaker recognition or speaker diarization.

Speaker recognition models usually produce recognition results at utterance level or window level, but personal VAD produces scores at frame level. It is a streaming model and very sensitive to latency.

Speaker recognition models are typically big, usually at least more than five million parameters. Personal VAD is an always-running model; it must be very small, typically less than two hundred thousand parameters.

Speaker diarization needs to cluster and label all the speakers, and the number of speakers is very important. Personal VAD only cares about the target speaker; everyone else is simply represented as non-target speaker.

Next, I will talk about the implementation of personal VAD.

To implement personal VAD, the first question is: how do we know whom to listen to?

Most of these systems ask the users to enroll their voice, and this enrollment is a one-off experience, so its cost can be ignored at runtime.

After enrollment, we will have a speaker embedding, also known as the d-vector, stored on the device.

This embedding can be used for speaker recognition or voice matching; it can also be used as the side input of personal VAD.
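As a rough sketch of how enrollment could produce the cached d-vector (the average-then-normalize scheme and the embed_fn helper are assumptions for illustration, not taken from the paper):

```python
import numpy as np

def enroll_speaker(enrollment_utterances, embed_fn):
    """One-off enrollment: embed each utterance with a speaker
    verification network (embed_fn, assumed given), then average and
    L2-normalize to get a single d-vector to cache on the device."""
    embeddings = [embed_fn(utt) for utt in enrollment_utterances]
    d_vector = np.mean(embeddings, axis=0)
    return d_vector / np.linalg.norm(d_vector)
```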

There are different ways of implementing personal VAD.

The simplest way is to directly combine a standard VAD model and a speaker verification system. We use this as a baseline.

But in this paper, we propose to train a new personal VAD model, which takes the speaker verification score or the speaker embedding as input.

So in total, we implemented four different architectures for personal VAD. I'm going to talk about them one by one.

First, score combination, or SC. This is the baseline model that I mentioned earlier.

We don't train any new model, but just use the existing VAD model and the speaker verification model. If the VAD output is speech, we verify whether this frame comes from the target speaker using the speaker verification model, such that we have three different output classes, just like personal VAD.

Note that this implementation requires running the big speaker verification model at runtime, so it is an expensive solution.
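One plausible way to split the VAD speech probability into the three classes using the verification score, sketched below; the exact combination rule in the paper may differ from this illustration:

```python
def score_combination(vad_speech_prob, verification_score):
    """SC baseline sketch: divide the VAD speech probability into
    target / non-target using the speaker verification score
    (both assumed to be in [0, 1])."""
    return {
        "ns": 1.0 - vad_speech_prob,                      # non-speech
        "tss": vad_speech_prob * verification_score,       # target speaker
        "ntss": vad_speech_prob * (1 - verification_score),
    }
```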

The second one is score conditioned training, or ST.

Here we don't use the standard VAD model, but we still use the speaker verification model. We concatenate the speaker verification score with the acoustic features, and train a new personal VAD model on top of the concatenated features.

This is still very expensive, because we need to run the speaker verification model at runtime.

The third one is embedding conditioned training, or ET. This is really the implementation that we want to use for on-device ASR.

It directly concatenates the target speaker embedding with the acoustic features, and we train a new personal VAD model on the concatenated features. So the personal VAD model is the only model that we need at runtime.
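A minimal sketch of the ET idea in PyTorch, assuming 40-dimensional acoustic features and a 256-dimensional d-vector (both sizes are assumptions for illustration):

```python
import torch
import torch.nn as nn

class EmbeddingConditionedVAD(nn.Module):
    """ET sketch: concatenate the cached target-speaker d-vector with
    every acoustic frame, then classify each frame into 3 classes."""

    def __init__(self, feature_dim=40, embed_dim=256, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim + embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, 3)  # ns / tss / ntss

    def forward(self, frames, d_vector):
        # frames: (batch, time, feature_dim); d_vector: (batch, embed_dim)
        tiled = d_vector.unsqueeze(1).expand(-1, frames.size(1), -1)
        hidden, _ = self.lstm(torch.cat([frames, tiled], dim=-1))
        return self.classifier(hidden)  # logits: (batch, time, 3)
```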

And finally, score and embedding conditioned training, or SET. It concatenates both the speaker verification score and the speaker embedding with the acoustic features.

It uses the most information from the speaker verification system, so it is supposed to be the most powerful. But since it also requires running speaker verification at runtime, it's still not ideal for on-device ASR.

OK, we have talked about the architectures. Now let's talk about the loss functions.

VAD is a classification problem, so standard VAD uses the binary cross entropy loss. Personal VAD has three classes, so naturally we can use the ternary cross entropy loss.

But can we do better than cross entropy?

If you think about the actual use case, both non-speech and non-target speaker speech will be discarded before ASR. So a prediction error between non-speech and non-target speaker speech is actually not a big deal.

We encode this knowledge into our loss function, and propose the weighted pairwise loss.

It is similar to cross entropy, but we use a different weight for each pair of classes. For example, we use a smaller weight of 0.1 between the classes non-speech and non-target speaker speech, and use a larger weight of 1 for the other pairs.
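A sketch of what such a loss can look like for a single frame, using a pairwise logistic term scaled by the pair weight; treat this exact form as an assumption and see the paper for the precise definition:

```python
import torch

# Errors between non-speech (0) and non-target speech (2) matter less,
# so that pair gets weight 0.1; all other pairs default to 1.0.
PAIR_WEIGHTS = {(0, 2): 0.1, (2, 0): 0.1}

def weighted_pairwise_loss(logits, target):
    """Penalize each wrong class k against the true class `target`
    with a pairwise logistic term, scaled by the pair weight.
    logits: tensor of shape (3,); target: int class index."""
    loss = logits.new_zeros(())
    for k in range(logits.numel()):
        if k == target:
            continue
        w = PAIR_WEIGHTS.get((target, k), 1.0)
        loss = loss + w * torch.log1p(torch.exp(logits[k] - logits[target]))
    return loss
```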

Next, I will talk about the experiments.

First, what kind of dataset do we need for training and evaluating personal VAD? An ideal dataset should have these features: it should include realistic and natural speaker turns; it should cover diverse voice conditions; it should have frame-level speaker labels; and finally, it should have enrollment utterances for each target speaker.

Unfortunately, we couldn't find a dataset that satisfies all these requirements. So we made an artificial dataset based on the well-known LibriSpeech dataset.

Remember that we need the frame-level speaker labels. For each LibriSpeech utterance, we have the speaker label, and we also have the ground truth ASR transcript. So we use a pretrained ASR model to force-align the ground truth transcript with the audio, to get the timing of each word. With this timing information, we get the frame-level speaker labels.
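As a sketch of how the word timings can be turned into frame-level labels (the 100 Hz frame rate and the helper name are assumptions):

```python
def frame_labels_from_alignment(word_timings, num_frames, frame_rate_hz=100):
    """Given forced-alignment word timings [(start_sec, end_sec), ...]
    for one single-speaker utterance, mark every frame inside a word
    as speech (1) and everything else as non-speech (0)."""
    labels = [0] * num_frames
    for start_sec, end_sec in word_timings:
        lo = int(start_sec * frame_rate_hz)
        hi = min(int(end_sec * frame_rate_hz), num_frames)
        for t in range(lo, hi):
            labels[t] = 1
    return labels
```

When utterances from different speakers are later concatenated, each speech frame keeps its speaker identity, so frames from the target speaker become tss and the rest become ntss.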

And to have conversational speech, we concatenate utterances from different speakers. We also use a room simulator to add reverberation and noise to the concatenated utterances. This avoids domain overfitting and also mitigates the concatenation artifacts.

Here is the model configuration.

Both standard VAD and personal VAD consist of two LSTM layers and one fully connected layer. The model has 0.13 million parameters in total.

The speaker verification model has three LSTM layers with projection, and one fully connected layer. This model is pretty big, with about five million parameters.
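For reference, a quick way to sanity-check a parameter count against these figures, reusing the hypothetical ET sketch from earlier:

```python
def count_parameters(model):
    """Total number of trainable parameters in a PyTorch model."""
    return sum(p.numel() for p in model.parameters())

print(count_parameters(EmbeddingConditionedVAD()))  # on the order of 0.1M
```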

For evaluation, because this is a classification problem, we use average precision. We look at the average precision for each class, and also the mean average precision. We also look at the metrics both with and without adding the reverberant noise.
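Per-class average precision and the mean average precision can be computed with scikit-learn roughly like this (a sketch; the array shapes are assumptions):

```python
import numpy as np
from sklearn.metrics import average_precision_score

def evaluate(frame_scores, frame_labels, num_classes=3):
    """frame_scores: (num_frames, num_classes) class probabilities;
    frame_labels: (num_frames,) integer class ids. Returns per-class
    average precision and their mean (mAP)."""
    aps = [
        average_precision_score(frame_labels == c, frame_scores[:, c])
        for c in range(num_classes)
    ]
    return aps, float(np.mean(aps))
```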

Next, the results and conclusions.

First, we compare the different architectures. Remember that SC is the baseline that directly combines standard VAD and speaker verification. We find that all the other personal VAD models are better than this baseline.

Among the proposed models, we see that SET, the one that uses both the speaker verification score and the speaker embedding, is the best. This is kind of expected, because it uses the most speaker information.

ET is the personal VAD model that only uses the speaker embedding, and it is the ideal one for on-device ASR. We note that ET is slightly worse than SET, but the difference is small: it is near optimal, yet has only 2.6 percent of the parameters at runtime.

We also compare the conventional cross entropy loss and the proposed weighted pairwise loss. We found that the weighted pairwise loss is consistently better than cross entropy, and the optimal weight between non-speech and non-target speaker speech is 0.1.

Finally, since the ultimate goal of personal VAD is to replace the standard VAD, we also compare the two on standard VAD tasks. In some cases personal VAD is slightly worse, but the differences are very small.

So, the conclusions of this paper.

The proposed personal VAD architectures outperform the baseline of directly combining VAD and speaker verification. Among the proposed architectures, SET has the best performance, but ET is the ideal one for on-device ASR, and it has near-optimal performance.

We also propose the weighted pairwise loss, which outperforms the cross entropy loss.

Finally, personal VAD and standard VAD perform almost equally well on standard VAD tasks.

I will also briefly talk about future work directions.

Currently, the personal VAD model is trained and evaluated on artificial conversations. We hope to really use realistic conversational speech. This will require lots of data collection and labeling efforts.

Besides, personal VAD can be used for speaker diarization, especially when there is overlapping speech in the conversation.

And the good news is that people are already doing this. Researchers from Russia proposed a system known as target-speaker VAD, which is similar to personal VAD, and successfully used it for speaker diarization.

If you like our paper, I would recommend that you read their paper as well.

If you have any questions, please leave a comment on the workshop website. All these materials are available on the website and in our paper.

Thank you.