0:00:17 Hi everyone.
0:00:18 This is Quan Wang from Google, and today I am going to talk about personal VAD,
0:00:23 also known as
0:00:24 speaker-conditioned voice activity detection.
0:00:27 A big part of this work was done by Shaojin Ding, who was my
0:00:31 intern last summer.
0:00:34 First of all, here is a summary of this work.
0:00:37 Personal VAD is a system to detect the voice activity of the target speaker.
0:00:42 The reason we need personal VAD is that
0:00:45 it reduces the CPU, memory, and battery consumption for on-device speech recognition.
0:00:50 We implement personal VAD as a frame-level voice activity detection system
0:00:55 which uses the target speaker embedding as a side input.
0:00:59 I will start by giving some background.
0:01:02 Most speech recognition systems
0:01:04 are deployed on the cloud,
0:01:06 but moving ASR to the device is a growing trend.
0:01:10 This is because
0:01:11 on-device ASR does not require an internet connection, and it greatly reduces the latency
0:01:16 because it does not need to communicate with servers.
0:01:20 It also better preserves the user's privacy, because the audio never leaves the device.
0:01:26 On-device ASR is typically used on smartphones or smart home speakers.
0:01:30 For example,
0:01:31 if you simply want to turn on the flashlight on your phone,
0:01:35 you should be able to do it in airplane mode.
0:01:38 If you want to turn on the lights,
0:01:40 you should only need access to your local network.
0:01:44 While on-device ASR is great,
0:01:47 there are lots of challenges.
0:01:49 Unlike on servers,
0:01:50 we only have a very limited budget of CPU, memory,
0:01:54 and battery
0:01:55 for ASR.
0:01:56 Also,
0:01:56 ASR is not the only program running on the device.
0:02:00 For example, on smartphones there are also many other apps running in the background.
0:02:05 So an important question is:
0:02:07 when do we run ASR on the device? Apparently,
0:02:10 it shouldn't always be running.
0:02:12 A typical solution is to use keyword detection,
0:02:15 also known as wake word detection
0:02:17 or hotword detection.
0:02:19 For example,
0:02:20 "Hey Google"
0:02:21 is the keyword for Google devices.
0:02:24 The keyword detection model is usually very small,
0:02:27 so it's very cheap
0:02:28 and it can be always running.
0:02:30 ASR, in contrast, is a much bigger model.
0:02:32 Running ASR is very expensive,
0:02:34 so we only run it
0:02:35 when the keyword is detected.
0:02:38 However,
0:02:39 not everyone likes the idea of always having to speak a keyword
0:02:43 before interacting with the device.
0:02:45 Many people wish to be able to directly talk to the device
0:02:48 without having to say a predefined keyword first.
0:02:52 So an alternative solution is to use voice activity detection instead of keyword detection.
0:02:57 Like keyword detection models,
0:02:59 VAD models are also very small
0:03:02 and very cheap to run,
0:03:03 so you can have the VAD model always running
0:03:06 and only run ASR when VAD has been triggered.
0:03:11 So how does VAD work?
0:03:13 The VAD model is typically a frame-level binary classifier.
0:03:17 For every frame of the speech signal,
0:03:20 the VAD classifies it into two categories:
0:03:22 speech and non-speech. After VAD,
0:03:26 we throw away all the non-speech frames
0:03:28 and only keep the speech frames.
0:03:30 Then we feed the speech frames to downstream components
0:03:34 like ASR or speaker recognition.
0:03:37 The recognition results will be used for natural language processing
0:03:40 and then trigger different actions.
0:03:43 The VAD model helps us reject all the non-speech frames,
0:03:47 which saves lots of computational resources.
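As a toy illustration of this gating step (the feature shape and the 0.5 threshold are my own assumptions, not from the talk), a standard VAD filter can be sketched as:

```python
import numpy as np

def vad_filter(frames, speech_probs, threshold=0.5):
    """Keep only the frames that a (hypothetical) VAD model scored
    as speech; everything else is dropped before ASR sees it."""
    mask = speech_probs > threshold
    return frames[mask]

# Toy example: 4 frames of 40-dim features, 2 of them scored as speech.
frames = np.zeros((4, 40))
probs = np.array([0.9, 0.1, 0.8, 0.2])
kept = vad_filter(frames, probs)  # only 2 frames reach the ASR model
```

The expensive downstream models then run on `kept` only, which is where the compute savings come from.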
0:03:49 But is it good enough?
0:03:51 In a realistic scenario,
0:03:53 you can talk to the device,
0:03:54 but your family can also talk to you, and if there is a TV in the
0:03:58 living room,
0:03:58 there will be people talking in TV ads.
0:04:01 These are all valid speech signals,
0:04:03 so VAD will simply accept all these frames,
0:04:06 which is a waste of resources.
0:04:09 For example,
0:04:10 if you have the TV playing
0:04:12 and ASR keeps running on your smartphone, the phone will quickly run out of battery.
0:04:18 That's why we are introducing personal VAD.
0:04:22 Personal VAD is similar to standard VAD:
0:04:24 it is a frame-level classifier.
0:04:27 The difference is that it has three categories instead of two.
0:04:31 We still have the non-speech class,
0:04:33 but the other two are target speaker speech and non-target speaker speech.
0:04:38 Any speech that is not spoken by the target speaker,
0:04:41 like other family members
0:04:43 or the TV,
0:04:44 will be considered non-target speaker speech.
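The three-way decision above can be written down as a tiny labeling rule (the class names follow the talk; the function itself is just an illustration):

```python
def personal_vad_class(is_speech, speaker_id, target_id):
    """Map one frame to one of the three personal VAD classes:
    ns   = non-speech,
    tss  = target speaker speech,
    ntss = non-target speaker speech."""
    if not is_speech:
        return "ns"
    return "tss" if speaker_id == target_id else "ntss"
```

Only frames labeled `tss` would be forwarded to ASR.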
0:04:47 The benefit of using personal VAD is that
0:04:51 we only run ASR on target speaker speech.
0:04:54 This means we will save lots of computational resources
0:04:57 when the TV is playing,
0:04:59 when there are
0:05:00 multiple members in the user's household, or when the user is outside.
0:05:05 To make this work,
0:05:06 the key is that
0:05:08 the personal VAD model must be small and fast,
0:05:10 just like a keyword detection
0:05:12 or standard VAD model.
0:05:14 Also,
0:05:15 the false rejects must be low,
0:05:17 because
0:05:17 we want to be responsive to the target user's requests.
0:05:21 The false accepts should also be low,
0:05:23 to really save the computational resources.
0:05:26 When we first released this paper,
0:05:28 there were some comments that this is not new, that this is just
0:05:31 speaker recognition or speaker diarization.
0:05:34 Here we want to clarify:
0:05:36 no, it is not.
0:05:37 Personal VAD is very different from speaker recognition and speaker diarization.
0:05:42 Speaker recognition models usually produce recognition results at utterance level
0:05:46 or window level,
0:05:48 but personal VAD produces output scores at frame level.
0:05:51 It is a streaming model and very sensitive to latency.
0:05:55 Speaker recognition models can be big, usually with more than five million parameters.
0:06:01 Personal VAD is an always-running model; it must be very small,
0:06:05 typically fewer than two hundred thousand parameters.
0:06:08 Speaker diarization needs to cluster all the speakers,
0:06:11 and the number of speakers is very important.
0:06:14 Personal VAD only cares about the target speaker;
0:06:17 everyone else is simply represented as
0:06:19 non-target speaker.
0:06:22 Next, I will talk about the implementation of personal VAD.
0:06:26 To implement personal VAD,
0:06:28 the first question is:
0:06:29 how do we know whom to listen to?
0:06:32 Well, speech systems usually ask the user to enroll their voice,
0:06:36 and this enrollment is a one-off experience,
0:06:38 so its cost can be ignored at runtime.
0:06:41 After enrollment,
0:06:42 we will have a speaker embedding,
0:06:44 also known as a d-vector,
0:06:47 stored on the device.
0:06:48 This embedding can be used for speaker recognition,
0:06:50 or Voice Match,
0:06:52 so naturally it can also be used as the side input of personal VAD.
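One common enrollment recipe (a minimal sketch under my own assumptions, not necessarily the exact pipeline used in this work) is to average the per-utterance embeddings and L2-normalize the result:

```python
import numpy as np

def enroll_dvector(utterance_embeddings):
    """Aggregate per-utterance speaker embeddings into a single
    enrollment d-vector by averaging and L2-normalizing."""
    d = np.mean(utterance_embeddings, axis=0)
    return d / np.linalg.norm(d)

# Toy example: three 256-dim embeddings from enrollment utterances.
rng = np.random.default_rng(0)
d_vector = enroll_dvector(rng.normal(size=(3, 256)))
```

The resulting unit-norm vector is what gets stored on the device and reused at every runtime frame.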
0:06:58 There are different ways of implementing personal VAD.
0:07:01 The simplest way is to directly combine a standard VAD model and a speaker
0:07:06 verification system.
0:07:07 We use this as a baseline.
0:07:09 But in this paper, we propose to train a new personal VAD model
0:07:13 which takes the speaker verification score
0:07:16 or the speaker embedding as input.
0:07:19 We implemented four different architectures for personal VAD, and I am going to talk
0:07:24 about them one by one.
0:07:26 First,
0:07:27 score combination (SC).
0:07:28 This is the baseline model that I mentioned earlier.
0:07:31 We don't train any new model,
0:07:33 but just use the existing VAD model and the speaker verification model.
0:07:38 If the VAD output is speech,
0:07:40 we verify whether this frame
0:07:42 comes from the target speaker using the speaker verification model,
0:07:45 such that we have three different output classes,
0:07:48 like personal VAD.
0:07:50 Note that
0:07:51 this implementation requires running the big speaker verification model at runtime,
0:07:56 so it is an expensive solution.
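One simple way to fuse the two existing model outputs into three class scores (this exact combination rule is my assumption; the paper's SC baseline may differ in details) is:

```python
def score_combination(p_speech, sv_score):
    """Combine a standard VAD speech probability with a speaker
    verification score (both assumed to be in [0, 1]) into scores
    for the three personal VAD classes."""
    return {
        "ns": 1.0 - p_speech,            # non-speech
        "tss": p_speech * sv_score,       # target speaker speech
        "ntss": p_speech * (1.0 - sv_score),  # non-target speech
    }

scores = score_combination(0.8, 0.9)
```

No training is needed, but the big verification model must run on every speech frame.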
0:07:58 Second,
0:07:59 score conditioned training (SCT). Here we don't use the standard VAD model,
0:08:04 but still use the speaker verification model.
0:08:07 We concatenate the speaker verification score
0:08:09 with the acoustic features, and train a new personal VAD model
0:08:13 on top of the concatenated features.
0:08:16 This is still very expensive, because we need to run the speaker verification model at
0:08:20 runtime.
0:08:23 Third, embedding conditioned training (ET).
0:08:25 This is really the implementation that we want to use for on-device ASR.
0:08:29 It directly concatenates the target speaker embedding with the acoustic features,
0:08:34 and we train a new personal VAD model on the concatenated features.
0:08:38 So the personal VAD model
0:08:40 is the only model that we need at runtime.
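The ET input can be sketched as tiling the stored d-vector onto every acoustic frame (the 40-dim acoustic features and 256-dim d-vector sizes here are my assumptions for illustration):

```python
import numpy as np

def et_inputs(acoustic_frames, d_vector):
    """Concatenate the same enrollment d-vector to every acoustic
    frame. The result feeds the personal VAD network directly, so
    no speaker verification model is needed at runtime."""
    tiled = np.tile(d_vector, (acoustic_frames.shape[0], 1))
    return np.concatenate([acoustic_frames, tiled], axis=1)

# 100 frames of 40-dim features + a 256-dim d-vector per frame.
features = et_inputs(np.zeros((100, 40)), np.ones(256))
```

Because the embedding is precomputed at enrollment, the only runtime cost is the small personal VAD network itself.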
0:08:44 And finally, score and embedding conditioned training (SET). It concatenates
0:08:49 both the speaker verification score
0:08:50 and the embedding
0:08:51 with the acoustic features,
0:08:53 so it uses the most information from the speaker verification system and is supposed
0:08:58 to be the most powerful.
0:09:00 But since it also requires running speaker verification at runtime,
0:09:04 it is still
0:09:05 not ideal for on-device ASR.
0:09:08 Okay, we have talked about architectures; let's talk about the loss function.
0:09:13 VAD is a classification problem,
0:09:16 so standard VAD uses binary cross entropy. Personal VAD has three classes, so naturally
0:09:22 we can use ternary cross entropy.
0:09:25 But can we do better than cross entropy? If you think about the actual use
0:09:30 case,
0:09:31 both non-speech and non-target speaker speech
0:09:34 will be discarded for ASR.
0:09:36 So if you make a prediction error
0:09:38 between non-speech
0:09:40 and non-target speaker speech, it is actually not a big deal.
0:09:43 We encode this knowledge into our loss function,
0:09:47 and propose the weighted pairwise loss.
0:09:51 It is similar to cross entropy,
0:09:53 but we use different weights for different pairs of classes.
0:09:57 For example,
0:09:58 we use a smaller weight of 0.1 between the classes
0:10:02 non-speech
0:10:02 and non-target speaker speech,
0:10:04 and use a larger weight of 1.0 for the other pairs.
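A minimal sketch of a weighted pairwise loss for a single frame, built from pairwise logistic losses over the logits (the exact formulation in the paper may differ in details):

```python
import numpy as np

# Class indices: 0 = non-speech, 1 = target speaker speech,
# 2 = non-target speaker speech.
PAIR_WEIGHTS = np.ones((3, 3))
PAIR_WEIGHTS[0, 2] = PAIR_WEIGHTS[2, 0] = 0.1  # cheap confusion pair

def weighted_pairwise_loss(logits, label):
    """Sum of pairwise logistic losses between the true class and
    every other class, each scaled by its class-pair weight."""
    loss = 0.0
    for k in range(len(logits)):
        if k == label:
            continue
        # Penalty for ranking class k above the true class `label`.
        margin = logits[label] - logits[k]
        loss += PAIR_WEIGHTS[label, k] * np.log1p(np.exp(-margin))
    return loss

# Confusing non-speech with non-target speech is penalized less
# than confusing non-speech with target speaker speech.
cheap = weighted_pairwise_loss(np.array([0.0, 0.0, 3.0]), 0)
costly = weighted_pairwise_loss(np.array([0.0, 3.0, 0.0]), 0)
```

Setting all pair weights to 1.0 recovers an unweighted pairwise loss, so the 0.1 entry is the only place the ASR-centric prior enters.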
0:10:11 Next, I will talk about the experiments.
0:10:15 An ideal dataset for training and evaluating personal VAD
0:10:19 should have these features:
0:10:20 it should include realistic and natural speaker turns;
0:10:24 it should cover diverse acoustic conditions;
0:10:27 it should have frame-level speaker labels; and it should have enrollment utterances
0:10:31 for each target speaker.
0:10:33 Unfortunately,
0:10:34 we cannot find a dataset that satisfies all these requirements,
0:10:39 so we actually made an artificial dataset based on the well-known LibriSpeech dataset.
0:10:45 Remember that we need frame-level speaker labels.
0:10:48 For each LibriSpeech utterance, we have the speaker label;
0:10:52 we also have the ground truth ASR transcript.
0:10:55 So we use a speech recognition model to force-align the ground truth transcript
0:11:00 with the audio, and get the timing of each word. With this timing information,
0:11:05 we can get the frame-level speaker labels.
0:11:08 And to have conversational speech, we concatenate utterances from different speakers.
0:11:14 We also use a room simulator to add reverberant noise
0:11:18 to the concatenated utterances.
0:11:20 This avoids domain overfitting and also mitigates the concatenation artifacts.
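The concatenation step with frame-level labels can be sketched as follows (the frame counts, feature dimension, and speaker IDs here are made up for illustration):

```python
import numpy as np

def concat_utterances(utterances):
    """utterances: list of (frames [T_i, D], speaker_id) pairs.
    Returns the concatenated frames and a per-frame speaker label
    array, which is exactly the supervision personal VAD needs."""
    frames = np.concatenate([f for f, _ in utterances], axis=0)
    labels = np.concatenate(
        [np.full(f.shape[0], spk) for f, spk in utterances])
    return frames, labels

# Two single-speaker utterances become one two-speaker "conversation".
frames, labels = concat_utterances(
    [(np.zeros((50, 40)), 1), (np.zeros((30, 40)), 2)])
```

Room simulation would then be applied on top of `frames`, while `labels` stays unchanged.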
0:11:27 Here is the model configuration.
0:11:29 Both standard VAD and personal VAD consist of two LSTM
0:11:33 layers
0:11:34 and one fully connected layer.
0:11:36 The model has 0.13 million parameters in total.
0:11:40 The speaker verification model has three LSTM layers
0:11:43 with projection
0:11:44 and one fully connected layer.
0:11:46 This model is pretty big,
0:11:48 with about five million parameters.
0:11:51 For evaluation,
0:11:52 because this is a classification problem,
0:11:55 we use average precision.
0:11:57 We look at the average precision for each class, and also the mean average precision.
0:12:02 We also look at the metrics both with and without added reverberant noise.
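For reference, per-class average precision can be computed from frame scores like this (a plain numpy version of the standard non-interpolated metric; mean average precision is just its mean over the three classes):

```python
import numpy as np

def average_precision(scores, labels):
    """Mean of precision@k over the ranks of the true positives,
    with frames sorted by descending score."""
    order = np.argsort(-scores)          # descending by score
    labels = np.asarray(labels)[order]
    hits = np.cumsum(labels)             # positives seen so far
    ranks = np.arange(1, len(labels) + 1)
    precisions = hits / ranks
    return precisions[labels == 1].mean()

# Toy check: positives ranked 1st and 3rd -> AP = (1 + 2/3) / 2.
ap = average_precision(np.array([0.9, 0.8, 0.7, 0.6]),
                       np.array([1, 0, 1, 0]))
```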
0:12:08 Next, results and conclusions.
0:12:12 First,
0:12:12 we compare the different architectures.
0:12:15 Remember that SC is the baseline that directly combines standard VAD
0:12:21 and speaker verification,
0:12:23 and we find that all the other personal
0:12:25 VAD models are better than this baseline.
0:12:28 Among the proposed models, SET, the one that uses both the speaker
0:12:33 verification score
0:12:34 and the speaker embedding, is the best.
0:12:37 This is kind of expected, because it uses the most speaker information.
0:12:42 ET is the personal VAD model
0:12:44 that only uses the speaker embedding, and is ideal for on-device ASR. We note that
0:12:49 ET is slightly worse than SET, but the difference is very small: it
0:12:53 is near optimal, but has only 2.6 percent of the parameters at runtime.
0:12:59 We also compare the conventional cross entropy loss
0:13:02 and the proposed weighted pairwise loss.
0:13:05 We found that the weighted pairwise loss is consistently better than cross entropy, and
0:13:11 the optimal weight between non-speech
0:13:13 and non-target speaker speech is 0.1.
0:13:17 Finally, since the ultimate goal of personal VAD
0:13:21 is to replace the standard VAD,
0:13:23 we compare the two on the standard VAD task.
0:13:26 In some cases personal VAD is slightly worse,
0:13:30 but the differences are very small.
0:13:33 So, the conclusions of this paper.
0:13:35 The proposed personal VAD architectures outperform the baseline of directly combining VAD and
0:13:42 speaker verification.
0:13:43 Among the proposed architectures,
0:13:45 SET has the best performance, but ET is the ideal one for
0:13:50 on-device ASR,
0:13:51 with near-optimal performance.
0:13:54 We also propose the weighted pairwise loss,
0:13:57 which outperforms the cross entropy loss.
0:13:59 Finally, personal VAD and standard VAD perform almost equally well on standard VAD
0:14:05 tasks.
0:14:07 I will also briefly talk about future work directions.
0:14:11 Currently, the personal VAD model is trained and evaluated on artificial conversations.
0:14:17 We should ideally use realistic conversational speech.
0:14:20 This will require lots of data collection and annotation efforts.
0:14:24 Besides, personal VAD can be useful for speaker diarization,
0:14:28 especially when there is overlapped speech in the conversation.
0:14:32 And the good news is that people are already doing it:
0:14:35 researchers from Russia proposed a system known as target-speaker VAD,
0:14:41 which is similar to personal VAD,
0:14:43 and successfully used it for speaker diarization.
0:14:46 If you like our paper,
0:14:47 I would recommend that you read their paper as well.
0:14:51 If you have questions,
0:14:52 please leave a comment.
0:14:54 You can find these slides and our paper on the conference website.
0:14:58 Thank you.