0:00:17 Hi everyone, this is Quan Wang from Google. Today I'm going to talk about personal VAD, which stands for speaker-conditioned voice activity detection. A big part of this work was done by Shaojin, who was my intern last summer.
0:00:34 First of all, here is a summary of this work.
0:00:37 Personal VAD is a system to detect the voice activity of the target speaker. The reason we need personal VAD is that it reduces CPU, memory, and battery consumption for on-device speech recognition. We implement personal VAD as a frame-level voice activity detection system, which uses the target speaker embedding as a side input.
0:00:59 I will start by giving some background. Most speech recognition systems are deployed on the cloud, but we are now bringing ASR to the device.
0:01:10 This is because on-device ASR does not require an internet connection, and it greatly reduces the latency, because it does not need to communicate with servers. It also preserves the user's privacy better, because the audio never leaves the device.
0:01:26 On-device ASR is usually used for smartphones or smart-home speakers. For example, if you simply want to turn on the flashlight of your phone, you should be able to do it in airplane mode. If you want to turn on the lights, you only need access to your local network.
0:01:44 Although on-device ASR is great, there are lots of challenges. Unlike servers, we only have a very limited budget of CPU, memory, and battery for ASR. Also, ASR is not the only program running on the device; for example, on smartphones there are also many apps running in the background.
0:02:05 So an important question is: when do we run ASR on the device? Apparently, it shouldn't always be running.
0:02:12 A typical solution is to use keyword detection, also known as wake word detection or hotword detection. For example, "OK Google" is the keyword for Google devices. Because the keyword detection model is usually very small, it's very cheap and it can be always running. ASR, in contrast, is very expensive, so we only run it when the keyword is detected.
0:02:38 However, not everyone likes the idea of always having to say a keyword before interacting with the device. Many people wish to be able to talk to the device directly, without having to say the keyword first.
0:02:52 So an alternative solution is to use voice activity detection instead of keyword detection. Like keyword detection models, VAD models are also very small and very cheap to run, so you can have the VAD model always running, and only launch ASR when VAD has been triggered.
0:03:11 So how does VAD work? The VAD model is typically a frame-level binary classifier: for every frame of the speech signal, VAD classifies it into one of two categories, speech and non-speech. After VAD, we throw away all the non-speech frames and only keep the speech frames. Then we feed the speech frames to downstream components like ASR or speaker recognition. The recognition results will be used for natural language processing, and then trigger different actions. The VAD model helps us reject all the non-speech frames, which saves lots of computational resources.
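To make the frame-filtering idea concrete, here is a minimal sketch of how a frame-level VAD can gate the frames that reach ASR. The feature shape, the score source, and the 0.5 threshold are illustrative assumptions, not details from the talk.

```python
# Minimal sketch: keep only the frames a frame-level VAD marks as speech.
import numpy as np

def filter_speech_frames(features: np.ndarray,
                         speech_probs: np.ndarray,
                         threshold: float = 0.5) -> np.ndarray:
    """features: (num_frames, feat_dim); speech_probs: (num_frames,)."""
    is_speech = speech_probs > threshold   # binary speech / non-speech decision
    return features[is_speech]             # only these frames are fed to ASR

# Toy usage: 100 frames of 40-dim features with placeholder VAD scores.
features = np.random.randn(100, 40)
speech_probs = np.random.rand(100)
asr_input = filter_speech_frames(features, speech_probs)
```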
0:03:49 But this is not good enough. In a realistic scenario, you can talk to the device, but other people can also be talking, and if there is a TV in the living room, there will be someone talking on the TV. These are all valid speech signals, so VAD will simply accept all these frames, and ASR will run over them. For example, if you have the TV playing and ASR keeps running on your smartphone, it will quickly drain the battery.
0:04:18 So that's why we are introducing personal VAD. Personal VAD is similar to the standard VAD: it is a frame-level classifier. But the difference is that it has three categories instead of two. We still have the non-speech class, but the other two are target speaker speech and non-target speaker speech. Any speech that is not spoken by the target speaker, like other family members or the TV, will be considered non-target speaker speech.
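As a tiny illustration, the three personal VAD classes can be written as a label mapping; the paper abbreviates them as ns, tss, and ntss, but the constant names here are my own.

```python
# Personal VAD's three frame-level classes (constant names are illustrative).
NS = 0    # non-speech: silence, background noise
TSS = 1   # target speaker speech: the enrolled user
NTSS = 2  # non-target speaker speech: family members, TV, etc.
```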
0:04:47 The benefit of using personal VAD is that we only run ASR on target speaker speech. This means we will save lots of computational resources when the TV is on, when there are other family members in the user's household, or when the user is in a noisy environment.
0:05:05 To make this work, the key is that the personal VAD model must be tiny and fast, just like a keyword detection or standard VAD model. Also, the false reject rate must be low, because we want to be responsive to the target user's requests. The false accept rate should also be low, to really save the computational resources.
0:05:26 When we first released this paper, there were some comments saying this is not new, that it is just speaker recognition or speaker diarization. Here we want to clarify that no, it is not. Personal VAD is very different from speaker recognition and speaker diarization.
0:05:42 Speaker recognition models usually produce recognition results at the utterance level or window level, but personal VAD produces scores at the frame level; it is a streaming model and very sensitive to latency. Speaker recognition models are typically big, usually at least five million parameters. Personal VAD is an always-running model, so it must be very small, typically less than two hundred thousand parameters.
0:06:08 Speaker diarization needs to cluster all the speakers, and the number of speakers is very important. Personal VAD only cares about the target speaker; everyone else is simply represented as non-target speaker.
0:06:22 Next, I will talk about the implementation of personal VAD. To implement personal VAD, the first question is: how do we know whom to listen to? Voice assistant systems usually ask the users to enroll their voice, and this enrollment is a one-off experience, so its cost can be ignored at runtime.
0:06:41 After enrollment, we will have a speaker embedding, also known as a d-vector, stored on the device. This embedding can be used for speaker recognition or speaker verification; it can also be used as the side input of personal VAD.
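A hedged sketch of the enrollment step: average per-utterance embeddings from a speaker encoder into a single d-vector. The speaker_encoder function and the L2 normalization are assumptions for illustration, not the production system.

```python
# Sketch: derive one enrollment d-vector from a few enrollment utterances.
import numpy as np

def enroll(utterances, speaker_encoder):
    """utterances: list of waveforms; speaker_encoder: waveform -> (emb_dim,)."""
    embeddings = np.stack([speaker_encoder(u) for u in utterances])
    d_vector = embeddings.mean(axis=0)            # aggregate the utterances
    return d_vector / np.linalg.norm(d_vector)    # stored on the device
```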
0:06:58 There are different ways of implementing personal VAD. The simplest way is to directly combine a standard VAD model and a speaker verification system; we use this as a baseline. But in this paper, we propose to train a new personal VAD model, which takes the speaker verification score or the speaker embedding as input.
0:07:19 In total, we implemented four different architectures for personal VAD, and I'm going to talk about them one by one.
0:07:26 First, score combination (SC). This is the baseline model that I mentioned earlier. We don't train any new model, but just use the existing VAD model and the speaker verification model. If the VAD output is speech, we verify whether this frame comes from the target speaker using the speaker verification model, so that we have three different output classes, like personal VAD. Note that this implementation requires running the big speaker verification model at runtime, so it is an expensive solution.
0:07:58 Second, score conditioned training (ST). Here we don't use the standard VAD model, but we still use the speaker verification model. We concatenate the speaker verification score with the acoustic features, and train a new personal VAD model on top of the concatenated features. This is still very expensive, because we need to run the speaker verification model at runtime.
0:08:23 Third, embedding conditioning (ET). This is the implementation that we really want to use for on-device ASR. It directly concatenates the target speaker embedding with the acoustic features, and we train a new personal VAD model on the concatenated features. So the personal VAD model is the only model that we need at runtime.
0:08:44 And finally, score and embedding conditioning (SET). It concatenates both the speaker verification score and the speaker embedding with the acoustic features, so it uses the most information from the speaker verification system and is supposed to be the most powerful. But since it also requires running speaker verification at runtime, it's still not ideal for on-device ASR.
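Here is a minimal sketch of the ET (embedding conditioning) idea in PyTorch: the stored d-vector is concatenated with every acoustic frame, and a small LSTM classifies each frame into the three classes. The layer sizes are assumptions, chosen only to land near the ~0.13M-parameter budget mentioned later in the talk.

```python
# Sketch of embedding conditioning (ET): d-vector + acoustic features -> 3 classes.
import torch
import torch.nn as nn

class PersonalVAD(nn.Module):
    def __init__(self, feat_dim=40, emb_dim=256, hidden=64, num_classes=3):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + emb_dim, hidden,
                            num_layers=2, batch_first=True)
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, features, d_vector):
        # features: (batch, time, feat_dim); d_vector: (batch, emb_dim)
        cond = d_vector.unsqueeze(1).expand(-1, features.size(1), -1)
        x = torch.cat([features, cond], dim=-1)   # condition every frame
        out, _ = self.lstm(x)
        return self.fc(out)                       # per-frame logits

model = PersonalVAD()
logits = model(torch.randn(2, 50, 40), torch.randn(2, 256))  # -> (2, 50, 3)
```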
0:09:08 OK, we have talked about architectures; let's talk about the loss functions. VAD is a classification problem, so standard VAD uses a binary cross entropy loss. Personal VAD has three classes, so naturally we can use a ternary cross entropy loss.
0:09:25 But can we do better than cross entropy? If you think about the actual use case, both non-speech and non-target speaker speech will be discarded before ASR. So a prediction error between non-speech and non-target speaker speech is actually not a big deal. We encode this knowledge into our loss function, and propose the weighted pairwise loss. It is similar to cross entropy, but we use a different weight for each pair of classes: for example, we use a smaller weight of 0.1 between the classes non-speech and non-target speaker speech, and a larger weight of 1 between the other pairs.
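Below is one way to implement such a weighted pairwise loss, assuming the form L = Σ_{k≠y} w(y,k)·log(1+e^{z_k−z_y}) for ground-truth class y and per-frame logits z; treat the exact formula as an assumption and check it against the paper.

```python
# Sketch of a weighted pairwise loss with a 0.1 weight between
# non-speech (class 0) and non-target speaker speech (class 2).
import torch
import torch.nn.functional as F

W = torch.tensor([[0.0, 1.0, 0.1],   # rows/cols: <ns>, <tss>, <ntss>
                  [1.0, 0.0, 1.0],
                  [0.1, 1.0, 0.0]])

def weighted_pairwise_loss(logits, labels):
    """logits: (num_frames, 3) float; labels: (num_frames,) int64."""
    z_y = logits.gather(1, labels.unsqueeze(1))   # logit of the true class
    pairwise = F.softplus(logits - z_y)           # log(1 + e^{z_k - z_y})
    weights = W.to(logits.device)[labels]         # w(y, k) for each frame
    return (weights * pairwise).sum(dim=1).mean()

loss = weighted_pairwise_loss(torch.randn(8, 3), torch.randint(0, 3, (8,)))
```

Note that the diagonal of W is zero, so the k = y term contributes nothing, and the 0.1 entries make confusions between non-speech and non-target speaker speech much cheaper than any error involving the target speaker class.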
0:10:11 Next, I will talk about the experimental setup. An ideal dataset for training and evaluating personal VAD should have these features: it should include realistic and natural speaker turns; it should cover diverse acoustic conditions; it should have frame-level speaker labels; and finally, it should have enrollment utterances for each target speaker.
0:10:33 Unfortunately, we cannot find a dataset that satisfies all these requirements, so we actually made an artificial dataset based on the well-known LibriSpeech dataset.
0:10:45 Remember that we need frame-level speaker labels. For each LibriSpeech utterance, we have the speaker label, and we also have the ground truth transcript. So we use a pretrained ASR model to force-align the ground truth transcript with the audio, to get the timing of each word. With this timing information, we get the frame-level speaker labels.
0:11:08 To have conversational speech, we concatenate utterances from different speakers. We also use a room simulator to add reverberant noise to the concatenated utterances. This avoids domain overfitting, and also mitigates the concatenation artifacts.
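A hedged sketch of this data construction: concatenate utterances from different speakers and derive frame-level labels from the force-aligned word timings. The 16 kHz sample rate, 10 ms frame step, and input structures are assumptions for illustration.

```python
# Sketch: build an artificial conversation with frame-level speaker labels.
import numpy as np

SAMPLE_RATE = 16000    # assumption
FRAMES_PER_SEC = 100   # 10 ms frames (assumption)

def build_example(utterances, speaker_ids, target_speaker):
    """utterances: list of (waveform, word_timings); word_timings: [(start_s, end_s)]."""
    audio, labels = [], []
    for (wav, timings), spk in zip(utterances, speaker_ids):
        num_frames = int(len(wav) / SAMPLE_RATE * FRAMES_PER_SEC)
        frame_labels = np.zeros(num_frames, dtype=np.int64)  # 0 = non-speech
        cls = 1 if spk == target_speaker else 2              # tss vs. ntss
        for start_s, end_s in timings:                       # aligned words
            frame_labels[int(start_s * FRAMES_PER_SEC):
                         int(end_s * FRAMES_PER_SEC)] = cls
        audio.append(wav)
        labels.append(frame_labels)
    # Reverberation/noise from a room simulator would be added after this.
    return np.concatenate(audio), np.concatenate(labels)
```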
0:11:27 Here is the model configuration. Both the standard VAD and the personal VAD models consist of two LSTM layers and one fully connected layer; the model has 0.13 million parameters in total. The speaker verification model has three LSTM layers with projection, and one fully connected layer. This model is pretty big, with about five million parameters.
0:11:52 For evaluation, because this is a classification problem, we use average precision. We look at the average precision for each class, and also the mean average precision. We also look at the metrics both with and without added reverberant noise.
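The evaluation can be sketched with scikit-learn's average_precision_score: compute a one-vs-rest average precision per class, then take their mean for mAP. The random scores and labels below are placeholders.

```python
# Sketch: per-class average precision and mean average precision (mAP).
import numpy as np
from sklearn.metrics import average_precision_score

num_frames, num_classes = 1000, 3
scores = np.random.rand(num_frames, num_classes)        # per-frame posteriors
labels = np.random.randint(0, num_classes, num_frames)  # ground-truth classes

aps = [average_precision_score(labels == k, scores[:, k])
       for k in range(num_classes)]
print("per-class AP:", aps, "mAP:", float(np.mean(aps)))
```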
0:12:08 Next, the results and conclusions.
0:12:12 First, we compare the different architectures. Remember that SC is the baseline that directly combines standard VAD and speaker verification. We find that all the other personal VAD models are better than this baseline. Among the proposed models, SET, the one that uses both the speaker verification score and the speaker embedding, is the best. This is kind of expected, because it uses the most speaker information.
0:12:42 ET is the personal VAD model that only uses the speaker embedding, and it is the ideal one for on-device ASR. We note that ET is slightly worse than SET, but the difference is small: it is near-optimal, yet it uses only 2.6 percent of the parameters at runtime.
0:12:59 We also compare the conventional cross entropy loss and the proposed weighted pairwise loss. We found that the weighted pairwise loss is consistently better than cross entropy, and that the optimal weight between non-speech and non-target speaker speech is 0.1.
0:13:17 Finally, since the ultimate goal of personal VAD is to replace the standard VAD, we compare the two on standard VAD tasks. In some cases personal VAD is slightly worse, but the differences are very small.
0:13:33 So, the conclusions of this paper. The proposed personal VAD architectures outperform the baseline of directly combining VAD and speaker verification. Among the proposed architectures, SET has the best performance, but ET is the ideal one for on-device ASR, with near-optimal performance. We also propose the weighted pairwise loss, which outperforms the cross entropy loss. Finally, personal VAD performs almost equally well as standard VAD on standard VAD tasks.
0:14:07 I will also briefly talk about future work directions. Currently, the personal VAD model is trained and evaluated on artificial conversations. We hope to eventually use realistic conversational speech; this will require lots of data collection and labeling effort.
0:14:25 Besides, personal VAD can be useful for speaker diarization, especially when there is overlapping speech in the conversation. And the good news is that people are already doing this: researchers from Russia proposed a system known as target-speaker VAD, which is similar to personal VAD, and successfully used it for speaker diarization. If you like our paper, I would recommend you read their paper as well.
0:14:51 If you have any questions, please leave them as a comment on this talk's page, or refer to our paper. Thank you.